Chapter 5: String Parsing

  1. Introduction
  2. Parser Combinators vs. Regular Expressions
  3. Why Choose Parsec?
  4. Parsec Syntax Cheat Sheet
  5. Building More Complex Parsers
  6. Advanced Error Handling in Parsec
  7. Parse, Don't Validate
  8. Recap & Exercises

1. Introduction

Parsing is the process of analyzing a string or a sequence of characters to extract meaningful information or transform it into a structured format. While regular expressions (regex) are a common tool for parsing, they come with significant limitations, particularly in terms of readability, maintainability, and safety.

In this lesson, we’ll explore Parsec, a powerful library for parsing in Haskell. We’ll start by understanding why Parsec is superior to regular expressions and then move on to practical examples of building parsers.

2. Parser Combinators vs. Regular Expressions

Regular expressions are widely used for string parsing due to their compact syntax and ability to match patterns. However, they have several limitations:

Lack of Type Safety

Example:

import re

pattern = r"(d{4})-(d{2})-(d{2})"
match = re.match(pattern, "2024-12-29")

if match:
    year, month, day = match.groups()
else:
    print("Invalid date")

If the input string doesn’t match the pattern, this code will fail without meaningful feedback on why the match failed.

In contrast, Parsec provides compile-time guarantees by working with typed parsers, ensuring that the input conforms to the expected structure.

Poor Readability and Maintainability

Example:

pattern = r"^([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,})$"

What does this pattern do? While it matches an email address, understanding or modifying it requires deep knowledge of regex syntax.

The regular expression above handles the simplest of email patterns, but as laid out in RFC5322, email patterns can be very complex. The following regular expression is an example of a regular expression used for email address format validation in RFC822:

(?:(?:rn)?[ t])*(?:(?:(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*|(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*(?:,@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*)*:(?:(?:rn)?[ t])*)?(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*>(?:(?:rn)?[ t])*)|(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)*:(?:(?:rn)?[ t])*(?:(?:(?:[^()<>@,;:\".[]000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*|(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*(?:,@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*)*:(?:(?:rn)?[ t])*)?(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*>(?:(?:rn)?[ t])*)(?:,s*(?:(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*|(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*(?:,@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*)*:(?:(?:rn)?[ t])*)?(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*>(?:(?:rn)?[ t])*))*)?;s*)

You can see how this becomes wildly complicated and prone to all kinds of logic errors and typos. Not to mention that standards change every couple of years. While there may be popular libraries like email-validator to handle this (assuming the developer who wrote the library also didn't make any mistakes), imagine trying to parse complex data that doesn't have a common library. The issue persists.

Parsec, on the other hand, allows you to build parsers in a compositional and modular way, where each part of the parser has a clear, readable purpose.

Here is an example of the same parser but done with a Parser Combinator instead:

import Text.Parsec
import Text.Parsec.String (Parser)
import Control.Monad (void)
import Data.Char (isAlphaNum, isAscii)

-- Helper: Parse a specific character or fail
charP :: Char -> Parser Char
charP c = char c <?> [c]

-- Helper: Parse a range of allowed characters
satisfyP :: (Char -> Bool) -> Parser Char
satisfyP f = satisfy f <?> "allowed character"

-- Helper: Parse whitespace
whitespace :: Parser ()
whitespace = void $ many (oneOf " trn")

-- 1. Local Part Parsers (Before @)
localPart :: Parser String
localPart = try quotedString <|> unquotedLocalPart

quotedString :: Parser String
quotedString = do
    charP '"' -- Opening quote
    content <- many (escapedChar <|> satisfyP (/= '"'))
    charP '"' -- Closing quote
    return content

unquotedLocalPart :: Parser String
unquotedLocalPart = many1 (satisfyP isValidLocalChar)

isValidLocalChar :: Char -> Bool
isValidLocalChar c = isAlphaNum c || c `elem` "!#$%&'*+-/=?^_`{|}~."

escapedChar :: Parser Char
escapedChar = charP '\' >> anyChar

-- 2. Domain Parsers (After @)
domain :: Parser String
domain = try domainLiteral <|> domainName

domainLiteral :: Parser String
domainLiteral = do
    charP '['
    content <- many (escapedChar <|> satisfyP (/= ']'))
    charP ']'
    return content

domainName :: Parser String
domainName = do
    first <- many1 (satisfyP isAlphaNum)
    rest <- many (charP '.' >> many1 (satisfyP isAlphaNum))
    return $ first ++ concatMap ('.' :) rest

-- 3. Full Email Parser
emailParser :: Parser (String, String)
emailParser = do
    local <- localPart
    charP '@'
    domain <- domain
    return (local, domain)

-- Example Usage
main :: IO ()
main = do
    let testEmails =
            [ "user@example.com"
            , ""quoted@name"@example.com"
            , "user@[192.168.1.1]"
            , "complex.local-part+extra@sub.domain.com"
            ]
    mapM_ (printResult . parse emailParser "") testEmails

printResult :: Either ParseError (String, String) -> IO ()
printResult (Left err) = putStrLn $ "Parse error: " ++ show err
printResult (Right email) = putStrLn $ "Parsed: " ++ show email

As you can see, it's much easier to read and to maintain. It also allows us to specify types for everything and parse into custom types, which we will explore in more detail later in this chapter.

Ambiguity and Non-Determinism

Example:

pattern = r"(a+)+"
re.match(pattern, "aaaaaaaa")
# Depending on the engine, this can cause excessive backtracking.

Parsec parsers are deterministic by design. Each parser consumes the input step-by-step, ensuring there’s no ambiguity or unnecessary backtracking.

Performance Considerations

The runtime performance of a regular expression can be faster than a parser combinator in simple scenarios, but there are many scenarios where a regular expression cannot handle parsing in a performant manner, and often times it will time-out. Regular expressions are not good at handling nested data structures.

Example: Parsing Nested Mathematical Expressions

We want to parse and evaluate nested mathematical expressions such as:

Regex Implementation Regex struggles with parsing nested structures. While you can write a regex pattern to match numbers and simple operators, handling parentheses correctly is nearly impossible.

Example regex for basic math expressions:

import re

# Regex for simple math expressions
pattern = r"^d+(s*[+-*/]s*d+)*$"

# Match without considering nesting
def is_valid_math(input):
  return bool(re.match(pattern, input))

print(is_valid_math("2 + (3 * 4)")) # Fails
print(is_valid_math("2 + 3 * 4")) # Passes

This regex:

  1. Cannot handle nested parentheses.
  2. Fails for valid expressions like "2 + (3 * 4)".
  3. Would require impractical workarounds to support nesting, leading to catastrophic backtracking.

Parsec Implementation With Parsec, we can elegantly handle nested structures.

import Text.Parsec import Text.Parsec.String (Parser)

-- Define a parser for mathematical expressions
data Expr
  = Num Int
  | Add Expr Expr
  | Sub Expr Expr
  | Mul Expr Expr
  | Div Expr Expr
  deriving Show

-- Parse an integer
parseNumber :: Parser Expr
parseNumber = do
  num <- many1 digit
  return $ Num (read num)

-- Parse parentheses
parseParens :: Parser Expr
parseParens = do
  char '('
  expr <- parseExpr char ')'
  return expr

-- Parse operators and combine expressions
parseTerm :: Parser (Expr -> Expr -> Expr)
parseTerm = (char '+' >> return Add)
  <|> (char '-' >> return Sub)
  <|> (char '*' >> return Mul)
  <|> (char '/' >> return Div)

-- Parse a full expression
parseExpr :: Parser Expr
parseExpr = do
  left <- parseNumber <|> parseParens
  rest <- many (do
    op <- parseTerm
    right <- parseNumber <|> parseParens
    return (op right))
  return $ foldl (acc f -> f acc) left rest

-- Test the parser
main :: IO ()
main = do
  let inputs = ["2 + (3 * 4)", "(1 + 2) * (3 - 4)", "10 / (5 + 5)", "2 + (3 * 4"]
  mapM_ (printResult . parse parseExpr "") inputs

printResult :: Either ParseError Expr -> IO ()
printResult (Left err) = putStrLn $ "Invalid: " ++ show err
printResult (Right expr) = putStrLn $ "Valid: " ++ show expr

Why Parsec Excels

  1. Handles Nesting:
  1. No Backtracking Issues:
  1. Error Messages:

Benchmark Comparison Parsing 1,000,000 mathematical expressions:

ImplementationInputRuntime
Regex"2 + (3 * 4)"Timeout
Parsec"2 + (3 * 4)"~20ms

3. Why Choose Parsec?

Type Safety

Parsec operates with typed parsers that clearly define what they parse and what they return. For example:

This type safety prevents common errors in parsing and makes your code more robust.

Example:

import Text.Parsec
import Text.Parsec.String (Parser)

integer :: Parser Int
integer = read <$> many1 digit

Here:

Composability

Parsec parsers are modular and composable, meaning you can combine small parsers into larger ones to handle complex patterns.

Example:

import Text.Parsec
import Text.Parsec.String (Parser)

dateParser :: Parser (Int, Int, Int)
dateParser = do
    year <- integer
    _ <- char '-'
    month <- integer
    _ <- char '-'
    day <- integer
    return (year, month, day)

This parser for dates is:

Error Reporting

Regular expressions provide minimal feedback when a match fails. Parsec parsers, on the other hand, produce detailed error messages, including:

Example:

import Text.Parsec
import Text.Parsec.String (Parser)

dateParser :: Parser (Int, Int, Int)
dateParser = do
    year <- integer
    _ <- char '-'
    month <- integer
    _ <- char '-'
    day <- integer
    return (year, month, day)

main :: IO ()
main = parseTest dateParser "2024-12"

Output:

parse error at (line 1, column 8): unexpected end of input expecting digit

This detailed feedback makes debugging easier and ensures your parsers are robust against unexpected inputs.

Deterministic Parsing

Parsec ensures that parsers are deterministic:

Example: Parsing a choice of patterns:

import Text.Parsec
import Text.Parsec.String (Parser)

parser :: Parser String
parser = try (string "hello") <|> string "help"

main :: IO ()
main = parseTest parser "help"

Here, the try function ensures that Parsec can test the first parser ("hello") without consuming input prematurely. This makes parsing unambiguous and efficient.

Summary of Advantages

FeatureRegexParsec
Type SafetyNone (operates on raw strings)Strong (works with typed parsers)
ReadabilityCryptic and hard to modifyModular and composable
Error ReportingMinimalDetailed and informative
DeterminismNon-deterministic with backtrackingDeterministic and predictable

4. Parsec Syntax Cheat Sheet

This table combines syntax and types for commonly used functions in Parsec, along with descriptions and examples for quick reference.

FunctionTypeDescriptionExample Usage
parseParser a -> SourceName -> String -> Either ParseError aRuns a parser on input and returns the result or an error.parse digit "" "123"
tryParser a -> Parser aAllows backtracking; tries the parser without consuming input on failure.try (string "hello") <|> string "help"
choice[Parser a] -> Parser aTries a list of parsers in sequence until one succeeds.choice [string "a", string "b", string "c"]
<|>Parser a -> Parser a -> Parser aChoice combinator: Tries the first parser, and if it fails, tries the second parser.string "yes" <|> string "no"
manyParser a -> Parser [a]Matches zero or more occurrences of a parser.many digit
many1Parser a -> Parser [a]Matches one or more occurrences of a parser.many1 letter
optionalParser a -> Parser ()Tries a parser and succeeds with no value if the parser fails.optional (char '-')
sepByParser a -> Parser sep -> Parser [a]Parses zero or more occurrences of a parser separated by another parser.digit `sepBy` char ','
sepBy1Parser a -> Parser sep -> Parser [a]Parses one or more occurrences of a parser separated by another parser.digit `sepBy1` char ','
endByParser a -> Parser sep -> Parser [a]Parses zero or more occurrences of a parser, each followed by a separator.digit `endBy` char ';'
endBy1Parser a -> Parser sep -> Parser [a]Parses one or more occurrences of a parser, each followed by a separator.digit `endBy1` char ';'
betweenParser open -> Parser close -> Parser a -> Parser aParses content between two other parsers.between (char '(') (char ')') digit
chainl1Parser a -> Parser (a -> a -> a) -> aParses left-associative chains (e.g., addition or subtraction).digit `chainl1` (char '+' >> return (+))
chainr1Parser a -> Parser (a -> a -> a) -> aParses right-associative chains (e.g., exponentiation).digit `chainr1` (char '^' >> return (^))
lookAheadParser a -> Parser aPeeks at input without consuming it.lookAhead (string "test")
notFollowedByParser a -> Parser ()Ensures a parser does not match at a given position.notFollowedBy (char 'x')
eofParser ()Ensures the end of input has been reached.digit >> eof
anyCharParser CharMatches any single character.anyChar
noneOf[Char] -> Parser CharMatches any character not in the given list.noneOf "aeiou"
oneOf[Char] -> Parser CharMatches any character in a given list.oneOf "aeiou"
charChar -> Parser CharMatches a specific character.char 'a'
stringString -> Parser StringMatches a specific string.string "hello"
digitParser CharMatches a single digit (0-9).digit
letterParser CharMatches a single alphabetic character (a-z, A-Z).letter
spacesParser ()Matches zero or more whitespace characters.spaces
<?>Parser a -> String -> Parser aAdds a label to a parser for better error messages.digit <?> "a digit"

This cheat sheet provides a comprehensive overview of the most useful Parsec functions, making it easier to reference and use them in your parsing projects.

More Parsec functions can be explored via the Hackage package explorer: https://hackage.haskell.org/package/parsec

5. Building More Complex Parsers

As we explore Parsec further, let’s build parsers that showcase the composability and power of parser combinators for real-world scenarios. These examples will illustrate how Parsec can handle structured data, nested inputs, and error reporting.

Example 1: Parsing Key-Value Pairs

Key-value pairs are a common structure in configuration files or query strings (e.g., key=value).

import Text.Parsec
import Text.Parsec.String (Parser)

-- Parse a key-value pair
keyValuePair :: Parser (String, String)
keyValuePair = do
    key <- many1 letter
    char '='
    value <- many1 (noneOf "n")
    return (key, value)

-- Parse multiple key-value pairs separated by newlines
keyValuePairs :: Parser [(String, String)]
keyValuePairs = keyValuePair `sepBy` char 'n'

main :: IO ()
main = do
    let input = "username=adminnpassword=1234ntimeout=30"
    case parse keyValuePairs "" input of
        Left err -> putStrLn $ "Parse error: " ++ show err
        Right pairs -> print pairs

Explanation:

  1. keyValuePair:
    • Parses a key (sequence of letters) followed by = and then a value (anything except a newline).
  2. keyValuePairs:
    • Combines multiple keyValuePair parsers using sepBy, which ensures pairs are separated by a newline.

Output:

[("username", "admin"), ("password", "1234"), ("timeout", "30")]

Example 2: Parsing Nested JSON-Like Data

Let’s parse a simplified JSON structure, focusing on key-value pairs with nested objects.

import Text.Parsec
import Text.Parsec.String (Parser)

-- A parser for keys
key :: Parser String
key = between (char '"') (char '"') (many1 letter)

-- A parser for string values
stringValue :: Parser String
stringValue = between (char '"') (char '"') (many (noneOf """))

-- A parser for key-value pairs
keyValue :: Parser (String, String)
keyValue = do
    k <- key
    char ':'
    v <- stringValue
    return (k, v)

-- A parser for objects (curly-brace enclosed key-value pairs)
jsonObject :: Parser [(String, String)]
jsonObject = between (char '{') (char '}') (keyValue `sepBy` char ',')

main :: IO ()
main = do
    let input = "{"name":"John", "age":"30", "city":"New York"}"
    case parse jsonObject "" input of
        Left err -> putStrLn $ "Parse error: " ++ show err
        Right obj -> print obj

Explanation:

  1. Key Parser:
    • Uses between to match keys enclosed in double quotes.
  2. Value Parser:
    • Parses string values similarly.
  3. Object Parser:
    • Parses key-value pairs enclosed in curly braces and separated by commas.

Output:

[("name", "John"), ("age", "30"), ("city", "New York")]

Example: Handling Errors in JSON Parsing

If the input is malformed (e.g., {"name":John}), Parsec produces:

Parse error at (line 1, column 8): unexpected "J" expecting """

This error clearly indicates where the issue lies (column 8) and what the parser expected (""").

Real-World Use Cases for Parsec

  1. Configuration Parsing: Parse complex configurations for applications, such as .ini files or .yaml files.
  2. Language Parsing: Build parsers for custom programming languages, scripts, or DSLs.
  3. Data Transformation: Parse and transform structured data formats like JSON, XML, or CSV.

6. Advanced Error Handling in Parsec

One of the strengths of Parsec is its ability to provide detailed and meaningful error messages. This is essential for debugging complex parsers or giving users helpful feedback when input is invalid.

Improving Error Messages with Labels

Parsec allows you to label parts of the parser using the <?> operator. This makes error messages more descriptive when parsing fails.

import Text.Parsec
import Text.Parsec.String (Parser)

-- A parser for integers
integer :: Parser Int
integer = read <$> many1 digit <?> "an integer"

-- A parser for comma-separated integers
commaSeparatedIntegers :: Parser [Int]
commaSeparatedIntegers = integer `sepBy` (char ',' <?> "a comma")

main :: IO ()
main = do
    let input = "123,456,abc,789"
    case parse commaSeparatedIntegers "" input of
        Left err -> putStrLn $ "Parse error: " ++ show err
        Right nums -> print nums

Explanation:

  1. The <?> operator adds a label to a parser. For example, integer <?> "an integer" makes it clear what kind of input the parser expects.
  2. When parsing fails, the error message includes the label.

Output for Invalid Input:

Parse error at (line 1, column 9): unexpected "a" expecting an integer

Using try to Handle Ambiguities

In cases where multiple parsers overlap, you can use the try combinator to backtrack and test another parser without consuming input prematurely.

import Text.Parsec
import Text.Parsec.String (Parser)

-- A parser for "hello" or "help"
greeting :: Parser String
greeting = try (string "hello") <|> string "help"

main :: IO ()
main = do
    let input = "help"
    case parse greeting "" input of
        Left err -> putStrLn $ "Parse error: " ++ show err
        Right result -> putStrLn $ "Parsed: " ++ result

Explanation:

  1. Without try, the string "hello" parser would consume part of the input, causing the string "help" parser to fail.
  2. The try combinator ensures that input isn’t consumed if the parser fails, allowing subsequent parsers to run.

Combining Parsers for Better Error Reporting

You can combine parsers and provide detailed error messages for complex input.

-- A parser for a custom command syntax
command :: Parser (String, String)
command = do
    cmd <- (string "run" <|> string "stop") <?> "a command ('run' or 'stop')"
    space
    arg <- many1 letter <?> "an argument"
    return (cmd, arg)

main :: IO ()
main = do
    let input = "run123"
    case parse command "" input of
        Left err -> putStrLn $ "Parse error: " ++ show err
        Right result -> print result

Output for Invalid Input:

Parse error at (line 1, column 4): unexpected "1" expecting a space

Benefits of Advanced Error Handling

  1. Clear Feedback:
    • Users can quickly identify what went wrong and where.
    • For developers, debugging is faster and more efficient.
  2. Input Validation:
    • Parsec ensures inputs are validated at every step.
    • Incorrect or incomplete inputs fail early, preventing downstream errors.
  3. Customizable Errors:
    • Using labels and combinators, you can make error messages user-friendly.

Real-World Example: Parsing a Simple DSL

Let’s build a parser for a small domain-specific language (DSL), like a configuration file.

Input:

server {
    host = "localhost"
    port = 8080
}

Parser:

import Text.Parsec
import Text.Parsec.String (Parser)

-- A parser for key-value pairs
keyValue :: Parser (String, String)
keyValue = do
    key <- many1 letter
    spaces
    char '='
    spaces
    value <- many1 (noneOf "n")
    return (key, value)

-- A parser for a configuration block
configBlock :: Parser [(String, String)]
configBlock = do
    string "server" <?> "a 'server' block"
    spaces
    char '{'
    spaces
    pairs <- keyValue `sepBy` spaces
    spaces
    char '}'
    return pairs

main :: IO ()
main = do
    let input = "server { host = "localhost" port = 8080 }"
    case parse configBlock "" input of
        Left err -> putStrLn $ "Parse error: " ++ show err
        Right config -> print config

Output:

[("host", ""localhost""), ("port", "8080")]
FeatureBenefit
Labels (<?>)Makes error messages descriptive.
try CombinatorHandles overlapping parsers without consuming input prematurely.
Combinators and ModularityEnables robust error handling even in complex parsing scenarios.

7. Parse, Don't Validate

One of the most powerful principles in Haskell is Parse, Don’t Validate, a concept popularized by Alexis King in their blog post of the same name. This principle emphasizes transforming unstructured input data (like user input, JSON, or text files) into structured and valid types as early as possible. By doing so, you eliminate invalid states and ensure that your program works with only well-formed data.


The Core Idea

What’s the Problem with Validation?

In many programming paradigms, data validation happens at runtime, typically in functions or methods that check if input is valid before processing it. This leaves room for errors:

2. What Does Parsing Solve?

Instead of validating data everywhere, parsing transforms raw input into valid types upfront. By using Haskell’s type system, you can encode constraints at the type level, guaranteeing that only valid data is ever represented by a given type.

For example, rather than validating a string as an email address in multiple places, you can parse it once into an Email type. If the parsing succeeds, you’re guaranteed the value is valid.


Smart Constructors: A Tool for Parsing

The smart constructor pattern is a common way to implement this strategy. A smart constructor is a function that ensures only valid values can be constructed for a type.

Example: Validating an Email Address

module Email (Email, mkEmail) where

import Text.Parsec
import Text.Parsec.String (Parser)

-- Define the Email type (constructor is hidden from the outside)
newtype Email = Email String deriving (Show, Eq)

-- Smart constructor for Email
mkEmail :: String -> Either String Email
mkEmail input =
    case parse emailParser "" input of
        Left _ -> Left "Invalid email address"
        Right email -> Right (Email email)

-- Parsec parser for email validation
emailParser :: Parser String
emailParser = do
    local <- many1 (letter <|> digit <|> oneOf "!#$%&'*+/=?^_`{|}~-")
    char '@'
    domain <- many1 (letter <|> digit <|> char '.')
    return (local ++ "@" ++ domain)

Key Points:


Benefits of Parsing Over Validation

  1. Invalid States Are Impossible:
    • Once data is parsed into a valid type, you don’t need to check its validity elsewhere in your code.
  2. Centralized Logic:
    • Parsing logic is centralized in the smart constructor, making it easier to maintain and extend.
  3. Type-Safe Guarantees:
    • Functions working with the Email type are guaranteed to operate on valid email addresses, thanks to the type system.
  4. Improved Readability:
    • Your application logic becomes cleaner and more focused, as validation checks are no longer scattered throughout the code.

Suggested Readings

  1. To dive deeper into this concept, read Alexis King's blog post, "Parse, Don’t Validate". This blog post explores the philosophy behind this approach, its advantages, and practical examples. It’s an essential read for any Haskell developer looking to write safer and more reliable code.

  2. There is a new field of study called LangSec (short for Language-Theoretic Security) which discusses the best practices for writing code that is secure from the perspective of language theory. It is based on type theory and highlights many of the antipatterns currently used in the industry and how they create unsafe code. Familiarizing yourself with this topic is critical for your success in becoming a lead developer. Read The Seven Turrets of Babel: A Taxonomy of LangSec Errors and How to Expunge Them and then familiarize yourself with their other publications on their website.

  3. If you didn't read this above, familiarize yourself with the concept of tainted paths with this analysis of shotgun parsing in Android applications done by the LangSec team.

Do not overlook the importance of these articles. Too much of the industry is guilty of not following these rules and they end up paying for it later by having too many edge cases, leading to runtime failures and tons of money wasted on writing complex and comprehensive unit testing suites.

Understanding and practicing these techniques are the keys to success in writing software that doesn't fail at runtime, ultimately making you lead developer material.

By adopting the Parse, Don’t Validate principle, you leverage Haskell’s type system to eliminate invalid states at compile time, reducing runtime errors and making your codebase more robust.

8. Recap & Exercises

Recap

In this chapter, we explored the power of Parsec for string parsing and how it stands out compared to traditional regular expressions. Here’s a summary of the key points covered:

Introduction to Parsing with Parsec

Advantages of Parsec Over Regex

Parsec Syntax Cheat Sheet

Building Complex Parsers

Advanced Error Handling

Parse, Don’t Validate

Exercises

Try implementing the Parse, Don’t Validate principle for other real-world types:

Exercise 1

Create a NonEmptyString type that ensures a string is never empty.

mkNonEmptyString :: String -> Either String NonEmptyString
mkNonEmptyString str =
  -- Your implementation here

Exercise 2

Design a PositiveInt type that guarantees only positive integers.

mkPositiveInt :: String -> Either String PositiveInt
mkPositiveInt str =
  -- Your implementation here

Exercise 3

Implement a PhoneNumber type for validating phone numbers.

mkPhoneNumber :: String -> Either String PhoneNumber
mkPhoneNumber str =
  -- Your implementation here

Previous Chapter | Next Chapter