Parsing is the process of analyzing a string or a sequence of characters to extract meaningful information or transform it into a structured format. While regular expressions (regex) are a common tool for parsing, they come with significant limitations, particularly in terms of readability, maintainability, and safety.
In this lesson, we’ll explore Parsec, a powerful library for parsing in Haskell. We’ll start by understanding why Parsec is superior to regular expressions and then move on to practical examples of building parsers.
Regular expressions are widely used for string parsing due to their compact syntax and ability to match patterns. However, they have several limitations:
Example:
import re
pattern = r"(d{4})-(d{2})-(d{2})"
match = re.match(pattern, "2024-12-29")
if match:
year, month, day = match.groups()
else:
print("Invalid date")
If the input string doesn’t match the pattern, this code will fail without meaningful feedback on why the match failed.
In contrast, Parsec provides compile-time guarantees by working with typed parsers, ensuring that the input conforms to the expected structure.
Example:
pattern = r"^([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,})$"
What does this pattern do? While it matches an email address, understanding or modifying it requires deep knowledge of regex syntax.
The regular expression above handles the simplest of email patterns, but as laid out in RFC5322, email patterns can be very complex. The following regular expression is an example of a regular expression used for email address format validation in RFC822:
(?:(?:rn)?[ t])*(?:(?:(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*|(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*(?:,@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*)*:(?:(?:rn)?[ t])*)?(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*>(?:(?:rn)?[ t])*)|(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)*:(?:(?:rn)?[ t])*(?:(?:(?:[^()<>@,;:\".[]000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*|(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*(?:,@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*)*:(?:(?:rn)?[ t])*)?(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*>(?:(?:rn)?[ t])*)(?:,s*(?:(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*|(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*(?:,@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*)*:(?:(?:rn)?[ t])*)?(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|"(?:[^"r\]|\.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:\".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:\".[]]))|[([^[]r\]|\.)*](?:(?:rn)?[ t])*))*>(?:(?:rn)?[ t])*))*)?;s*)
You can see how this becomes wildly complicated and prone to all kinds of logic errors and typos. Not to mention that standards change every couple of years. While there may be popular libraries like email-validator to handle this (assuming the developer who wrote the library also didn't make any mistakes), imagine trying to parse complex data that doesn't have a common library. The issue persists.
Parsec, on the other hand, allows you to build parsers in a compositional and modular way, where each part of the parser has a clear, readable purpose.
Here is an example of the same parser but done with a Parser Combinator instead:
import Text.Parsec
import Text.Parsec.String (Parser)
import Control.Monad (void)
import Data.Char (isAlphaNum, isAscii)
-- Helper: Parse a specific character or fail
charP :: Char -> Parser Char
charP c = char c <?> [c]
-- Helper: Parse a range of allowed characters
satisfyP :: (Char -> Bool) -> Parser Char
satisfyP f = satisfy f <?> "allowed character"
-- Helper: Parse whitespace
whitespace :: Parser ()
whitespace = void $ many (oneOf " trn")
-- 1. Local Part Parsers (Before @)
localPart :: Parser String
localPart = try quotedString <|> unquotedLocalPart
quotedString :: Parser String
quotedString = do
charP '"' -- Opening quote
content <- many (escapedChar <|> satisfyP (/= '"'))
charP '"' -- Closing quote
return content
unquotedLocalPart :: Parser String
unquotedLocalPart = many1 (satisfyP isValidLocalChar)
isValidLocalChar :: Char -> Bool
isValidLocalChar c = isAlphaNum c || c `elem` "!#$%&'*+-/=?^_`{|}~."
escapedChar :: Parser Char
escapedChar = charP '\' >> anyChar
-- 2. Domain Parsers (After @)
domain :: Parser String
domain = try domainLiteral <|> domainName
domainLiteral :: Parser String
domainLiteral = do
charP '['
content <- many (escapedChar <|> satisfyP (/= ']'))
charP ']'
return content
domainName :: Parser String
domainName = do
first <- many1 (satisfyP isAlphaNum)
rest <- many (charP '.' >> many1 (satisfyP isAlphaNum))
return $ first ++ concatMap ('.' :) rest
-- 3. Full Email Parser
emailParser :: Parser (String, String)
emailParser = do
local <- localPart
charP '@'
domain <- domain
return (local, domain)
-- Example Usage
main :: IO ()
main = do
let testEmails =
[ "user@example.com"
, ""quoted@name"@example.com"
, "user@[192.168.1.1]"
, "complex.local-part+extra@sub.domain.com"
]
mapM_ (printResult . parse emailParser "") testEmails
printResult :: Either ParseError (String, String) -> IO ()
printResult (Left err) = putStrLn $ "Parse error: " ++ show err
printResult (Right email) = putStrLn $ "Parsed: " ++ show email
As you can see, it's much easier to read and to maintain. It also allows us to specify types for everything and parse into custom types, which we will explore in more detail later in this chapter.
Example:
pattern = r"(a+)+"
re.match(pattern, "aaaaaaaa")
# Depending on the engine, this can cause excessive backtracking.
Parsec parsers are deterministic by design. Each parser consumes the input step-by-step, ensuring there’s no ambiguity or unnecessary backtracking.
The runtime performance of a regular expression can be faster than a parser combinator in simple scenarios, but there are many scenarios where a regular expression cannot handle parsing in a performant manner, and often times it will time-out. Regular expressions are not good at handling nested data structures.
We want to parse and evaluate nested mathematical expressions such as:
"2 + (3 * 4)"
, "(1 + 2) * (3 - 4)"
, "10 / (5 + 5)"
"2 + (3 * 4"
, "2 + 3)"
Regex Implementation Regex struggles with parsing nested structures. While you can write a regex pattern to match numbers and simple operators, handling parentheses correctly is nearly impossible.
Example regex for basic math expressions:
import re
# Regex for simple math expressions
pattern = r"^d+(s*[+-*/]s*d+)*$"
# Match without considering nesting
def is_valid_math(input):
return bool(re.match(pattern, input))
print(is_valid_math("2 + (3 * 4)")) # Fails
print(is_valid_math("2 + 3 * 4")) # Passes
This regex:
"2 + (3 * 4)"
.
Parsec Implementation With Parsec, we can elegantly handle nested structures.
import Text.Parsec import Text.Parsec.String (Parser)
-- Define a parser for mathematical expressions
data Expr
= Num Int
| Add Expr Expr
| Sub Expr Expr
| Mul Expr Expr
| Div Expr Expr
deriving Show
-- Parse an integer
parseNumber :: Parser Expr
parseNumber = do
num <- many1 digit
return $ Num (read num)
-- Parse parentheses
parseParens :: Parser Expr
parseParens = do
char '('
expr <- parseExpr char ')'
return expr
-- Parse operators and combine expressions
parseTerm :: Parser (Expr -> Expr -> Expr)
parseTerm = (char '+' >> return Add)
<|> (char '-' >> return Sub)
<|> (char '*' >> return Mul)
<|> (char '/' >> return Div)
-- Parse a full expression
parseExpr :: Parser Expr
parseExpr = do
left <- parseNumber <|> parseParens
rest <- many (do
op <- parseTerm
right <- parseNumber <|> parseParens
return (op right))
return $ foldl (acc f -> f acc) left rest
-- Test the parser
main :: IO ()
main = do
let inputs = ["2 + (3 * 4)", "(1 + 2) * (3 - 4)", "10 / (5 + 5)", "2 + (3 * 4"]
mapM_ (printResult . parse parseExpr "") inputs
printResult :: Either ParseError Expr -> IO ()
printResult (Left err) = putStrLn $ "Invalid: " ++ show err
printResult (Right expr) = putStrLn $ "Valid: " ++ show expr
Benchmark Comparison Parsing 1,000,000 mathematical expressions:
Implementation | Input | Runtime |
---|---|---|
Regex | "2 + (3 * 4)" | Timeout |
Parsec | "2 + (3 * 4)" | ~20ms |
Parsec operates with typed parsers that clearly define what they parse and what they return. For example:
Int
.
Date
type.
This type safety prevents common errors in parsing and makes your code more robust.
Example:
import Text.Parsec
import Text.Parsec.String (Parser)
integer :: Parser Int
integer = read <$> many1 digit
Here:
integer
parser guarantees that it will return an Int
.
Parsec parsers are modular and composable, meaning you can combine small parsers into larger ones to handle complex patterns.
Example:
import Text.Parsec
import Text.Parsec.String (Parser)
dateParser :: Parser (Int, Int, Int)
dateParser = do
year <- integer
_ <- char '-'
month <- integer
_ <- char '-'
day <- integer
return (year, month, day)
This parser for dates is:
integer
and char
parsers are reused to build the dateParser
.
Regular expressions provide minimal feedback when a match fails. Parsec parsers, on the other hand, produce detailed error messages, including:
Example:
import Text.Parsec
import Text.Parsec.String (Parser)
dateParser :: Parser (Int, Int, Int)
dateParser = do
year <- integer
_ <- char '-'
month <- integer
_ <- char '-'
day <- integer
return (year, month, day)
main :: IO ()
main = parseTest dateParser "2024-12"
Output:
parse error at (line 1, column 8): unexpected end of input expecting digit
This detailed feedback makes debugging easier and ensures your parsers are robust against unexpected inputs.
Parsec ensures that parsers are deterministic:
Example: Parsing a choice of patterns:
import Text.Parsec
import Text.Parsec.String (Parser)
parser :: Parser String
parser = try (string "hello") <|> string "help"
main :: IO ()
main = parseTest parser "help"
Here, the try
function ensures that Parsec can test the first parser ("hello"
) without consuming input prematurely. This makes parsing unambiguous and efficient.
Feature | Regex | Parsec |
---|---|---|
Type Safety | None (operates on raw strings) | Strong (works with typed parsers) |
Readability | Cryptic and hard to modify | Modular and composable |
Error Reporting | Minimal | Detailed and informative |
Determinism | Non-deterministic with backtracking | Deterministic and predictable |
This table combines syntax and types for commonly used functions in Parsec, along with descriptions and examples for quick reference.
Function | Type | Description | Example Usage |
---|---|---|---|
parse | Parser a -> SourceName -> String -> Either ParseError a | Runs a parser on input and returns the result or an error. | parse digit "" "123" |
try | Parser a -> Parser a | Allows backtracking; tries the parser without consuming input on failure. | try (string "hello") <|> string "help" |
choice | [Parser a] -> Parser a | Tries a list of parsers in sequence until one succeeds. | choice [string "a", string "b", string "c"] |
<|> | Parser a -> Parser a -> Parser a | Choice combinator: Tries the first parser, and if it fails, tries the second parser. | string "yes" <|> string "no" |
many | Parser a -> Parser [a] | Matches zero or more occurrences of a parser. | many digit |
many1 | Parser a -> Parser [a] | Matches one or more occurrences of a parser. | many1 letter |
optional | Parser a -> Parser () | Tries a parser and succeeds with no value if the parser fails. | optional (char '-') |
sepBy | Parser a -> Parser sep -> Parser [a] | Parses zero or more occurrences of a parser separated by another parser. | digit `sepBy` char ',' |
sepBy1 | Parser a -> Parser sep -> Parser [a] | Parses one or more occurrences of a parser separated by another parser. | digit `sepBy1` char ',' |
endBy | Parser a -> Parser sep -> Parser [a] | Parses zero or more occurrences of a parser, each followed by a separator. | digit `endBy` char ';' |
endBy1 | Parser a -> Parser sep -> Parser [a] | Parses one or more occurrences of a parser, each followed by a separator. | digit `endBy1` char ';' |
between | Parser open -> Parser close -> Parser a -> Parser a | Parses content between two other parsers. | between (char '(') (char ')') digit |
chainl1 | Parser a -> Parser (a -> a -> a) -> a | Parses left-associative chains (e.g., addition or subtraction). | digit `chainl1` (char '+' >> return (+)) |
chainr1 | Parser a -> Parser (a -> a -> a) -> a | Parses right-associative chains (e.g., exponentiation). | digit `chainr1` (char '^' >> return (^)) |
lookAhead | Parser a -> Parser a | Peeks at input without consuming it. | lookAhead (string "test") |
notFollowedBy | Parser a -> Parser () | Ensures a parser does not match at a given position. | notFollowedBy (char 'x') |
eof | Parser () | Ensures the end of input has been reached. | digit >> eof |
anyChar | Parser Char | Matches any single character. | anyChar |
noneOf | [Char] -> Parser Char | Matches any character not in the given list. | noneOf "aeiou" |
oneOf | [Char] -> Parser Char | Matches any character in a given list. | oneOf "aeiou" |
char | Char -> Parser Char | Matches a specific character. | char 'a' |
string | String -> Parser String | Matches a specific string. | string "hello" |
digit | Parser Char | Matches a single digit (0-9 ). | digit |
letter | Parser Char | Matches a single alphabetic character (a-z , A-Z ). | letter |
spaces | Parser () | Matches zero or more whitespace characters. | spaces |
<?> | Parser a -> String -> Parser a | Adds a label to a parser for better error messages. | digit <?> "a digit" |
This cheat sheet provides a comprehensive overview of the most useful Parsec functions, making it easier to reference and use them in your parsing projects.
More Parsec functions can be explored via the Hackage package explorer: https://hackage.haskell.org/package/parsec
As we explore Parsec further, let’s build parsers that showcase the composability and power of parser combinators for real-world scenarios. These examples will illustrate how Parsec can handle structured data, nested inputs, and error reporting.
Key-value pairs are a common structure in configuration files or query strings (e.g., key=value
).
import Text.Parsec
import Text.Parsec.String (Parser)
-- Parse a key-value pair
keyValuePair :: Parser (String, String)
keyValuePair = do
key <- many1 letter
char '='
value <- many1 (noneOf "n")
return (key, value)
-- Parse multiple key-value pairs separated by newlines
keyValuePairs :: Parser [(String, String)]
keyValuePairs = keyValuePair `sepBy` char 'n'
main :: IO ()
main = do
let input = "username=adminnpassword=1234ntimeout=30"
case parse keyValuePairs "" input of
Left err -> putStrLn $ "Parse error: " ++ show err
Right pairs -> print pairs
Explanation:
=
and then a value (anything except a newline).
keyValuePair
parsers using sepBy
, which ensures pairs are separated by a newline.
Output:
[("username", "admin"), ("password", "1234"), ("timeout", "30")]
Let’s parse a simplified JSON structure, focusing on key-value pairs with nested objects.
import Text.Parsec
import Text.Parsec.String (Parser)
-- A parser for keys
key :: Parser String
key = between (char '"') (char '"') (many1 letter)
-- A parser for string values
stringValue :: Parser String
stringValue = between (char '"') (char '"') (many (noneOf """))
-- A parser for key-value pairs
keyValue :: Parser (String, String)
keyValue = do
k <- key
char ':'
v <- stringValue
return (k, v)
-- A parser for objects (curly-brace enclosed key-value pairs)
jsonObject :: Parser [(String, String)]
jsonObject = between (char '{') (char '}') (keyValue `sepBy` char ',')
main :: IO ()
main = do
let input = "{"name":"John", "age":"30", "city":"New York"}"
case parse jsonObject "" input of
Left err -> putStrLn $ "Parse error: " ++ show err
Right obj -> print obj
Explanation:
between
to match keys enclosed in double quotes.
Output:
[("name", "John"), ("age", "30"), ("city", "New York")]
If the input is malformed (e.g., {"name":John}
), Parsec produces:
Parse error at (line 1, column 8): unexpected "J" expecting """
This error clearly indicates where the issue lies (column 8
) and what the parser expected ("""
).
.ini
files or .yaml
files.
One of the strengths of Parsec is its ability to provide detailed and meaningful error messages. This is essential for debugging complex parsers or giving users helpful feedback when input is invalid.
Parsec allows you to label parts of the parser using the <?>
operator. This makes error messages more descriptive when parsing fails.
import Text.Parsec
import Text.Parsec.String (Parser)
-- A parser for integers
integer :: Parser Int
integer = read <$> many1 digit <?> "an integer"
-- A parser for comma-separated integers
commaSeparatedIntegers :: Parser [Int]
commaSeparatedIntegers = integer `sepBy` (char ',' <?> "a comma")
main :: IO ()
main = do
let input = "123,456,abc,789"
case parse commaSeparatedIntegers "" input of
Left err -> putStrLn $ "Parse error: " ++ show err
Right nums -> print nums
Explanation:
<?>
operator adds a label to a parser. For example, integer <?> "an integer"
makes it clear what kind of input the parser expects.
Output for Invalid Input:
Parse error at (line 1, column 9): unexpected "a" expecting an integer
try
to Handle AmbiguitiesIn cases where multiple parsers overlap, you can use the try
combinator to backtrack and test another parser without consuming input prematurely.
import Text.Parsec
import Text.Parsec.String (Parser)
-- A parser for "hello" or "help"
greeting :: Parser String
greeting = try (string "hello") <|> string "help"
main :: IO ()
main = do
let input = "help"
case parse greeting "" input of
Left err -> putStrLn $ "Parse error: " ++ show err
Right result -> putStrLn $ "Parsed: " ++ result
Explanation:
try
, the string "hello"
parser would consume part of the input, causing the string "help"
parser to fail.
try
combinator ensures that input isn’t consumed if the parser fails, allowing subsequent parsers to run.
You can combine parsers and provide detailed error messages for complex input.
-- A parser for a custom command syntax
command :: Parser (String, String)
command = do
cmd <- (string "run" <|> string "stop") <?> "a command ('run' or 'stop')"
space
arg <- many1 letter <?> "an argument"
return (cmd, arg)
main :: IO ()
main = do
let input = "run123"
case parse command "" input of
Left err -> putStrLn $ "Parse error: " ++ show err
Right result -> print result
Output for Invalid Input:
Parse error at (line 1, column 4): unexpected "1" expecting a space
Let’s build a parser for a small domain-specific language (DSL), like a configuration file.
Input:
server {
host = "localhost"
port = 8080
}
Parser:
import Text.Parsec
import Text.Parsec.String (Parser)
-- A parser for key-value pairs
keyValue :: Parser (String, String)
keyValue = do
key <- many1 letter
spaces
char '='
spaces
value <- many1 (noneOf "n")
return (key, value)
-- A parser for a configuration block
configBlock :: Parser [(String, String)]
configBlock = do
string "server" <?> "a 'server' block"
spaces
char '{'
spaces
pairs <- keyValue `sepBy` spaces
spaces
char '}'
return pairs
main :: IO ()
main = do
let input = "server { host = "localhost" port = 8080 }"
case parse configBlock "" input of
Left err -> putStrLn $ "Parse error: " ++ show err
Right config -> print config
Output:
[("host", ""localhost""), ("port", "8080")]
Feature | Benefit |
---|---|
Labels (<?> ) | Makes error messages descriptive. |
try Combinator | Handles overlapping parsers without consuming input prematurely. |
Combinators and Modularity | Enables robust error handling even in complex parsing scenarios. |
One of the most powerful principles in Haskell is Parse, Don’t Validate, a concept popularized by Alexis King in their blog post of the same name. This principle emphasizes transforming unstructured input data (like user input, JSON, or text files) into structured and valid types as early as possible. By doing so, you eliminate invalid states and ensure that your program works with only well-formed data.
In many programming paradigms, data validation happens at runtime, typically in functions or methods that check if input is valid before processing it. This leaves room for errors:
Instead of validating data everywhere, parsing transforms raw input into valid types upfront. By using Haskell’s type system, you can encode constraints at the type level, guaranteeing that only valid data is ever represented by a given type.
For example, rather than validating a string as an email address in multiple places, you can parse it once into an Email
type. If the parsing succeeds, you’re guaranteed the value is valid.
The smart constructor pattern is a common way to implement this strategy. A smart constructor is a function that ensures only valid values can be constructed for a type.
module Email (Email, mkEmail) where
import Text.Parsec
import Text.Parsec.String (Parser)
-- Define the Email type (constructor is hidden from the outside)
newtype Email = Email String deriving (Show, Eq)
-- Smart constructor for Email
mkEmail :: String -> Either String Email
mkEmail input =
case parse emailParser "" input of
Left _ -> Left "Invalid email address"
Right email -> Right (Email email)
-- Parsec parser for email validation
emailParser :: Parser String
emailParser = do
local <- many1 (letter <|> digit <|> oneOf "!#$%&'*+/=?^_`{|}~-")
char '@'
domain <- many1 (letter <|> digit <|> char '.')
return (local ++ "@" ++ domain)
Key Points:
Email
type constructor is hidden, so it can only be created using the mkEmail
smart constructor.
mkEmail
function validates the input once, ensuring only valid email addresses are represented as Email
.
Email
type are guaranteed to operate on valid email addresses, thanks to the type system.
To dive deeper into this concept, read Alexis King's blog post, "Parse, Don’t Validate". This blog post explores the philosophy behind this approach, its advantages, and practical examples. It’s an essential read for any Haskell developer looking to write safer and more reliable code.
There is a new field of study called LangSec (short for Language-Theoretic Security) which discusses the best practices for writing code that is secure from the perspective of language theory. It is based on type theory and highlights many of the antipatterns currently used in the industry and how they create unsafe code. Familiarizing yourself with this topic is critical for your success in becoming a lead developer. Read The Seven Turrets of Babel: A Taxonomy of LangSec Errors and How to Expunge Them and then familiarize yourself with their other publications on their website.
If you didn't read this above, familiarize yourself with the concept of tainted paths with this analysis of shotgun parsing in Android applications done by the LangSec team.
Do not overlook the importance of these articles. Too much of the industry is guilty of not following these rules and they end up paying for it later by having too many edge cases, leading to runtime failures and tons of money wasted on writing complex and comprehensive unit testing suites.
Understanding and practicing these techniques are the keys to success in writing software that doesn't fail at runtime, ultimately making you lead developer material.
By adopting the Parse, Don’t Validate principle, you leverage Haskell’s type system to eliminate invalid states at compile time, reducing runtime errors and making your codebase more robust.
In this chapter, we explored the power of Parsec for string parsing and how it stands out compared to traditional regular expressions. Here’s a summary of the key points covered:
try
, choice
, many
, sepBy
, chainl1
, and more.
<?>
operator and the try
combinator.
Try implementing the Parse, Don’t Validate principle for other real-world types:
Create a NonEmptyString
type that ensures a string is never empty.
mkNonEmptyString :: String -> Either String NonEmptyString
mkNonEmptyString str =
-- Your implementation here
Design a PositiveInt
type that guarantees only positive integers.
mkPositiveInt :: String -> Either String PositiveInt
mkPositiveInt str =
-- Your implementation here
Implement a PhoneNumber
type for validating phone numbers.
mkPhoneNumber :: String -> Either String PhoneNumber
mkPhoneNumber str =
-- Your implementation here