CSE230 Fa16 - Monadic Parsing

> {-@ LIQUID "--no-termination" @-}
> {-@ LIQUID "--short-names"    @-}
> 
> {-# LANGUAGE LambdaCase #-}
> import Data.Char
> import Data.Functor
> import Control.Monad

exp ::= num
      | exp op exp              { $2 $1 $3 }
      | '(' exp ')'             { $2 }
      | "avg" '[' explist ']'   { Avg $3 }
      | '[' intlist ']'         { Scmp $2 }

exprP :: Parser Expr
P (String -> [(Expr, String)])
-- exprP (P f) str = [(e1, s1), (e2, s2),…]

What is a Parser?

A parser is a piece of software that takes a raw String (or sequence of bytes) and returns some structured object, for example, a list of options, an XML tree or JSON object, a program’s Abstract Syntax Tree and so on. Parsing is one of the most basic computational tasks. Every serious software system has a parser tucked away somewhere inside.

(Indeed I defy you to find any serious system that does not do some parsing somewhere!)

type Parser = String -> StructuredObject

Composing Parsers

The usual way to build a parser is by specifying a grammar and using a parser generator (e.g. yacc, bison, antlr) to create the actual parsing function. While this approach is elegant, one major limitation is its lack of modularity. For example, suppose I have two kinds of primitive values Thingy and Whatsit.

Thingy : rule 	{ action }
;

Whatsit : rule  { action }
;

If we want a parser for sequences of Thingy and Whatsit, we have to painstakingly duplicate the rules as

Thingies : Thingy Thingies  { ... }
         | EmptyThingy      { ... }
;

Whatsits : Whatsit Whatsits { ... }
         | EmptyWhatsit     { ... }
;

This makes sub-parsers hard to reuse. Next, we will see how to compose mini-parsers for sub-values to get bigger parsers for complex values.

To do so, we will generalize the above parser type a little bit, by noting that a (sub-)parser need not (indeed, will not) consume all of its input, and so we can simply have the parser return the unconsumed input

type Parser = String -> (StructuredObject, String)

Of course, it would be silly to have different types for parsers for different kinds of objects, and so we can make it a parameterized type

type Parser a = String -> (a, String)

One last generalization: the parser could return multiple results. For example, we may want to parse the string

"2 - 3 - 4"

either as

Minus (Minus 2 3) 4

or as

Minus 2 (Minus 3 4)

So, we can have our parsers return a list of possible results (where the empty list corresponds to a failure to parse.)

> newtype Parser a = P (String -> [(a, String)])

> doParse (P p) s = p s
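
To see why a list of results is a convenient representation, here is a minimal sketch of a hand-written parser that can succeed in several ways or not at all. The name asP is hypothetical and not part of these notes; it accepts every non-empty prefix made up of 'a' characters.

```haskell
newtype Parser a = P (String -> [(a, String)])

doParse :: Parser a -> String -> [(a, String)]
doParse (P p) s = p s

-- asP (a hypothetical example): one result per non-empty run of leading 'a's.
-- An input without a leading 'a' yields the empty list, i.e. a parse failure.
asP :: Parser String
asP = P (\s -> [ splitAt n s | n <- [1 .. length (takeWhile (== 'a') s)] ])
```

With this definition, doParse asP "aab" has two results, ("a","ab") and ("aa","b"), while doParse asP "b" has none.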

Parse A Single character

QUIZ

newtype Parser a = P (String -> [(a, String)])

Which of the following is a valid single-character-parser that returns the first Char from a string (if one exists.)

-- b
oneChar = P $ \cs -> case cs of
                       []   -> [('', [])]
                       c:cs -> (c, cs)
-- c
oneChar = P $ \cs -> (head cs, tail cs)

> oneChar :: Parser Char
> oneChar = P (\cs -> case cs of
>                c:cs' -> [(c, cs')]
>                _     -> [])

ghci> doParse oneChar "hey!"
[('h',"ey!")]

ghci> doParse oneChar ""
[]

twoChar :: Parser (Char, Char)
twoChar  = P (\cs -> case cs of
             c1:c2:cs' -> [((c1, c2), cs')]
             _         -> [])

ghci> doParse twoChar "hey!"
[(('h', 'e'), "y!")]

ghci> doParse twoChar "h"
[]

Parser Composition

QUIZ

twoChar :: Parser (Char, Char)
twoChar  = P (\cs -> case cs of
             c1:c2:cs' -> [((c1, c2), cs')]
             _         -> [])

Suppose we had some combinator foo such that twoChar', defined as follows, behaved identically to twoChar.

twoChar' :: Parser (Char, Char)
twoChar' = foo oneChar oneChar

Indeed, foo is a parser combinator, which we will call pairP, that takes two parsers and returns a new parser that returns a pair of values:

pairP ::  Parser a -> Parser b -> Parser (a, b)
pairP p1 p2 = P (\cs ->
  [((x,y), cs'') | (x, cs' ) <- doParse p1 cs,
                   (y, cs'') <- doParse p2 cs']
  )

> twoChar = pairP oneChar oneChar

ghci> doParse twoChar "hey!"
[(('h','e'), "y!")]

ghci> doParse twoChar "h"
[]

Now we could keep doing this, but often to go forward, it is helpful to step back and take a look at the bigger picture.

newtype Parser a = P (String -> [(a, String)])

newtype ST a = S (State -> (a, State))

Parser is A Monad

Indeed, a parser, like a state transformer, is a monad, if you squint just the right way. Its bind operation has the type

bindP :: Parser a -> (a -> Parser b) -> Parser b

So, we need to suck the a values out of the first parser and invoke the second parser with them on the remaining part of the string.

QUIZ

doParse           :: Parser a -> String -> [(a, String)]
doParse (P p) str = p str

bindP :: Parser a -> (a -> Parser b) -> Parser b
bindP p1 fp2 = P $ \cs ->
  [(y, cs'') | (x, cs')  <- undefined -- 1
             , (y, cs'') <- undefined -- 2
  ]

> bindP p1 fp2 = P $ \cs -> [(y, cs'') | (x, cs')  <- doParse p1 cs
>                                      , (y, cs'') <- doParse (fp2 x) cs']

See how we suck the a values out of the first parser (by running doParse) and invoke the second parser on each possible a (and the remaining string) to obtain the final b and remainder string tuples.

:type returnP
returnP :: a -> Parser a

> returnP x = P (\cs -> [(x, cs)])
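
As a quick sanity check that these two functions are enough to sequence parsers, the sketch below rebuilds the pairing combinator from bindP and returnP alone. The name pairP' is hypothetical, chosen only to avoid clashing with pairP; everything else repeats definitions from above so that the block stands on its own.

```haskell
newtype Parser a = P (String -> [(a, String)])

doParse :: Parser a -> String -> [(a, String)]
doParse (P p) s = p s

oneChar :: Parser Char
oneChar = P (\cs -> case cs of
               c:cs' -> [(c, cs')]
               _     -> [])

bindP :: Parser a -> (a -> Parser b) -> Parser b
bindP p1 fp2 = P $ \cs -> [ (y, cs'') | (x, cs')  <- doParse p1 cs
                                      , (y, cs'') <- doParse (fp2 x) cs' ]

returnP :: a -> Parser a
returnP x = P (\cs -> [(x, cs)])

-- pairP' (hypothetical name): run p1, feed its result forward, run p2 on
-- the leftover input, and finally package both values without consuming more.
pairP' :: Parser a -> Parser b -> Parser (a, b)
pairP' p1 p2 = p1 `bindP` \x ->
               p2 `bindP` \y ->
               returnP (x, y)
```

Here doParse (pairP' oneChar oneChar) "hey!" behaves like twoChar, and the same expression fails with [] on the one-character input "h".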

> -- newtype Parser a = P (String -> [(a, String)])
> 
> instance Applicative Parser where
>   pure  = returnP
>   (<*>) = ap
> 
> instance Monad Parser where
>   (>>=)  = bindP
>   return = returnP
> 
> sequen :: (Monad m) => [m a] -> m [a]
> sequen []     = return []
> sequen (a:as) = do {x <- a; xs <- sequen as; return (x:xs) }
> 
> strP' :: String -> Parser String
> strP' cs = sequen (map charP cs)
> 
> charP :: Char -> Parser Char
> charP c = satP (c ==)
> 
> -- chooseP :: Parser a -> Parser a -> Parser a
> -- chooseP p1 p2 = P $ \s -> doParse p1 s ++ doParse p2 s

Parser Combinators

Since parsers are monads, we can write a bunch of high-level combinators for composing smaller parsers into bigger ones.

> pairP       :: Parser a -> Parser b -> Parser (a, b)
> pairP px py = do x <- px
>                  y <- py
>                  return (x, y)

Next, let's flex our monadic parsing muscles and write some new parsers. It will be helpful to have a failure parser that always goes down in flames, that is, returns [], i.e. no successful parses.

> failP = P (\_ -> [])

It seems a little silly to write the above, but it's helpful for building up richer parsers like the following, which parses a Char if it satisfies a predicate p

> satP ::  (Char -> Bool) -> Parser Char
> satP p = do c <- oneChar
>             if p c then return c else failP

> lowercaseP = satP isAsciiLower

ghci> doParse (satP ('h' ==)) "mugatu"
[]

ghci> doParse (satP ('h' ==)) "hello"
[('h',"ello")]

> alphaChar = satP isAlpha
> digitChar = satP isDigit

> digitInt  = do c <- digitChar
>                return ((read [c]) :: Int)

ghci> doParse digitInt "92"
[(9,"2")]

ghci> doParse digitInt "cat"
[]

> char c = satP (== c)

EXERCISE: Write a function strP :: String -> Parser String such that strP s parses exactly the string s and nothing else, that is,

ghci> dogeP = strP "doge"

ghci> doParse dogeP "dogerel"
[("doge", "rel")]

ghci> doParse dogeP "doggoneit"
[]

A Nondeterministic Choice Combinator

Next, let's write a combinator that takes two sub-parsers and non-deterministically chooses between them.

chooseP :: Parser a -> Parser a -> Parser a

That is, we want chooseP p1 p2 to return a successful parse if either p1 or p2 succeeds.

We can use chooseP to build a parser that returns either an alphabetic or a numeric character

> alphaNumChar = alphaChar `chooseP` digitChar

ghci> doParse alphaNumChar "cat"
[('c', "at")]
ghci> doParse alphaNumChar "2cat"
[('2', "cat")]
ghci> doParse alphaNumChar "230"
[('2', "30")]

-- a
p1 `chooseP` p2 = do xs <- p1
                     ys <- p2
                     return (x1 ++ x2)
-- b
p1 `chooseP` p2 = do xs <- p1
                     case xs of
                       [] -> p2
                       _  -> return xs
-- c
p1 `chooseP` p2 = P $ \cs -> doParse p1 cs ++ doParse p2 cs

-- d
p1 `chooseP` p2 = P $ \cs -> case doParse p1 cs of
                               [] -> doParse p2 cs
                               rs -> rs

> chooseP :: Parser a -> Parser a -> Parser a
> p1 `chooseP` p2 = P $ \cs -> doParse p1 cs ++ doParse p2 cs

What is even nicer is that if both parsers succeed, you end up with all the results.

> grabn :: Int -> Parser String
> grabn n
>   | n <= 0    = return ""
>   | otherwise = do c  <- oneChar
>                    cs <- grabn (n-1)
>                    return (c:cs)

> grab2or4 = grabn 2 `chooseP` grabn 4

ghci> doParse grab2or4 "mickeymouse"
[("mi","ckeymouse"),("mick","eymouse")]

ghci> doParse grab2or4 "mic"
[("mi","c")]

ghci> doParse grab2or4 "m"
[]
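To make the difference between collecting all results and committing to the first success concrete, the sketch below defines both behaviours side by side. The names chooseAll and chooseFirst are hypothetical, and grabn is written directly against the P constructor (rather than with do-notation) only to keep the block self-contained.

```haskell
newtype Parser a = P (String -> [(a, String)])

doParse :: Parser a -> String -> [(a, String)]
doParse (P p) s = p s

-- Grab exactly n characters; fail if fewer remain. Non-positive n grabs "".
grabn :: Int -> Parser String
grabn n = P $ \cs -> [ splitAt k cs | length cs >= k ]
  where k = max 0 n

-- chooseAll (hypothetical name): keep every result of both parsers.
chooseAll :: Parser a -> Parser a -> Parser a
chooseAll p1 p2 = P $ \cs -> doParse p1 cs ++ doParse p2 cs

-- chooseFirst (hypothetical name): only fall back to p2 when p1 fails.
chooseFirst :: Parser a -> Parser a -> Parser a
chooseFirst p1 p2 = P $ \cs -> case doParse p1 cs of
                                 [] -> doParse p2 cs
                                 rs -> rs
```

On "mickeymouse", grabn 2 `chooseAll` grabn 4 yields both the two-character and the four-character parse, whereas the chooseFirst version yields only the two-character one. On "mic" and "m" the two variants agree, since at most one alternative can succeed.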

Even with the rudimentary parsers we have at our disposal, we can start doing some rather interesting things. For example, here is a little calculator. First, we parse the operation

> intOp      :: Parser (Int -> Int -> Int)
> intOp      = plus `chooseP` minus `chooseP` times `chooseP` divide
>   where
>     plus   = char '+' >> return (+)
>     minus  = char '-' >> return (-)
>     times  = char '*' >> return (*)
>     divide = char '/' >> return div

> calc = do x <- digitInt
>           o <- intOp
>           y <- digitInt
>           return (x `o` y)

ghci> doParse calc "8/2"
[(4,"")]

ghci> doParse calc "8+2cat"
[(10,"cat")]

ghci> doParse calc "8/2cat"
[(4,"cat")]

ghci> doParse calc "8-2cat"
[(6,"cat")]

ghci> doParse calc "8*2cat"
[(16,"cat")]

ghci> doParse calc "99bottles"

Recursive Parsing

To start parsing interesting things, we need to add recursion to our combinators. For example, it's all very well to parse individual characters (as in char above) but it would be a lot more swell if we could grab particular String tokens.

string :: String -> Parser String
string ""     = return ""
string (c:cs) = do char c
                   string cs
                   return (c:cs)

DO IN CLASS Ewww! Is that explicit recursion?! Let's try again (can you spot the pattern?)

> string :: String -> Parser String
> string = undefined -- fill this in

ghci> doParse (string "mic") "mickeyMouse"
[("mic","keyMouse")]

ghci> doParse (string "mic") "donald duck"
[]

Let's write a combinator that takes a parser p that returns an a and returns a parser that returns many a values. That is, it keeps grabbing as many a values as it can and returns them as a [a].

> manyP     :: Parser a -> Parser [a]
> manyP p   = many1 `chooseP` many0
>   where
>     many0 = return []
>     many1 = do x  <- p
>                xs <- manyP p
>                return (x:xs)

ghci> doParse (manyP digitInt) "123a"
[([1,2,3],"a"),([1,2],"3a"),([1],"23a"),([],"123a")]

which is simply all the possible ways to extract sequences of integers from the input string.

Deterministic Maximal Parsing

Often we want a single result, not a set of results. For example, the more intuitive behavior of manyP would be to return the maximal sequence of elements and not all the prefixes.

> (<|>) :: Parser a -> Parser a -> Parser a
> p1 <|> p2 = P $ \cs -> case doParse (p1 `chooseP` p2) cs of
>                          []  -> []
>                          x:_ -> [x]

The above runs the choice parser but returns only the first result. Now, we can revisit the manyP combinator and ensure that it returns a single, maximal sequence

> mmanyP     :: Parser a -> Parser [a]
> mmanyP p   = mmany1 <|> mmany0
>   where
>     mmany0 = return []
>     mmany1 = do x  <- p
>                 xs <- mmanyP p
>                 return (x:xs)

DO IN CLASS Wait a minute! What exactly is the difference between the above and the original manyP? How do you explain this:

ghci> doParse (manyP digitInt) "123a"
[([1,2,3],"a"),([1,2],"3a"),([1],"23a"),([],"123a")]

ghci> doParse (mmanyP digitInt) "123a"
[([1,2,3],"a")]

Let's use the above to write a parser that will return an entire integer (not just a single digit.)

oneInt :: Parser Integer
oneInt = do xs <- mmanyP digitChar
            return $ ((read xs) :: Integer)

As an aside, can you spot the pattern above? We took the parser mmanyP digitChar and simply converted its output using the read function. This is a recurring theme, and the type of what we did gives us a clue

(a -> b) -> Parser a -> Parser b

Aha! That looks a lot like map. Indeed, there is a generalized version of map that we have seen before (lift1) and we bottle up the pattern by declaring Parser to be an instance of the Functor typeclass

> instance Functor Parser where
>   fmap f p = do x <- p
>                 return (f x)

> oneInt ::  Parser Int
> oneInt = read `fmap` mmanyP digitChar

ghci> doParse oneInt "123a"
[(123, "a")]
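One caveat worth noting: because mmanyP happily succeeds with an empty list, oneInt applied to an input with no leading digit (such as "cat") produces a result whose value is read "", which raises an error once it is forced. A possible remedy, sketched below under hypothetical names (orElse, mmany1P, oneInt'), is to insist on at least one digit. The instances are restated in a compact form so that the block is self-contained; this is one way to do it, not necessarily how the course libraries handle it.

```haskell
import Data.Char (isDigit)

newtype Parser a = P (String -> [(a, String)])

doParse :: Parser a -> String -> [(a, String)]
doParse (P p) s = p s

instance Functor Parser where
  fmap f (P p) = P $ \cs -> [ (f x, cs') | (x, cs') <- p cs ]

instance Applicative Parser where
  pure x    = P $ \cs -> [(x, cs)]
  pf <*> px = P $ \cs -> [ (f x, cs'') | (f, cs')  <- doParse pf cs
                                       , (x, cs'') <- doParse px cs' ]

instance Monad Parser where
  p >>= f = P $ \cs -> [ (y, cs'') | (x, cs')  <- doParse p cs
                                   , (y, cs'') <- doParse (f x) cs' ]

digitChar :: Parser Char
digitChar = P $ \cs -> case cs of
  c:cs' | isDigit c -> [(c, cs')]
  _                 -> []

-- orElse (hypothetical name): deterministic choice, first result only.
orElse :: Parser a -> Parser a -> Parser a
orElse p1 p2 = P $ \cs -> take 1 (doParse p1 cs ++ doParse p2 cs)

-- Zero or more matches, maximal sequence only.
mmanyP :: Parser a -> Parser [a]
mmanyP p = (do { x <- p; xs <- mmanyP p; return (x:xs) }) `orElse` return []

-- mmany1P (hypothetical name): like mmanyP, but requires at least one match.
mmany1P :: Parser a -> Parser [a]
mmany1P p = do x  <- p
               xs <- mmanyP p
               return (x:xs)

-- oneInt' never calls read on an empty string.
oneInt' :: Parser Int
oneInt' = read <$> mmany1P digitChar
```

With this variant, doParse oneInt' "123a" still gives a single result with value 123 and remainder "a", while doParse oneInt' "cat" is simply [].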

Parsing Arithmetic Expressions

Let's use the above to build a small calculator that parses and evaluates arithmetic expressions. In essence, an expression is either a binary operator applied to two sub-expressions or an integer. We can state this as

> calc0      ::  Parser Int
> calc0      = binExp <|> oneInt
>   where
>     binExp = do x <- oneInt
>                 o <- intOp
>                 y <- calc0
>                 return $ x `o` y

ghci> doParse calc0 "1+2+33"
[(36,"")]

ghci> doParse calc0 "11+22-33"
[(0,"")]

ghci> doParse calc0 "11+22-33+45"
[(-45,"")]

Huh? Well, if you look back at the code, you’ll realize the above was parsed as

11 + ( 22 - (33 + 45))

because in each binExp we require the left operand to be an integer. In other words, we are assuming that each operator is right associative hence the above result.

ghci> doParse calc0 "10*2+100"
[(1020,"")]

10 * (2 + 100)

> calc1      ::  Parser Int
> calc1      = binExp <|> oneInt
>   where
>     binExp = do x <- calc1
>                 o <- intOp
>                 y <- oneInt
>                 return $ x `o` y

ghci> doParse calc1 "11+22-33+45"

ghci> doParse calc1 "2+2"

Precedence

We can add both associativity and precedence by stratifying the parser into different levels. Here, let's split our operations into addition- and multiplication-precedence levels.

> addOp       = plus `chooseP` minus
>   where
>     plus    = char '+' >> return (+)
>     minus   = char '-' >> return (-)

> mulOp       = times `chooseP` divide
>   where
>     times   = char '*' >> return (*)
>     divide  = char '/' >> return div

Now, we can stratify our language into (mutually recursive) sub-languages, where each top-level expression is parsed as a sum-of-products

> sumE     = addE <|> prodE
>   where
>     addE = do x <- prodE
>               o <- addOp
>               y <- sumE
>               return $ x `o` y
> 
> prodE    = mulE <|> factorE
>   where
>     mulE = do x <- factorE
>               o <- mulOp
>               y <- prodE
>               return $ x `o` y
> 
> factorE = parenP sumE <|> oneInt

ghci> doParse sumE "10*2+100"
[(120,"")]

ghci> doParse sumE "10*(2+100)"
[(1020,"")]

Do you understand why the first parse returned 120? What would happen if we swapped the order of prodE and sumE in the body of addE (or factorE and prodE in the body of prodE)? Why?

factorE :: Parser Int
factorE = parenP sumE <|> oneInt

> parenP p = do char '('
>               x <- p
>               char ')'
>               return x

Parsing Pattern: Chaining

There is not much point gloating about combinators if we are going to write code like the above: the bodies of sumE and prodE are almost identical! In essence, a sumE is of the form

prodE + < prodE + < prodE + ... < prodE >>>

that is, we keep chaining together prodE values and adding them for as long as we can. Similarly a prodE is of the form

factorE * < factorE * < factorE * ... < factorE >>>

where we keep chaining factorE values and multiplying them for as long as we can. There is something unpleasant about the above: the addition operators are right-associative

ghci> doParse sumE "10-1-1"
[(10,"")]

Ugh! I hope you understand why: it's because the above was parsed as 10 - (1 - 1) (right associative) and not (10 - 1) - 1 (left associative). You might be tempted to fix that simply by flipping the order of prodE and sumE

sumE     = addE <|> prodE
  where
    addE = do x <- sumE
              o <- addOp
              y <- prodE
              return $ x `o` y

However, this is a disaster: the parser for sumE directly (recursively) calls itself without consuming any input! Thus, it goes off the deep end and never comes back. Instead, we want to make sure we keep consuming prodE values and adding them up (rather like fold) and so we could do

> sumE1       = prodE1 >>= addE1
>   where
>     addE1 x = grab x <|> return x
>     grab  x = do o <- addOp
>                  y <- prodE1
>                  addE1 $ x `o` y
> 
> prodE1      = factorE1 >>= mulE1
>   where
>     mulE1 x = grab x <|> return x
>     grab  x = do o <- mulOp
>                  y <- factorE1
>                  mulE1 $ x `o` y
> 
> factorE1 = parenP sumE1 <|> oneInt

ghci> doParse sumE1 "10-1-1"
[(8,"")]

It is also very easy to spot and bottle the chaining computation pattern: the only differences are the base parser (prodE1 vs factorE1) and the binary operation (addOp vs mulOp). We simply make those parameters to our chain-left combinator

> p `chainl` op = p >>= rest
>    where
>      rest x   = grab x <|> return x
>      grab x   = do o <- op
>                    y <- p
>                    rest $ x `o` y

> sumE2    = prodE2   `chainl` addOp
> prodE2   = factorE2 `chainl` mulOp
> factorE2 = parenP sumE2 <|> oneInt

ghci> doParse sumE2 "10-1-1"
[(8,"")]

ghci> doParse sumE2 "10*2+1"
[(21,"")]

ghci> doParse sumE2 "10+2*1"
[(12,"")]
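Because chainl is polymorphic in the result type, the same combinator can build a syntax tree instead of evaluating on the fly. The sketch below illustrates this; the Expr type and the names binOp, numE, sumA, prodA and factorA are all hypothetical, and the Parser machinery is restated compactly so the block stands alone.

```haskell
import Data.Char (isDigit)

newtype Parser a = P (String -> [(a, String)])

doParse :: Parser a -> String -> [(a, String)]
doParse (P p) s = p s

instance Functor Parser where
  fmap f (P p) = P $ \cs -> [ (f x, cs') | (x, cs') <- p cs ]

instance Applicative Parser where
  pure x    = P $ \cs -> [(x, cs)]
  pf <*> px = P $ \cs -> [ (f x, cs'') | (f, cs')  <- doParse pf cs
                                       , (x, cs'') <- doParse px cs' ]

instance Monad Parser where
  p >>= f = P $ \cs -> [ (y, cs'') | (x, cs')  <- doParse p cs
                                   , (y, cs'') <- doParse (f x) cs' ]

satP :: (Char -> Bool) -> Parser Char
satP ok = P $ \cs -> case cs of
  c:cs' | ok c -> [(c, cs')]
  _            -> []

char :: Char -> Parser Char
char c = satP (== c)

-- Deterministic choice: the first result of p1, else the first of p2.
(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = P $ \cs -> take 1 (doParse p1 cs ++ doParse p2 cs)

chainl :: Parser a -> Parser (a -> a -> a) -> Parser a
p `chainl` op = p >>= rest
  where
    rest x = grab x <|> return x
    grab x = do o <- op
                y <- p
                rest (x `o` y)

-- Hypothetical AST: a number, or a binary node tagged with its operator.
data Expr = Num Int | Bin Char Expr Expr
  deriving (Eq, Show)

-- Parse the operator character c and return the matching tree constructor.
binOp :: Char -> Parser (Expr -> Expr -> Expr)
binOp c = char c >> return (Bin c)

-- One or more digits, read as a number.
numE :: Parser Expr
numE = do d  <- satP isDigit
          ds <- manyDigits
          return (Num (read (d:ds)))
  where
    manyDigits = (do { c <- satP isDigit; cs <- manyDigits; return (c:cs) }) <|> return ""

sumA, prodA, factorA :: Parser Expr
sumA    = prodA   `chainl` (binOp '+' <|> binOp '-')
prodA   = factorA `chainl` (binOp '*' <|> binOp '/')
factorA = (char '(' >> sumA >>= \e -> char ')' >> return e) <|> numE
```

Running doParse sumA "10-1-1" produces the left-nested tree Bin '-' (Bin '-' (Num 10) (Num 1)) (Num 1), and "10+2*1" places the multiplication below the addition, mirroring the evaluated results above.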

That concludes our in-class exploration of monadic parsing. This is merely the tip of the iceberg. Though parsing is a very old problem that has been studied since the dawn of computing, we saw how monads bring a fresh perspective, one that has recently been transferred from Haskell to many other languages. There have been several exciting recent papers on the subject that you can explore on your own. Finally, Haskell comes with several parser combinator libraries, including Parsec, which you will play around with in HW2.