03 Jun 2007

Adventures in parsing, part 3

I got a great many comments, at least by my standards, on my earlier two posts on parsing in Haskell. Especially on the latest one. Conal posted a comment on the first pointing me towards liftM and its siblings, without telling me that it would only be the first step towards "applicative style". So, here I go again…

First off, importing Control.Applicative. Apparently <|> is defined in both Applicative and in Parsec. I do use <|> from Parsec so preventing importing it from Applicative seemed like a good idea:

import Control.Applicative hiding ( (<|>) )

Second, Cale pointed out that I need to make an instance for Control.Applicative.Applicative for GenParser. He was nice enough to point out how to do that, leaving syntax the only thing I had to struggle with:

instance Applicative (GenParser c st) where
    pure = return
    (<*>) = ap

I decided to take baby-steps and I started with parseAddress. Here's what it used to look like:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit
        end <- liftM hexStr2Int $ many1 hexDigit
        return $ Address start end

On Twan's suggestion I rewrote it using where rather than let ... in and since this was my first function I decided to go via the ap function (at the same time I broke out hexStr2Int since it's used in so many places):

parseAddress = do
    start <- return hexStr2Int `ap` (thenChar '-' $ many1 hexDigit)
    end <- return hexStr2Int `ap` (many1 hexDigit)
    return $ Address start end

Then on to applying some functions from Applicative:

parseAddress = Address start end
    where
        start = hexStr2Int <$> (thenChar '-' $ many1 hexDigit)
        end = hexStr2Int <$> (many1 hexDigit)

By now the use of thenChar looks a little silly so I changed that part into many1 hexDigit <* char '-' instead. Finally I removed the where part altogether and use <*> to string it all together:

parseAddress = Address <$>
    (hexStr2Int <$> many1 hexDigit <* char '-') <*>
    (hexStr2Int <$> (many1 hexDigit))

From here on I skipped the intermediate steps and went straight for the last form. Here's what I ended up with:

parsePerms = Perms <$>
    ( (== 'r') <$> anyChar) <*>
    ( (== 'w') <$> anyChar) <*>
    ( (== 'x') <$> anyChar) <*>
    (cA <$> anyChar)

    where
        cA a = case a of
            'p' -> Private
            's' -> Shared

parseDevice = Device <$>
    (hexStr2Int <$> many1 hexDigit <* char ':') <*>
    (hexStr2Int <$> (many1 hexDigit))

parseRegion = MemRegion <$>
    (parseAddress <* char ' ') <*>
    (parsePerms <* char ' ') <*>
    (hexStr2Int <$> (many1 hexDigit <* char ' ')) <*>
    (parseDevice <* char ' ') <*>
    (Prelude.read <$> (many1 digit <* char ' ')) <*>
    (parsePath <|> string "")

    where
        parsePath = (many1 $ char ' ') *> (many1 anyChar)

I have to say I'm fairly pleased with this version of the parser. It reads about as easy as the first version and there's none of the "reversing" that thenChar introduced.

Comment by Conal Elliott:

A thing of beauty! I'm glad you stuck with it, Magnus.

Some much smaller points:

The pattern (== c) <$> anyChar (nicely written, btw) arises three times, so it might merit a name.
Similarly for hexStr2Int <$> many1 hexDigit, especially when you rewrite f <$> (a <* b) to (f <$> a) <* b.
The pattern (a <* char ' ') <*> b comes up a lot. How about naming it also, with a nice infix op, say a <#> b?
The cA definition could use pattern matching instead (e.g., cA 'p' = Private and cA 's' = Shared).
Some of your parens are unnecessary (3rd line of parseDevice and last of parseRegion), since application binds more tightly than infix ops.

Comment by Twan van Laarhoven:

First of all, note that you don't need parentheses around parseSomething <* char ' '.

You can also simplify things a bit more by combining hexStr2Int <$> many1 hexDigit into a function, then you could say:

parseHex = hexStr2Int <$> many1 hexDigit
parseAddress = Address <$> parseHex <* char '-' <*> parseHex
parseDevice  = Device <$> parseHex <</em> char ':' <*> parseHex

Also, in cA, should there be a case for character other than 'p' or 's'? Otherwise the program could fail with a pattern match error.

Response to Conal and Twan:

Conal and Twan, thanks for your suggestions. I'll put them into practice and post the "final" result as soon as I find some time.

Comment here.