29 May 2007

More adventures in parsing

I received an interesting comment from Conal Elliott on my previous post on parsing I have to admit I wasn't sure I understood him at first, I'm still not sure I do, but I think I have an idea of what he means :-)

Basically my code is very sequential in that I use the do construct everywhere in the parsing code. Personally I thought that makes the parser very easy to read since the code very much mimics the structure of the maps file. I do realise the code isn't very "functional" though so I thought I'd take Conal's comments to heart and see what the result would be.

Let's start with observation that every entity in a line is separated by a space. However some things are separated by other characters. So the first thing I did was write a higher-order function that first reads something, then reads a character and returns the first thing that was read:

thenChar c f = f >>= (\ r -> char c >> return r)

Since space is used as a separator so often I added a short-cut for that:

thenSpace  = thenChar ' '

Then I put that to use on parseAddress:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- thenChar '-' $ many1 hexDigit
        end <- many1 hexDigit
        return $ Address (hexStr2Int start) (hexStr2Int end)

Modifying the other parsing functions using =thenChar~ and thenSpace is straight forward.

I'm not entirely sure I understand what Conal meant with the part about liftM in his comment. I suspect his referring to the fact that I first read characters and then convert them in the "constructors". By using liftM I can move the conversion "up in the code". Here's parseAddress after I've moved the calls to hexStr2Int:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit
        end <- liftM hexStr2Int $ many1 hexDigit
        return $ Address start end

After modifying the other parsing functions in a similar way I ended up with this:

parsePerms = let
        cA a = case a of
            'p' -> Private
            's' -> Shared
    in do
        r <- liftM (== 'r') anyChar
        w <- liftM (== 'w') anyChar
        x <- liftM (== 'x') anyChar
        a <- liftM cA anyChar
        return $ Perms r w x a

parseDevice = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        maj <- liftM hexStr2Int $ thenChar ':' $ many1 hexDigit
        min <- liftM hexStr2Int $ many1 hexDigit
        return $ Device maj min

parseRegion = let
        hexStr2Int = Prelude.read . ("0x" ++)
        parsePath = (many1 $ char ' ') >> (many1 $ anyChar)
    in do
        addr <- thenSpace parseAddress
        perm <- thenSpace parsePerms
        offset <- liftM hexStr2Int $ thenSpace $ many1 hexDigit
        dev <- thenSpace parseDevice
        inode <- liftM Prelude.read $ thenSpace $ many1 digit
        path <- parsePath <|> string ""
        return $ MemRegion addr perm offset dev inode path

Is this code more "functional"? Is it easier to read? You'll have to be the judge of that…

Conal, if I got the intention of your comment completely wrong then feel free to tell me I'm an idiot ;-)

Comment by Holger:

Your story and Conal's comment inspired me to play around with liftM and I came up with this version of parseAddress:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in
        liftM3 (\ x _ y -> Address (hexStr2Int x) (hexStr2Int y)) (many1 hexDigit) (char '-') (many1 hexDigit)

You could probably rewrite all the functions in a similar way, but honestly I find your original code in do-notation much easier to read.

Response to Holger:

Yeah, the same line of thought made it a little difficult to fall asleep yesterday (yes, I know, nerdiness taken to new levels). My thoughts was something like this:

parseAddress = let
        hexStr2Int = ...
    in
        liftM2 Address
            (liftM hexStr2Int $ thenChar '-' $ many1 hexDigit)
            (liftM hexStr2Int $ many1 hexDigit)

I agree with you on readability. I also wonder if laziness could bite back or if `liftM2` guarantees an order of evaluation.

Comment by Holger:

I just looked at the source of liftM2/3 and it seems that it basically just resolves to an expression in do-notation. So it's just a shortcut and therefore should yield the same program.

Comment by Cale Gibbard:

liftM2 guarantees an ordering on the monadic computation, because it's defined like:

liftM2 f x y = do { u <- x; v <- y; return (f u v) }

Though, that's a little different from guaranteeing an order of evaluation – depending on the monad, the order of evaluation will vary. In any case, it shouldn't be much different from what you originally had.

Another thing you might like to play around with, at least in your head, is the fact that:

liftM2 f x y = return f `ap` x `ap` y

liftM3 f x y z = return f `ap` x `ap` y `ap` z

and so on, which leads up to the style embodied by the Control.Applicative library, though if you really want to play around with that, you'll need to write a quick instance of Applicative for GenParser, which should just consist of making pure = return and (<*>) = ap.

Comment by Twan van Laarhoven:

The best way to make parsing code readable is to use Data.Applicative. Also, most people prefer where to let..in if possible. This gives something like:

parseHexStr = Prelude.read . ("0x" ++)  many1 hexDigit
parsePath   = many1 (char ' ') *> many1 anyChar

parseAddress = Address  hexStr *> char '-'  hexStr

parseRegion = MemRegion
            parseAddress *> char ' '
            parsePerms   *> char ' '
            parseHexStr  *> char ' '
            parseDevice  *> char ' '
            (Prelude.read  many1 digit) *> char ' '
            (parsePath  return "")

Basicly (liftM# f x y z) can be written as f x <*> y <*> z.

Comment by Nick:

I find the thenSpace a bit difficult to read.I think something like this is more natural, as it maintains the left-to-right relationship of the parsed data and the following space:

aChar c r = char c >> return r
aSpace = aChar ' '

...
    do
        start <-  many1 hexDigit >> aChar
....

note that I haven't tested this and my haskell-fu is not very strong, so I could be way off here.

Comment by Conal Elliot:

Yes, that's exactly the direction i had in mind. once you switch from "do" style to "liftM#" style, it's a simple step to replace the monad operators to applicative functor operators.

Response to Nick:

Nick, yes your way of writing it is easier to read. You'll need to change to using >>= though:

do
    start <- many1 hexDigit >>= aChar '-'

Then you can mix in `liftM` as well:

do
    start <- liftM hexStr2Int $ many1 hexDigit >>= aChar '-'

to do the conversion. However, I think my initial version is even clearer:

do
    start <- liftM hexStr2Int $ many1 hexDigitc
    char '-'

Response to Conal Elliot:

Conal, ok, you're one sneaky little b… ;-) I'll have to look at the applicative operators to see what I think of them.

Comment here.