05 Jun 2007

Adventures in parsing, part 4

I received a few comments on part 3 of this little mini-series and I just wanted to address them. While doing this I still want the main functions of the parser parseXxx to read like the maps file itself. That means I want to avoid "reversing order" like thenChar and thenSpace did in part 2. I also don't want to hide things, e.g. I don't want to introduce a function that turns (a <* char ' ') <*> b into a <#> b.

So, first up is to do something about hexStr2Int <$> many1 hexDigit which appears all over the place. I made it appear in even more places by moving around a few parentheses; the following two functions are the same:

foo = a <$> (b <* c)
bar = (a <$> b) <* c

Then I scrapped hexStr2Int completely and instead introduced hexStr:

hexStr = Prelude.read . ("0x" ++) <$> many1 hexDigit

This means that parseAddress can be rewritten to:

parseAddress = Address <$>
    hexStr <* char '-' <*>
    hexStr

Rather than, as Conal suggested, introduce an infix operation that addresses the pattern (a <* char ' ') <*> b I decided to do something about a <* char c. I feel Conal's suggestion, while shortening the code more than my solution, goes against my wish to not hide things. This is the definition of <##>:

(<##>) l r = l <* char r

After this I rewrote parseAddress into:

parseAddress = Address <$>
    hexStr <##> '-' <*>
    hexStr

The pattern (== c) <$> anyChar appears three times in parsePerms so it got a name and moved down into the where clause. I also modified cA to use pattern matching. I haven't spent much time considering error handling in the parser, so I didn't introduce a pattern matching everything else.

parsePerms = Perms <$>
    pP 'r' <*>
    pP 'w' <*>
    pP 'x' <*>
    (cA <$> anyChar)

    where
        pP c = (== c) <$> anyChar
        cA 'p' = Private
        cA 's' = Shared

The last change I did was remove a bunch of parentheses. I'm always a little hesitant removing parentheses and relying on precedence rules, I find I'm even more hesitant doing it when programming Haskell. Probably due to Haskell having a lot of infix operators that I'm unused to.

The rest of the parser now looks like this:

parseDevice = Device <$>
    hexStr <##> ':' <*>
    hexStr

parseRegion = MemRegion <$>
    parseAddress <##> ' ' <*>
    parsePerms <##> ' ' <*>
    hexStr <##> ' ' <*>
    parseDevice <##> ' ' <*>
    (Prelude.read <$> many1 digit) <##> ' ' <*>
    (parsePath <|> string "")

    where
        parsePath = (many1 $ char ' ') *> (many1 anyChar)

I think these changes address most of the comments Conal and Twan made on the previous part. Where they don't I hope I've explained why I decided not to take their advice.

Comment by Jedaï:

That's really pretty ! Code you can read, but concise, Haskell is really good at that, though I need to look at how Applicative works its magic. :)

Good work !

Comment by Conal Elliot:

Magnus wrote

I also don’t want to hide things, e.g. I don’t want to introduce a function that turns (a <* char ' ') <*> b into a <#> b.

I'm puzzled about this comment. Aren't all of your definitions (as well as much of Parsec and other Haskell libraries) "hiding things"?

What appeals to me about a <#> b = (a <* char ' ') <*> b (and similarly for, say "a <:> b", is that it captures the combination of a character separator and <*>-style application. As your example illustrates (and hadn't previously occurred to me), this combination is very common.

Response to Conal:

Conal, you are right and I was unclear in what I meant. Basically I like the idea of reading the parseXxx functions and see the structure of the original maps file. At the moment I think that

parseAddress = Address <$>
    hexStr <##> '-' <*>
    hexStr

better reflects the structure of the maps file than hiding away the separator inside an operator. I also find it doesn't require me to carry a lot of "mental baggage" when reading the code (I suspect this is the thing that's been bothering me with the love of introducing operators that seems so prevalent among Haskell developers, thanks for helping me put a finger on it). However, your persistence might be paying off ;-) I'm warming to the idea. I just have to come up with a scheme for naming operators that allows easy reading of the code.

Comment by Conal Elliott:

Oh! I'm finally getting what you've meant about "hiding things" vs "reflect[ing] the structure of the maps file". I think you want the separator characters to show up in the parser, and between the sub-parsers that they separate.

Maybe what's missing in my <#> suggestion is that the choice of the space character as a separator is far from obvious, and I guess that's what you're saying about "mental baggage" naming the operators for easy reading.

I suppose you could use sepSpace and sepColon as operator names.

parseAddress = Address <$> hexStr `sepColon` hexStr

parseRegion = MemRegion <$>
    parseAddress `sepSpace`
    parsePerms `sepSpace`
    ...

Still, an actual space/colon character would probably be clearer. For colon, you could use <:>, but what for space?

Response to Conal:

Conal, that's exactly what I mean, just much more clearly expressed than I could ever hope to do.

I too was thinking of the problem with space in an operator…

Comment here.