Posts tagged "parsing":

05 Jun 2007

Adventures in parsing, part 4

I received a few comments on part 3 of this little mini-series and I just wanted to address them. While doing this I still want the main functions of the parser parseXxx to read like the maps file itself. That means I want to avoid "reversing order" like thenChar and thenSpace did in part 2. I also don't want to hide things, e.g. I don't want to introduce a function that turns (a <* char ' ') <*> b into a <#> b.

So, first up is to do something about hexStr2Int <$> many1 hexDigit which appears all over the place. I made it appear in even more places by moving around a few parentheses; the following two functions are the same:

foo = a <$> (b <* c)
bar = (a <$> b) <* c

Then I scrapped hexStr2Int completely and instead introduced hexStr:

hexStr = Prelude.read . ("0x" ++) <$> many1 hexDigit

This means that parseAddress can be rewritten to:

parseAddress = Address <$>
    hexStr <* char '-' <*>
    hexStr

Rather than, as Conal suggested, introduce an infix operation that addresses the pattern (a <* char ' ') <*> b I decided to do something about a <* char c. I feel Conal's suggestion, while shortening the code more than my solution, goes against my wish to not hide things. This is the definition of <##>:

(<##>) l r = l <* char r

After this I rewrote parseAddress into:

parseAddress = Address <$>
    hexStr <##> '-' <*>
    hexStr

The pattern (== c) <$> anyChar appears three times in parsePerms so it got a name and moved down into the where clause. I also modified cA to use pattern matching. I haven't spent much time considering error handling in the parser, so I didn't introduce a pattern matching everything else.

parsePerms = Perms <$>
    pP 'r' <*>
    pP 'w' <*>
    pP 'x' <*>
    (cA <$> anyChar)

    where
        pP c = (== c) <$> anyChar
        cA 'p' = Private
        cA 's' = Shared

The last change I did was remove a bunch of parentheses. I'm always a little hesitant removing parentheses and relying on precedence rules, I find I'm even more hesitant doing it when programming Haskell. Probably due to Haskell having a lot of infix operators that I'm unused to.

The rest of the parser now looks like this:

parseDevice = Device <$>
    hexStr <##> ':' <*>
    hexStr

parseRegion = MemRegion <$>
    parseAddress <##> ' ' <*>
    parsePerms <##> ' ' <*>
    hexStr <##> ' ' <*>
    parseDevice <##> ' ' <*>
    (Prelude.read <$> many1 digit) <##> ' ' <*>
    (parsePath <|> string "")

    where
        parsePath = (many1 $ char ' ') *> (many1 anyChar)

I think these changes address most of the comments Conal and Twan made on the previous part. Where they don't I hope I've explained why I decided not to take their advice.

Comment by Jedaï:

That's really pretty ! Code you can read, but concise, Haskell is really good at that, though I need to look at how Applicative works its magic. :)

Good work !

Comment by Conal Elliot:

Magnus wrote

I also don’t want to hide things, e.g. I don’t want to introduce a function that turns (a <* char ' ') <*> b into a <#> b.

I'm puzzled about this comment. Aren't all of your definitions (as well as much of Parsec and other Haskell libraries) "hiding things"?

What appeals to me about a <#> b = (a <* char ' ') <*> b (and similarly for, say "a <:> b", is that it captures the combination of a character separator and <*>-style application. As your example illustrates (and hadn't previously occurred to me), this combination is very common.

Response to Conal:

Conal, you are right and I was unclear in what I meant. Basically I like the idea of reading the parseXxx functions and see the structure of the original maps file. At the moment I think that

parseAddress = Address <$>
    hexStr <##> '-' <*>
    hexStr

better reflects the structure of the maps file than hiding away the separator inside an operator. I also find it doesn't require me to carry a lot of "mental baggage" when reading the code (I suspect this is the thing that's been bothering me with the love of introducing operators that seems so prevalent among Haskell developers, thanks for helping me put a finger on it). However, your persistence might be paying off ;-) I'm warming to the idea. I just have to come up with a scheme for naming operators that allows easy reading of the code.

Comment by Conal Elliott:

Oh! I'm finally getting what you've meant about "hiding things" vs "reflect[ing] the structure of the maps file". I think you want the separator characters to show up in the parser, and between the sub-parsers that they separate.

Maybe what's missing in my <#> suggestion is that the choice of the space character as a separator is far from obvious, and I guess that's what you're saying about "mental baggage" naming the operators for easy reading.

I suppose you could use sepSpace and sepColon as operator names.

parseAddress = Address <$> hexStr `sepColon` hexStr

parseRegion = MemRegion <$>
    parseAddress `sepSpace`
    parsePerms `sepSpace`
    ...

Still, an actual space/colon character would probably be clearer. For colon, you could use <:>, but what for space?

Response to Conal:

Conal, that's exactly what I mean, just much more clearly expressed than I could ever hope to do.

I too was thinking of the problem with space in an operator…

03 Jun 2007

Adventures in parsing, part 3

I got a great many comments, at least by my standards, on my earlier two posts on parsing in Haskell. Especially on the latest one. Conal posted a comment on the first pointing me towards liftM and its siblings, without telling me that it would only be the first step towards "applicative style". So, here I go again…

First off, importing Control.Applicative. Apparently <|> is defined in both Applicative and in Parsec. I do use <|> from Parsec so preventing importing it from Applicative seemed like a good idea:

import Control.Applicative hiding ( (<|>) )

Second, Cale pointed out that I need to make an instance for Control.Applicative.Applicative for GenParser. He was nice enough to point out how to do that, leaving syntax the only thing I had to struggle with:

instance Applicative (GenParser c st) where
    pure = return
    (<*>) = ap

I decided to take baby-steps and I started with parseAddress. Here's what it used to look like:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit
        end <- liftM hexStr2Int $ many1 hexDigit
        return $ Address start end

On Twan's suggestion I rewrote it using where rather than let ... in and since this was my first function I decided to go via the ap function (at the same time I broke out hexStr2Int since it's used in so many places):

parseAddress = do
    start <- return hexStr2Int `ap` (thenChar '-' $ many1 hexDigit)
    end <- return hexStr2Int `ap` (many1 hexDigit)
    return $ Address start end

Then on to applying some functions from Applicative:

parseAddress = Address start end
    where
        start = hexStr2Int <$> (thenChar '-' $ many1 hexDigit)
        end = hexStr2Int <$> (many1 hexDigit)

By now the use of thenChar looks a little silly so I changed that part into many1 hexDigit <* char '-' instead. Finally I removed the where part altogether and use <*> to string it all together:

parseAddress = Address <$>
    (hexStr2Int <$> many1 hexDigit <* char '-') <*>
    (hexStr2Int <$> (many1 hexDigit))

From here on I skipped the intermediate steps and went straight for the last form. Here's what I ended up with:

parsePerms = Perms <$>
    ( (== 'r') <$> anyChar) <*>
    ( (== 'w') <$> anyChar) <*>
    ( (== 'x') <$> anyChar) <*>
    (cA <$> anyChar)

    where
        cA a = case a of
            'p' -> Private
            's' -> Shared

parseDevice = Device <$>
    (hexStr2Int <$> many1 hexDigit <* char ':') <*>
    (hexStr2Int <$> (many1 hexDigit))

parseRegion = MemRegion <$>
    (parseAddress <* char ' ') <*>
    (parsePerms <* char ' ') <*>
    (hexStr2Int <$> (many1 hexDigit <* char ' ')) <*>
    (parseDevice <* char ' ') <*>
    (Prelude.read <$> (many1 digit <* char ' ')) <*>
    (parsePath <|> string "")

    where
        parsePath = (many1 $ char ' ') *> (many1 anyChar)

I have to say I'm fairly pleased with this version of the parser. It reads about as easy as the first version and there's none of the "reversing" that thenChar introduced.

Comment by Conal Elliott:

A thing of beauty! I'm glad you stuck with it, Magnus.

Some much smaller points:

The pattern (== c) <$> anyChar (nicely written, btw) arises three times, so it might merit a name.
Similarly for hexStr2Int <$> many1 hexDigit, especially when you rewrite f <$> (a <* b) to (f <$> a) <* b.
The pattern (a <* char ' ') <*> b comes up a lot. How about naming it also, with a nice infix op, say a <#> b?
The cA definition could use pattern matching instead (e.g., cA 'p' = Private and cA 's' = Shared).
Some of your parens are unnecessary (3rd line of parseDevice and last of parseRegion), since application binds more tightly than infix ops.

Comment by Twan van Laarhoven:

First of all, note that you don't need parentheses around parseSomething <* char ' '.

You can also simplify things a bit more by combining hexStr2Int <$> many1 hexDigit into a function, then you could say:

parseHex = hexStr2Int <$> many1 hexDigit
parseAddress = Address <$> parseHex <* char '-' <*> parseHex
parseDevice  = Device <$> parseHex <</em> char ':' <*> parseHex

Also, in cA, should there be a case for character other than 'p' or 's'? Otherwise the program could fail with a pattern match error.

Response to Conal and Twan:

Conal and Twan, thanks for your suggestions. I'll put them into practice and post the "final" result as soon as I find some time.

29 May 2007

More adventures in parsing

I received an interesting comment from Conal Elliott on my previous post on parsing I have to admit I wasn't sure I understood him at first, I'm still not sure I do, but I think I have an idea of what he means :-)

Basically my code is very sequential in that I use the do construct everywhere in the parsing code. Personally I thought that makes the parser very easy to read since the code very much mimics the structure of the maps file. I do realise the code isn't very "functional" though so I thought I'd take Conal's comments to heart and see what the result would be.

Let's start with observation that every entity in a line is separated by a space. However some things are separated by other characters. So the first thing I did was write a higher-order function that first reads something, then reads a character and returns the first thing that was read:

thenChar c f = f >>= (\ r -> char c >> return r)

Since space is used as a separator so often I added a short-cut for that:

thenSpace  = thenChar ' '

Then I put that to use on parseAddress:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- thenChar '-' $ many1 hexDigit
        end <- many1 hexDigit
        return $ Address (hexStr2Int start) (hexStr2Int end)

Modifying the other parsing functions using =thenChar~ and thenSpace is straight forward.

I'm not entirely sure I understand what Conal meant with the part about liftM in his comment. I suspect his referring to the fact that I first read characters and then convert them in the "constructors". By using liftM I can move the conversion "up in the code". Here's parseAddress after I've moved the calls to hexStr2Int:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit
        end <- liftM hexStr2Int $ many1 hexDigit
        return $ Address start end

After modifying the other parsing functions in a similar way I ended up with this:

parsePerms = let
        cA a = case a of
            'p' -> Private
            's' -> Shared
    in do
        r <- liftM (== 'r') anyChar
        w <- liftM (== 'w') anyChar
        x <- liftM (== 'x') anyChar
        a <- liftM cA anyChar
        return $ Perms r w x a

parseDevice = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        maj <- liftM hexStr2Int $ thenChar ':' $ many1 hexDigit
        min <- liftM hexStr2Int $ many1 hexDigit
        return $ Device maj min

parseRegion = let
        hexStr2Int = Prelude.read . ("0x" ++)
        parsePath = (many1 $ char ' ') >> (many1 $ anyChar)
    in do
        addr <- thenSpace parseAddress
        perm <- thenSpace parsePerms
        offset <- liftM hexStr2Int $ thenSpace $ many1 hexDigit
        dev <- thenSpace parseDevice
        inode <- liftM Prelude.read $ thenSpace $ many1 digit
        path <- parsePath <|> string ""
        return $ MemRegion addr perm offset dev inode path

Is this code more "functional"? Is it easier to read? You'll have to be the judge of that…

Conal, if I got the intention of your comment completely wrong then feel free to tell me I'm an idiot ;-)

Comment by Holger:

Your story and Conal's comment inspired me to play around with liftM and I came up with this version of parseAddress:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in
        liftM3 (\ x _ y -> Address (hexStr2Int x) (hexStr2Int y)) (many1 hexDigit) (char '-') (many1 hexDigit)

You could probably rewrite all the functions in a similar way, but honestly I find your original code in do-notation much easier to read.

Response to Holger:

Yeah, the same line of thought made it a little difficult to fall asleep yesterday (yes, I know, nerdiness taken to new levels). My thoughts was something like this:

parseAddress = let
        hexStr2Int = ...
    in
        liftM2 Address
            (liftM hexStr2Int $ thenChar '-' $ many1 hexDigit)
            (liftM hexStr2Int $ many1 hexDigit)

I agree with you on readability. I also wonder if laziness could bite back or if `liftM2` guarantees an order of evaluation.

Comment by Holger:

I just looked at the source of liftM2/3 and it seems that it basically just resolves to an expression in do-notation. So it's just a shortcut and therefore should yield the same program.

Comment by Cale Gibbard:

liftM2 guarantees an ordering on the monadic computation, because it's defined like:

liftM2 f x y = do { u <- x; v <- y; return (f u v) }

Though, that's a little different from guaranteeing an order of evaluation – depending on the monad, the order of evaluation will vary. In any case, it shouldn't be much different from what you originally had.

Another thing you might like to play around with, at least in your head, is the fact that:

liftM2 f x y = return f `ap` x `ap` y

liftM3 f x y z = return f `ap` x `ap` y `ap` z

and so on, which leads up to the style embodied by the Control.Applicative library, though if you really want to play around with that, you'll need to write a quick instance of Applicative for GenParser, which should just consist of making pure = return and (<*>) = ap.

Comment by Twan van Laarhoven:

The best way to make parsing code readable is to use Data.Applicative. Also, most people prefer where to let..in if possible. This gives something like:

parseHexStr = Prelude.read . ("0x" ++)  many1 hexDigit
parsePath   = many1 (char ' ') *> many1 anyChar

parseAddress = Address  hexStr *> char '-'  hexStr

parseRegion = MemRegion
            parseAddress *> char ' '
            parsePerms   *> char ' '
            parseHexStr  *> char ' '
            parseDevice  *> char ' '
            (Prelude.read  many1 digit) *> char ' '
            (parsePath  return "")

Basicly (liftM# f x y z) can be written as f x <*> y <*> z.

Comment by Nick:

I find the thenSpace a bit difficult to read.I think something like this is more natural, as it maintains the left-to-right relationship of the parsed data and the following space:

aChar c r = char c >> return r
aSpace = aChar ' '

...
    do
        start <-  many1 hexDigit >> aChar
....

note that I haven't tested this and my haskell-fu is not very strong, so I could be way off here.

Comment by Conal Elliot:

Yes, that's exactly the direction i had in mind. once you switch from "do" style to "liftM#" style, it's a simple step to replace the monad operators to applicative functor operators.

Response to Nick:

Nick, yes your way of writing it is easier to read. You'll need to change to using >>= though:

do
    start <- many1 hexDigit >>= aChar '-'

Then you can mix in `liftM` as well:

do
    start <- liftM hexStr2Int $ many1 hexDigit >>= aChar '-'

to do the conversion. However, I think my initial version is even clearer:

do
    start <- liftM hexStr2Int $ many1 hexDigitc
    char '-'

Response to Conal Elliot:

Conal, ok, you're one sneaky little b… ;-) I'll have to look at the applicative operators to see what I think of them.

27 May 2007

Adventures in parsing

I've long wanted to dip my toes in the Parsec water. I've made some attempts before, but always stumbled on something that put me in the doldrums for so long that I managed to repress all memories of ever having tried. A few files scattered in my ~/devo/test/haskell directory tells the story of my failed attempts. Until now that is :-)

I picked a nice and regular task for my first real attempt: parsing /proc/<pid>/maps. First a look at the man-page offers a good description of the format of a line:

address           perms offset  dev   inode      pathname
08048000-08056000 r-xp 00000000 03:0c 64593      /usr/sbin/gpm

So, I started putting together some datatypes. First off the address range:

data Address = Address { start :: Integer, end :: Integer }
    deriving Show

Then I decided that the 's'/'p' in the permissions should be called Access:

data Access = Shared | Private
    deriving Show

The basic permissions (rwx) are simply represented as booleans:

data Perms = Perms {
        read :: Bool,
        write :: Bool,
        executable :: Bool,
        access :: Access
    }
    deriving Show

The device is straightforward as well:

data Device = Device { major :: Integer, minor :: Integer }
    deriving Show

At last I tie it all together in a final datatype that represents a memory region:

data MemRegion = MemRegion {
        address :: Address,
        perms :: Perms,
        offset :: Integer,
        device :: Device,
        inode :: Integer,
        pathname :: String
    }
    deriving Show

All types derive Show (and receive default implementations of show, at least when using GHC) so that they are easy to print.

Now, on to the actual "parsec-ing". Faced with the option of writing it top-down or bottom-up I chose the latter. However, since the format of a single line in the maps file is so simple it's easy to imagine what the final function will look like. I settled on bottom-up since the datatypes provide me with such an obvious splitting of the line. First off, parsing the address range:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- many1 hexDigit
        char '-'
        end <- many1 hexDigit
        return $ Address (hexStr2Int start) (hexStr2Int end)

Since the addresses themselves are in hexadecimal and always are of at least length 1 I use many1 hexDigit to read them. I think it would be safe to assume the addresses always are 8 characters (at least on a 32-bit machine) so it would be possible to use count 8 hexDigit but I haven't tried it. I've found two ways of converting a string representation of a hexadecimal number into an Integer. Above I use the fact that Prelude.read interprets a string beginning with 0x as a hexadecimal number. The other way I've found is the slightly less readable fst . (!! 0) . readHex. According to the man-page the addresses are separated by a single dash so I've hardcoded that in there.

Testing the function is fairly simple. Using gchi, first load the source file then use parse:

*Main> parse parseAddress "" "0-1"
Right (Address {start = 0, end = 1})
*Main> parse parseAddress "hhh" "01234567-89abcdef"
Right (Address {start = 19088743, end = 2309737967})

Seems to work well enough. :-)

Next up, parsing the permissions. This is so very straightforward that I don't think I need to comment on it:

parsePerms = let
        cA a = case a of
            'p' -> Private
            's' -> Shared
    in do
        r <- anyChar
        w <- anyChar
        x <- anyChar
        a <- anyChar
        return $ Perms (r == 'r') (w == 'w') (x == 'x') (cA a)

For parsing the device information I use the same strategy as for the address range above, this time however the separating charachter is a colon:

parseDevice = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        maj <- many1 digit
        char ':'
        min <- many1 digit
        return $ Device (hexStr2Int maj) (hexStr2Int min)

Next is to tie it all together and create a MemRegion instance:

parseRegion = let
        hexStr2Int = Prelude.read . ("0x" ++)
        parsePath = (many1 $ char ' ') >> (many1 $ anyChar)
    in do
        addr <- parseAddress
        char ' '
        perm <- parsePerms
        char ' '
        offset <- many1 hexDigit
        char ' '
        dev <- parseDevice
        char ' '
        inode <- many1 digit
        char ' '
        path <- parsePath <|> string ""
        return $ MemRegion addr perm (hexStr2Int offset) dev (Prelude.read inode) path

The only little trick here is that there are lines that lack the pathname. Here's an example from the man-page:

address           perms offset  dev   inode      pathname
08058000-0805b000 rwxp 00000000 00:00 0

It should be noted that it seems there is a space after the inode entry so I keep a char ' ' in the main function. Then I try to parse the line for a path, if there is none that attempt will fail immediately and instead I parse for an empty string, parsePath <|> string "". The pathname seems to be prefixed with a fixed number of spaces, but I'm lazy and just consume one or more. I'm not sure exactly what characters are allowed in the pathname itself so I'm lazy once more and just gobble up whatever I find.

To exercise what I had so far I decided to write a function that reads the maps file for a specific process, based on its pid, parses the contents and collects all the MemRegion instances in a list.

getMemRegions pid = let
        fp = "/proc" </> show pid </> "maps"
        doParseLine' = parse parseRegion "parseRegion"
        doParseLine l = case (doParseLine' l) of
            Left _ -> error "Failed to parse line"
            Right x -> x
    in do
        mapContent <- liftM lines $ readFile fp
        return $ map doParseLine mapContent

The only thing that really is going on here is that the lines are passed from inside an IO monad into the Parser monad and then back again. After this I can try it out by:

*Main> getMemRegions 1

This produces a lot of output so while playing with it I limited the mapping to the four first lines by using take. The last line then becomes:

return $ map doParseLine (take 4 mapContent)

Now it's easy to add a main that uses the first command line argument as the pid:

main = do
    pid <- liftM (Prelude.read . (!! 0)) getArgs
    regs <- getMemRegions pid
    mapM_ (putStrLn . show) regs

Well, that concludes my first adventure in parsing :-)

Edit (27-05-2007): I received an email asking for it so here are the import statements I ended up with:

import Control.Monad
import System
import System.FilePath
import Text.ParserCombinators.Parsec

Comment by Conal Elliot:

Congrats on your parser!

Here's an idea that for getting more functional/applicative formulations. Replace all of the explicitly sequential (do) parsec code with liftM, liftM2, …. From a quick read-through, I think you can do it. For parseRegion, you could use an auxiliary function that discards a following space, e.g. thenSpace (many1 digit).

Also, play with factoring out some of the repeated patterns in parseAddress, parsePerms and parseDevice.

The more I play with refactoring in my code, the more elegant it gets and the more insight I get. Have fun!

Response to Conal:

Good suggestion. At least if I understand what you mean :-)

Something along the lines of

thenChar c f = f >>= (\ r -> char c >> return r)

with a specialised one for spaces maybe

thenSpace = thenChar ' '

I suppose liftM and friends can be employed to remove the function calling in the creation of Address, Device and MemRegion. I'll try to venture into Parsec territory once again soon and report on my findings. :-)