Adventures in parsing

I’ve long wanted to dip my toes in the Parsec water. I’ve made some attempts before, but always stumbled on something that put me in the doldrums for so long that I managed to repress all memories of ever having tried. A few files scattered in my ~/devo/test/haskell directory tells the story of my failed attempts. Until now that is :-)

I picked a nice and regular task for my first real attempt: parsing /proc/<pid>/maps. First a look at the man-page offers a good description of the format of a line:

address           perms offset  dev   inode      pathname
08048000-08056000 r-xp 00000000 03:0c 64593      /usr/sbin/gpm

So, I started putting together some datatypes. First off the address range:

Then I decided that the ‘s’/‘p’ in the permissions should be called Access:

The basic permissions (rwx) are simply represented as booleans:

The device is straightforward as well:

At last I tie it all together in a final datatype that represents a memory region:

All types derive Show (and receive default implementations of show, at least when using GHC) so that they are easy to print.

Now, on to the actual “parsec-ing”. Faced with the option of writing it top-down or bottom-up I chose the latter. However, since the format of a single line in the maps file is so simple it’s easy to imagine what the final function will look like. I settled on bottom-up since the datatypes provide me with such an obvious splitting of the line. First off, parsing the address range:

Since the addresses themselves are in hexadecimal and always are of at least length 1 I use many1 hexDigit to read them. I think it would be safe to assume the addresses always are 8 characters (at least on a 32-bit machine) so it would be possible to use count 8 hexDigit but I haven’t tried it. I’ve found two ways of converting a string representation of a hexadecimal number into an Integer. Above I use the fact that Prelude.read interprets a string beginning with 0x as a hexadecimal number. The other way I’ve found is the slightly less readable fst . (!! 0) . readHex. According to the man-page the addresses are separated by a single dash so I’ve hardcoded that in there.

Testing the function is fairly simple. Using gchi, first load the source file then use parse:

*Main> parse parseAddress "" "0-1"
Right (Address {start = 0, end = 1})
*Main> parse parseAddress "hhh" "01234567-89abcdef"
Right (Address {start = 19088743, end = 2309737967})

Seems to work well enough. :-)

Next up, parsing the permissions. This is so very straightforward that I don’t think I need to comment on it:

For parsing the device information I use the same strategy as for the address range above, this time however the separating charachter is a colon:

Next is to tie it all together and create a MemRegion instance:

The only little trick here is that there are lines that lack the pathname. Here’s an example from the man-page:

address           perms offset  dev   inode      pathname
08058000-0805b000 rwxp 00000000 00:00 0

It should be noted that it seems there is a space after the inode entry so I keep a char ' ' in the main function. Then I try to parse the line for a path, if there is none that attempt will fail immediately and instead I parse for an empty string, parsePath <|> string "". The pathname seems to be prefixed with a fixed number of spaces, but I’m lazy and just consume one or more. I’m not sure exactly what characters are allowed in the pathname itself so I’m lazy once more and just gobble up whatever I find.

To exercise what I had so far I decided to write a function that reads the maps file for a specific process, based on its pid, parses the contents and collects all the MemRegion instances in a list.

The only thing that really is going on here is that the lines are passed from inside an IO monad into the Parser monad and then back again. After this I can try it out by:

*Main> getMemRegions 1

This produces a lot of output so while playing with it I limited the mapping to the four first lines by using take. The last line then becomes:

Now it’s easy to add a main that uses the first command line argument as the pid:

Well, that concludes my first adventure in parsing :-)

[Edit 27-05-2007 13:15]

I received an email asking for it so here are the import statements I ended up with:

⟸ I'm on a planet... More adventures in parsing ⟹

Conal Elliott

Congrats on your parser!

Here’s an idea that for getting more functional/applicative formulations. Replace all of the explicitly sequential (“do”) parsec code with liftM, liftM2, …. From a quick read-through, I think you can do it. For parseRegion, you could use an auxiliary function that discards a following space, e.g. “thenSpace (many1 digit)”.

Also, play with factoring out some of the repeated patterns in parseAddress, parsePerms and parseDevice.

The more I play with refactoring in my code, the more elegant it gets and the more insight I get. Have fun!

Magnus

Conal,

Good suggestion. At least if I understand what you mean :-)

Something along the lines of

with a specialised one for spaces maybe

I suppose liftM and friends can be employed to remove the function calling in the creation of Address, Device and MemRegion. I’ll try to venture into Parsec territory once again soon and report on my findings. :-)

Richard Careaga

I saw a reference to this on the Haskell Wiki but the links were broken. So that no pilgrim may be left behind I tracked down the revised URLs and got edit permissions to update it at https://wiki.haskell.org/Parsec

Thanks for the guidance.

[@technocrat](https://twitter.com/technocrat)

Magnus Therning

Thanks Richard. When I moved away from Wordpad for my site I pretty much broke every single link that’s out there and I’ve clearly not found and changed all of them.

Leave a comment