27 May 2007

Adventures in parsing

I've long wanted to dip my toes in the Parsec water. I've made some attempts before, but always stumbled on something that put me in the doldrums for so long that I managed to repress all memories of ever having tried. A few files scattered in my ~/devo/test/haskell directory tells the story of my failed attempts. Until now that is :-)

I picked a nice and regular task for my first real attempt: parsing /proc/<pid>/maps. First a look at the man-page offers a good description of the format of a line:

address           perms offset  dev   inode      pathname
08048000-08056000 r-xp 00000000 03:0c 64593      /usr/sbin/gpm

So, I started putting together some datatypes. First off the address range:

data Address = Address { start :: Integer, end :: Integer }
    deriving Show

Then I decided that the 's'/'p' in the permissions should be called Access:

data Access = Shared | Private
    deriving Show

The basic permissions (rwx) are simply represented as booleans:

data Perms = Perms {
        read :: Bool,
        write :: Bool,
        executable :: Bool,
        access :: Access
    }
    deriving Show

The device is straightforward as well:

data Device = Device { major :: Integer, minor :: Integer }
    deriving Show

At last I tie it all together in a final datatype that represents a memory region:

data MemRegion = MemRegion {
        address :: Address,
        perms :: Perms,
        offset :: Integer,
        device :: Device,
        inode :: Integer,
        pathname :: String
    }
    deriving Show

All types derive Show (and receive default implementations of show, at least when using GHC) so that they are easy to print.

Now, on to the actual "parsec-ing". Faced with the option of writing it top-down or bottom-up I chose the latter. However, since the format of a single line in the maps file is so simple it's easy to imagine what the final function will look like. I settled on bottom-up since the datatypes provide me with such an obvious splitting of the line. First off, parsing the address range:

parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- many1 hexDigit
        char '-'
        end <- many1 hexDigit
        return $ Address (hexStr2Int start) (hexStr2Int end)

Since the addresses themselves are in hexadecimal and always are of at least length 1 I use many1 hexDigit to read them. I think it would be safe to assume the addresses always are 8 characters (at least on a 32-bit machine) so it would be possible to use count 8 hexDigit but I haven't tried it. I've found two ways of converting a string representation of a hexadecimal number into an Integer. Above I use the fact that Prelude.read interprets a string beginning with 0x as a hexadecimal number. The other way I've found is the slightly less readable fst . (!! 0) . readHex. According to the man-page the addresses are separated by a single dash so I've hardcoded that in there.

Testing the function is fairly simple. Using gchi, first load the source file then use parse:

*Main> parse parseAddress "" "0-1"
Right (Address {start = 0, end = 1})
*Main> parse parseAddress "hhh" "01234567-89abcdef"
Right (Address {start = 19088743, end = 2309737967})

Seems to work well enough. :-)

Next up, parsing the permissions. This is so very straightforward that I don't think I need to comment on it:

parsePerms = let
        cA a = case a of
            'p' -> Private
            's' -> Shared
    in do
        r <- anyChar
        w <- anyChar
        x <- anyChar
        a <- anyChar
        return $ Perms (r == 'r') (w == 'w') (x == 'x') (cA a)

For parsing the device information I use the same strategy as for the address range above, this time however the separating charachter is a colon:

parseDevice = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        maj <- many1 digit
        char ':'
        min <- many1 digit
        return $ Device (hexStr2Int maj) (hexStr2Int min)

Next is to tie it all together and create a MemRegion instance:

parseRegion = let
        hexStr2Int = Prelude.read . ("0x" ++)
        parsePath = (many1 $ char ' ') >> (many1 $ anyChar)
    in do
        addr <- parseAddress
        char ' '
        perm <- parsePerms
        char ' '
        offset <- many1 hexDigit
        char ' '
        dev <- parseDevice
        char ' '
        inode <- many1 digit
        char ' '
        path <- parsePath <|> string ""
        return $ MemRegion addr perm (hexStr2Int offset) dev (Prelude.read inode) path

The only little trick here is that there are lines that lack the pathname. Here's an example from the man-page:

address           perms offset  dev   inode      pathname
08058000-0805b000 rwxp 00000000 00:00 0

It should be noted that it seems there is a space after the inode entry so I keep a char ' ' in the main function. Then I try to parse the line for a path, if there is none that attempt will fail immediately and instead I parse for an empty string, parsePath <|> string "". The pathname seems to be prefixed with a fixed number of spaces, but I'm lazy and just consume one or more. I'm not sure exactly what characters are allowed in the pathname itself so I'm lazy once more and just gobble up whatever I find.

To exercise what I had so far I decided to write a function that reads the maps file for a specific process, based on its pid, parses the contents and collects all the MemRegion instances in a list.

getMemRegions pid = let
        fp = "/proc" </> show pid </> "maps"
        doParseLine' = parse parseRegion "parseRegion"
        doParseLine l = case (doParseLine' l) of
            Left _ -> error "Failed to parse line"
            Right x -> x
    in do
        mapContent <- liftM lines $ readFile fp
        return $ map doParseLine mapContent

The only thing that really is going on here is that the lines are passed from inside an IO monad into the Parser monad and then back again. After this I can try it out by:

*Main> getMemRegions 1

This produces a lot of output so while playing with it I limited the mapping to the four first lines by using take. The last line then becomes:

return $ map doParseLine (take 4 mapContent)

Now it's easy to add a main that uses the first command line argument as the pid:

main = do
    pid <- liftM (Prelude.read . (!! 0)) getArgs
    regs <- getMemRegions pid
    mapM_ (putStrLn . show) regs

Well, that concludes my first adventure in parsing :-)

Edit (27-05-2007): I received an email asking for it so here are the import statements I ended up with:

import Control.Monad
import System
import System.FilePath
import Text.ParserCombinators.Parsec

Comment by Conal Elliot:

Congrats on your parser!

Here's an idea that for getting more functional/applicative formulations. Replace all of the explicitly sequential (do) parsec code with liftM, liftM2, …. From a quick read-through, I think you can do it. For parseRegion, you could use an auxiliary function that discards a following space, e.g. thenSpace (many1 digit).

Also, play with factoring out some of the repeated patterns in parseAddress, parsePerms and parseDevice.

The more I play with refactoring in my code, the more elegant it gets and the more insight I get. Have fun!

Response to Conal:

Good suggestion. At least if I understand what you mean :-)

Something along the lines of

thenChar c f = f >>= (\ r -> char c >> return r)

with a specialised one for spaces maybe

thenSpace = thenChar ' '

I suppose liftM and friends can be employed to remove the function calling in the creation of Address, Device and MemRegion. I'll try to venture into Parsec territory once again soon and report on my findings. :-)

Comment here.