Adventures in parsing
I've long wanted to dip my toes in the Parsec
water. I've made some attempts
before, but always stumbled on something that put me in the doldrums for so long
that I managed to repress all memories of ever having tried. A few files
scattered in my ~/devo/test/haskell
directory tells the story of my failed
attempts. Until now that is :-)
I picked a nice and regular task for my first real attempt: parsing
/proc/<pid>/maps
. First a look at the man-page offers a good description of
the format of a line:
address perms offset dev inode pathname 08048000-08056000 r-xp 00000000 03:0c 64593 /usr/sbin/gpm
So, I started putting together some datatypes. First off the address range:
data Address = Address { start :: Integer, end :: Integer } deriving Show
Then I decided that the 's'/'p' in the permissions should be called Access
:
data Access = Shared | Private deriving Show
The basic permissions (rwx
) are simply represented as booleans:
data Perms = Perms { read :: Bool, write :: Bool, executable :: Bool, access :: Access } deriving Show
The device is straightforward as well:
data Device = Device { major :: Integer, minor :: Integer } deriving Show
At last I tie it all together in a final datatype that represents a memory region:
data MemRegion = MemRegion { address :: Address, perms :: Perms, offset :: Integer, device :: Device, inode :: Integer, pathname :: String } deriving Show
All types derive Show
(and receive default implementations of show
, at least
when using GHC) so that they are easy to print.
Now, on to the actual "parsec-ing". Faced with the option of writing it top-down
or bottom-up I chose the latter. However, since the format of a single line in
the maps
file is so simple it's easy to imagine what the final function will
look like. I settled on bottom-up since the datatypes provide me with such an
obvious splitting of the line. First off, parsing the address range:
parseAddress = let hexStr2Int = Prelude.read . ("0x" ++) in do start <- many1 hexDigit char '-' end <- many1 hexDigit return $ Address (hexStr2Int start) (hexStr2Int end)
Since the addresses themselves are in hexadecimal and always are of at least
length 1 I use many1 hexDigit
to read them. I think it would be safe to assume
the addresses always are 8 characters (at least on a 32-bit machine) so it would
be possible to use count 8 hexDigit
but I haven't tried it. I've found two
ways of converting a string representation of a hexadecimal number into an
Integer
. Above I use the fact that Prelude.read
interprets a string
beginning with 0x
as a hexadecimal number. The other way I've found is the
slightly less readable fst . (!! 0) . readHex
. According to the man-page the
addresses are separated by a single dash so I've hardcoded that in there.
Testing the function is fairly simple. Using gchi
, first load the source file
then use parse
:
*Main> parse parseAddress "" "0-1" Right (Address {start = 0, end = 1}) *Main> parse parseAddress "hhh" "01234567-89abcdef" Right (Address {start = 19088743, end = 2309737967})
Seems to work well enough. :-)
Next up, parsing the permissions. This is so very straightforward that I don't think I need to comment on it:
parsePerms = let cA a = case a of 'p' -> Private 's' -> Shared in do r <- anyChar w <- anyChar x <- anyChar a <- anyChar return $ Perms (r == 'r') (w == 'w') (x == 'x') (cA a)
For parsing the device information I use the same strategy as for the address range above, this time however the separating charachter is a colon:
parseDevice = let hexStr2Int = Prelude.read . ("0x" ++) in do maj <- many1 digit char ':' min <- many1 digit return $ Device (hexStr2Int maj) (hexStr2Int min)
Next is to tie it all together and create a MemRegion instance:
parseRegion = let hexStr2Int = Prelude.read . ("0x" ++) parsePath = (many1 $ char ' ') >> (many1 $ anyChar) in do addr <- parseAddress char ' ' perm <- parsePerms char ' ' offset <- many1 hexDigit char ' ' dev <- parseDevice char ' ' inode <- many1 digit char ' ' path <- parsePath <|> string "" return $ MemRegion addr perm (hexStr2Int offset) dev (Prelude.read inode) path
The only little trick here is that there are lines that lack the pathname. Here's an example from the man-page:
address perms offset dev inode pathname 08058000-0805b000 rwxp 00000000 00:00 0
It should be noted that it seems there is a space after the inode entry so I
keep a char ' '
in the main function. Then I try to parse the line for a path,
if there is none that attempt will fail immediately and instead I parse for an
empty string, parsePath <|> string ""
. The pathname seems to be prefixed with
a fixed number of spaces, but I'm lazy and just consume one or more. I'm not
sure exactly what characters are allowed in the pathname itself so I'm lazy once
more and just gobble up whatever I find.
To exercise what I had so far I decided to write a function that reads the
maps
file for a specific process, based on its pid
, parses the contents and
collects all the MemRegion
instances in a list.
getMemRegions pid = let fp = "/proc" </> show pid </> "maps" doParseLine' = parse parseRegion "parseRegion" doParseLine l = case (doParseLine' l) of Left _ -> error "Failed to parse line" Right x -> x in do mapContent <- liftM lines $ readFile fp return $ map doParseLine mapContent
The only thing that really is going on here is that the lines are passed from inside an IO monad into the Parser monad and then back again. After this I can try it out by:
*Main> getMemRegions 1
This produces a lot of output so while playing with it I limited the mapping to
the four first lines by using take
. The last line then becomes:
return $ map doParseLine (take 4 mapContent)
Now it's easy to add a main
that uses the first command line argument as the
pid
:
main = do pid <- liftM (Prelude.read . (!! 0)) getArgs regs <- getMemRegions pid mapM_ (putStrLn . show) regs
Well, that concludes my first adventure in parsing :-)
Edit (27-05-2007): I received an email asking for it so here are the import statements I ended up with:
import Control.Monad import System import System.FilePath import Text.ParserCombinators.Parsec
Comment by Conal Elliot:
Congrats on your parser!
Here's an idea that for getting more functional/applicative formulations.
Replace all of the explicitly sequential (do
) parsec code with liftM
,
liftM2
, …. From a quick read-through, I think you can do it. For
parseRegion
, you could use an auxiliary function that discards a following
space, e.g. thenSpace (many1 digit)
.
Also, play with factoring out some of the repeated patterns in parseAddress
,
parsePerms
and parseDevice
.
The more I play with refactoring in my code, the more elegant it gets and the more insight I get. Have fun!
Response to Conal:
Good suggestion. At least if I understand what you mean :-)
Something along the lines of
thenChar c f = f >>= (\ r -> char c >> return r)
with a specialised one for spaces maybe
thenSpace = thenChar ' '
I suppose liftM
and friends can be employed to remove the function calling in
the creation of Address
, Device
and MemRegion
. I'll try to venture into
Parsec
territory once again soon and report on my findings. :-)