Adventures in parsing
- Magnus Therning
I’ve long wanted to dip my toes in the Parsec water. I’ve made some attempts before, but always stumbled on something that put me in the doldrums for so long that I managed to repress all memories of ever having tried. A few files scattered in my ~/devo/test/haskell directory tell the story of my failed attempts. Until now, that is :-)
I picked a nice and regular task for my first real attempt: parsing /proc/<pid>/maps. First, a look at the man page, which offers a good description of the format of a line:
address perms offset dev inode pathname
08048000-08056000 r-xp 00000000 03:0c 64593 /usr/sbin/gpm
So, I started putting together some datatypes. First off the address range:
data Address = Address { start :: Integer, end :: Integer }
    deriving Show
Then I decided that the ‘s’/‘p’ in the permissions should be called Access:
data Access = Shared | Private
    deriving Show
The basic permissions (rwx) are simply represented as booleans:
data Perms = Perms {
    read :: Bool,
    write :: Bool,
    executable :: Bool,
    access :: Access
    } deriving Show
The device is straightforward as well:
data Device = Device { major :: Integer, minor :: Integer }
    deriving Show
At last I tie it all together in a final datatype that represents a memory region:
data MemRegion = MemRegion {
    address :: Address,
    perms :: Perms,
    offset :: Integer,
    device :: Device,
    inode :: Integer,
    pathname :: String
    } deriving Show
All types derive Show (and receive default implementations of show, at least when using GHC) so that they are easy to print.
Now, on to the actual “parsec-ing”. Faced with the option of writing it top-down or bottom-up, I chose the latter. However, since the format of a single line in the maps file is so simple, it’s easy to imagine what the final function will look like. I settled on bottom-up since the datatypes provide me with such an obvious splitting of the line. First off, parsing the address range:
parseAddress = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        start <- many1 hexDigit
        char '-'
        end <- many1 hexDigit
        return $ Address (hexStr2Int start) (hexStr2Int end)
Since the addresses themselves are in hexadecimal and are always at least one character long, I use many1 hexDigit to read them. I think it would be safe to assume the addresses are always 8 characters (at least on a 32-bit machine), so it would be possible to use count 8 hexDigit, but I haven’t tried it. I’ve found two ways of converting a string representation of a hexadecimal number into an Integer. Above I use the fact that Prelude.read interprets a string beginning with 0x as a hexadecimal number. The other way I’ve found is the slightly less readable fst . (!! 0) . readHex. According to the man page the addresses are separated by a single dash, so I’ve hardcoded that in there.
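The two conversions mentioned above can be put side by side. This is only a sketch: the names hexViaRead and hexViaReadHex are made up for illustration, and the second variant returns a Maybe instead of failing at runtime, which is a design choice of the sketch and not something the parser in this post does.

```haskell
import Numeric (readHex)

-- Conversion via read, as used in parseAddress: read understands a "0x"
-- prefix, but calls error if the string is not a valid number.
-- (hexViaRead is a hypothetical name.)
hexViaRead :: String -> Integer
hexViaRead = read . ("0x" ++)

-- Conversion via readHex. readHex returns a list of (value, rest) pairs,
-- so we accept the result only when the whole string was consumed.
-- (hexViaReadHex is a hypothetical name.)
hexViaReadHex :: String -> Maybe Integer
hexViaReadHex s = case readHex s of
    [(n, "")] -> Just n
    _         -> Nothing
```

Since many1 hexDigit only ever hands over hexadecimal digits, both variants behave the same inside the parser; the difference only shows up on malformed input such as "12g", where hexViaReadHex gives Nothing.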
Testing the function is fairly simple. Using ghci, first load the source file, then use parse:
*Main> parse parseAddress "" "0-1"
Right (Address {start = 0, end = 1})
*Main> parse parseAddress "hhh" "01234567-89abcdef"
Right (Address {start = 19088743, end = 2309737967})
Seems to work well enough. :-)
Next up, parsing the permissions. This is so very straightforward that I don’t think I need to comment on it:
parsePerms = let
        cA a = case a of
            'p' -> Private
            's' -> Shared
    in do
        r <- anyChar
        w <- anyChar
        x <- anyChar
        a <- anyChar
        return $ Perms (r == 'r') (w == 'w') (x == 'x') (cA a)
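The parser above accepts any character in each position and treats everything that is not the expected letter as "permission not set". A stricter variant could reject unexpected characters instead. The following is a sketch under assumptions: flag and strictPerms are hypothetical names, and the record fields are renamed (canRead and so on) only so that the block stands on its own without clashing with Prelude.read.

```haskell
import Text.ParserCombinators.Parsec

data Access = Shared | Private deriving (Show, Eq)

-- Field names differ from the post's Perms to keep this block self-contained.
data Perms = Perms
    { canRead    :: Bool
    , canWrite   :: Bool
    , canExecute :: Bool
    , access     :: Access
    } deriving (Show, Eq)

-- Accept the given flag character (True) or a dash (False); fail on anything else.
flag :: Char -> Parser Bool
flag c = (char c >> return True) <|> (char '-' >> return False)

-- Parse exactly four characters such as "r-xp", rejecting unexpected ones.
strictPerms :: Parser Perms
strictPerms = do
    r <- flag 'r'
    w <- flag 'w'
    x <- flag 'x'
    a <- (char 'p' >> return Private) <|> (char 's' >> return Shared)
    return $ Perms r w x a
```

With this version an input like "rwzp" produces a parse error pointing at the 'z', where the lenient parser would silently report the region as not executable.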
For parsing the device information I use the same strategy as for the address range above; this time, however, the separating character is a colon:
parseDevice = let
        hexStr2Int = Prelude.read . ("0x" ++)
    in do
        -- hexDigit rather than digit: the numbers are hexadecimal (e.g. 03:0c)
        maj <- many1 hexDigit
        char ':'
        min <- many1 hexDigit
        return $ Device (hexStr2Int maj) (hexStr2Int min)
Next is to tie it all together and create a MemRegion instance:
parseRegion = let
        hexStr2Int = Prelude.read . ("0x" ++)
        parsePath = (many1 $ char ' ') >> (many1 $ anyChar)
    in do
        addr <- parseAddress
        char ' '
        perm <- parsePerms
        char ' '
        offset <- many1 hexDigit
        char ' '
        dev <- parseDevice
        char ' '
        inode <- many1 digit
        char ' '
        path <- parsePath <|> string ""
        return $ MemRegion addr perm (hexStr2Int offset) dev (Prelude.read inode) path
The only little trick here is that there are lines that lack the pathname. Here’s an example from the man-page:
address perms offset dev inode pathname
08058000-0805b000 rwxp 00000000 00:00 0
It should be noted that there seems to be a space after the inode entry, so I keep a char ' ' in the main function. Then I try to parse the line for a path; if there is none, that attempt will fail immediately and instead I parse for an empty string, parsePath <|> string "". The pathname seems to be prefixed with a fixed number of spaces, but I’m lazy and just consume one or more. I’m not sure exactly what characters are allowed in the pathname itself, so I’m lazy once more and just gobble up whatever I find.
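One alternative way to handle the optional pathname is to skip zero or more spaces and then take whatever is left on the line, which yields an empty string when no path is present. This is a sketch, not the approach used above: optionalPath and inodeAndPath are hypothetical names, and the sketch assumes the space after the inode is folded into the path parser and that pathnames never start with a space.

```haskell
import Text.ParserCombinators.Parsec

-- Skip any number of spaces (including none), then take the rest of the line.
-- Returns "" for lines without a pathname, with or without trailing spaces.
optionalPath :: Parser String
optionalPath = do
    skipMany (char ' ')
    many (noneOf "\n")

-- Demonstration only: the last two fields of a maps line.
inodeAndPath :: Parser (String, String)
inodeAndPath = do
    inode <- many1 digit
    path <- optionalPath
    return (inode, path)
```

Because skipMany and many both succeed on empty input, no <|> fallback is needed, and a line ending in several spaces does not make the parse fail halfway through.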
To exercise what I had so far I decided to write a function that reads the maps file for a specific process, based on its pid, parses the contents and collects all the MemRegion instances in a list.
getMemRegions pid = let
        fp = "/proc" </> show pid </> "maps"
        doParseLine' = parse parseRegion "parseRegion"
        doParseLine l = case (doParseLine' l) of
            Left _ -> error "Failed to parse line"
            Right x -> x
    in do
        mapContent <- liftM lines $ readFile fp
        return $ map doParseLine mapContent
The only thing that really is going on here is that the lines are passed from inside an IO monad into the Parser monad and then back again. After this I can try it out by:
*Main> getMemRegions 1
This produces a lot of output, so while playing with it I limited the mapping to the first four lines by using take. The last line then becomes:
        return $ map doParseLine (take 4 mapContent)
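Calling error on the first bad line throws away the information about what went wrong. A possible variation keeps the parse errors and the successful results apart. The sketch below is an assumption-laden illustration: parseLines is a hypothetical helper, and rangeOnly is a stand-in line parser used only so the block is self-contained (in the program above one would pass parseRegion instead).

```haskell
import Data.Either (partitionEithers)
import Text.ParserCombinators.Parsec

-- Run the parser on every line of the input and split the outcome into
-- the parse errors and the successfully parsed values.
parseLines :: Parser a -> String -> ([ParseError], [a])
parseLines p = partitionEithers . map (parse p "maps") . lines

-- Stand-in parser for demonstration: an address range as two hex strings.
rangeOnly :: Parser (String, String)
rangeOnly = do
    s <- many1 hexDigit
    _ <- char '-'
    e <- many1 hexDigit
    return (s, e)
```

For example, parseLines rangeOnly "0-1\nbogus\n2-3" gives one error (for the middle line) and the two ranges ("0","1") and ("2","3"); each ParseError can be printed with show to see the position and the expected input.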
Now it’s easy to add a main that uses the first command line argument as the pid:
main = do
    -- The annotation fixes the type of pid, which nothing else here determines
    pid <- liftM (Prelude.read . (!! 0)) getArgs :: IO Integer
    regs <- getMemRegions pid
    mapM_ (putStrLn . show) regs
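The main above crashes with an unhelpful message when the argument is missing or not a number, since both (!! 0) and Prelude.read fail at runtime. A more forgiving way to interpret the arguments could look like the sketch below; pidFromArgs is a hypothetical helper and the error messages are made up for illustration.

```haskell
import Text.Read (readMaybe)

-- Interpret the argument list as exactly one non-negative pid,
-- returning a description of the problem otherwise.
pidFromArgs :: [String] -> Either String Integer
pidFromArgs [arg] = case readMaybe arg of
    Just n | n >= 0 -> Right n
    _               -> Left ("not a valid pid: " ++ arg)
pidFromArgs _ = Left "expected exactly one argument, the pid"
```

In main one would then apply pidFromArgs to the result of getArgs and print the Left message instead of calling getMemRegions.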
Well, that concludes my first adventure in parsing :-)
[Edit 27-05-2007 13:15]
I received an email asking for it so here are the import statements I ended up with:
import Control.Monad
import System.Environment -- provides getArgs (older GHC versions: import System)
import System.FilePath
import Text.ParserCombinators.Parsec