Posts tagged "parsing":
Adventures in parsing, part 4
I received a few comments on part 3 of this little mini-series and I just wanted
to address them. While doing this I still want the main functions of the parser
parseXxx
to read like the maps
file itself. That means I want to avoid
"reversing order" like thenChar
and thenSpace
did in part 2. I also don't
want to hide things, e.g. I don't want to introduce a function that turns (a <*
char ' ') <*> b
into a <#> b
.
So, first up is to do something about hexStr2Int <$> many1 hexDigit
which
appears all over the place. I made it appear in even more places by moving
around a few parentheses; the following two functions are the same:
foo = a <$> (b <* c) bar = (a <$> b) <* c
Then I scrapped hexStr2Int
completely and instead introduced hexStr
:
hexStr = Prelude.read . ("0x" ++) <$> many1 hexDigit
This means that parseAddress
can be rewritten to:
parseAddress = Address <$> hexStr <* char '-' <*> hexStr
Rather than, as Conal suggested, introduce an infix operation that addresses the
pattern (a <* char ' ') <*> b
I decided to do something about a <* char c
. I
feel Conal's suggestion, while shortening the code more than my solution, goes
against my wish to not hide things. This is the definition of <##>
:
(<##>) l r = l <* char r
After this I rewrote parseAddress
into:
parseAddress = Address <$> hexStr <##> '-' <*> hexStr
The pattern (== c) <$> anyChar
appears three times in parsePerms
so it got a
name and moved down into the where
clause. I also modified cA
to use pattern
matching. I haven't spent much time considering error handling in the parser, so
I didn't introduce a pattern matching everything else.
parsePerms = Perms <$> pP 'r' <*> pP 'w' <*> pP 'x' <*> (cA <$> anyChar) where pP c = (== c) <$> anyChar cA 'p' = Private cA 's' = Shared
The last change I did was remove a bunch of parentheses. I'm always a little hesitant removing parentheses and relying on precedence rules, I find I'm even more hesitant doing it when programming Haskell. Probably due to Haskell having a lot of infix operators that I'm unused to.
The rest of the parser now looks like this:
parseDevice = Device <$> hexStr <##> ':' <*> hexStr parseRegion = MemRegion <$> parseAddress <##> ' ' <*> parsePerms <##> ' ' <*> hexStr <##> ' ' <*> parseDevice <##> ' ' <*> (Prelude.read <$> many1 digit) <##> ' ' <*> (parsePath <|> string "") where parsePath = (many1 $ char ' ') *> (many1 anyChar)
I think these changes address most of the comments Conal and Twan made on the previous part. Where they don't I hope I've explained why I decided not to take their advice.
Comment by Jedaï:
That's really pretty ! Code you can read, but concise, Haskell is really good at that, though I need to look at how Applicative works its magic. :)
Good work !
Comment by Conal Elliot:
Magnus wrote
I also don’t want to hide things, e.g. I don’t want to introduce a function that turns
(a <* char ' ') <*> b
intoa <#> b
.
I'm puzzled about this comment. Aren't all of your definitions (as well as much of Parsec and other Haskell libraries) "hiding things"?
What appeals to me about a <#> b = (a <* char ' ') <*> b
(and similarly for,
say "a <:> b
", is that it captures the combination of a character separator
and <*>
-style application. As your example illustrates (and hadn't previously
occurred to me), this combination is very common.
Response to Conal:
Conal, you are right and I was unclear in what I meant. Basically I like the
idea of reading the parseXxx
functions and see the structure of the original
maps
file. At the moment I think that
parseAddress = Address <$> hexStr <##> '-' <*> hexStr
better reflects the structure of the maps
file than hiding away the separator
inside an operator. I also find it doesn't require me to carry a lot of "mental
baggage" when reading the code (I suspect this is the thing that's been
bothering me with the love of introducing operators that seems so prevalent
among Haskell developers, thanks for helping me put a finger on it). However,
your persistence might be paying off ;-) I'm warming to the idea. I just have to
come up with a scheme for naming operators that allows easy reading of the code.
Comment by Conal Elliott:
Oh! I'm finally getting what you've meant about "hiding things" vs "reflect[ing]
the structure of the maps
file". I think you want the separator characters to
show up in the parser, and between the sub-parsers that they separate.
Maybe what's missing in my <#>
suggestion is that the choice of the space
character as a separator is far from obvious, and I guess that's what you're
saying about "mental baggage" naming the operators for easy reading.
I suppose you could use sepSpace
and sepColon
as operator names.
parseAddress = Address <$> hexStr `sepColon` hexStr parseRegion = MemRegion <$> parseAddress `sepSpace` parsePerms `sepSpace` ...
Still, an actual space/colon character would probably be clearer. For colon, you
could use <:>
, but what for space?
Response to Conal:
Conal, that's exactly what I mean, just much more clearly expressed than I could ever hope to do.
I too was thinking of the problem with space
in an operator…
Adventures in parsing, part 3
I got a great many comments, at least by my standards, on my earlier two posts
on parsing in Haskell. Especially on the latest one. Conal posted a comment on
the first pointing me towards liftM
and its siblings, without telling me that
it would only be the first step towards "applicative style". So, here I go
again…
First off, importing Control.Applicative
. Apparently <|>
is defined in both
Applicative
and in Parsec
. I do use <|>
from Parsec
so preventing
importing it from Applicative
seemed like a good idea:
import Control.Applicative hiding ( (<|>) )
Second, Cale pointed out that I need to make an instance for
Control.Applicative.Applicative
for GenParser
. He was nice enough to point
out how to do that, leaving syntax the only thing I had to struggle with:
instance Applicative (GenParser c st) where pure = return (<*>) = ap
I decided to take baby-steps and I started with parseAddress
. Here's what it
used to look like:
parseAddress = let hexStr2Int = Prelude.read . ("0x" ++) in do start <- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit end <- liftM hexStr2Int $ many1 hexDigit return $ Address start end
On Twan's suggestion I rewrote it using where
rather than let ... in
and
since this was my first function I decided to go via the ap
function (at the
same time I broke out hexStr2Int
since it's used in so many places):
parseAddress = do start <- return hexStr2Int `ap` (thenChar '-' $ many1 hexDigit) end <- return hexStr2Int `ap` (many1 hexDigit) return $ Address start end
Then on to applying some functions from Applicative
:
parseAddress = Address start end where start = hexStr2Int <$> (thenChar '-' $ many1 hexDigit) end = hexStr2Int <$> (many1 hexDigit)
By now the use of thenChar
looks a little silly so I changed that part into
many1 hexDigit <* char '-'
instead. Finally I removed the where
part
altogether and use <*>
to string it all together:
parseAddress = Address <$> (hexStr2Int <$> many1 hexDigit <* char '-') <*> (hexStr2Int <$> (many1 hexDigit))
From here on I skipped the intermediate steps and went straight for the last form. Here's what I ended up with:
parsePerms = Perms <$> ( (== 'r') <$> anyChar) <*> ( (== 'w') <$> anyChar) <*> ( (== 'x') <$> anyChar) <*> (cA <$> anyChar) where cA a = case a of 'p' -> Private 's' -> Shared parseDevice = Device <$> (hexStr2Int <$> many1 hexDigit <* char ':') <*> (hexStr2Int <$> (many1 hexDigit)) parseRegion = MemRegion <$> (parseAddress <* char ' ') <*> (parsePerms <* char ' ') <*> (hexStr2Int <$> (many1 hexDigit <* char ' ')) <*> (parseDevice <* char ' ') <*> (Prelude.read <$> (many1 digit <* char ' ')) <*> (parsePath <|> string "") where parsePath = (many1 $ char ' ') *> (many1 anyChar)
I have to say I'm fairly pleased with this version of the parser. It reads about
as easy as the first version and there's none of the "reversing" that thenChar
introduced.
Comment by Conal Elliott:
A thing of beauty! I'm glad you stuck with it, Magnus.
Some much smaller points:
- The pattern
(== c) <$> anyChar
(nicely written, btw) arises three times, so it might merit a name. - Similarly for
hexStr2Int <$> many1 hexDigit
, especially when you rewritef <$> (a <* b)
to(f <$> a) <* b
. - The pattern
(a <* char ' ') <*> b
comes up a lot. How about naming it also, with a nice infix op, saya <#> b
? - The cA definition could use pattern matching instead (e.g.,
cA 'p' = Private
andcA 's' = Shared
). - Some of your parens are unnecessary (3rd line of
parseDevice
and last ofparseRegion
), since application binds more tightly than infix ops.
Comment by Twan van Laarhoven:
First of all, note that you don't need parentheses around parseSomething <*
char ' '
.
You can also simplify things a bit more by combining hexStr2Int <$> many1
hexDigit
into a function, then you could say:
parseHex = hexStr2Int <$> many1 hexDigit parseAddress = Address <$> parseHex <* char '-' <*> parseHex parseDevice = Device <$> parseHex <</em> char ':' <*> parseHex
Also, in cA
, should there be a case for character other than 'p' or 's'?
Otherwise the program could fail with a pattern match error.
Response to Conal and Twan:
Conal and Twan, thanks for your suggestions. I'll put them into practice and post the "final" result as soon as I find some time.
More adventures in parsing
I received an interesting comment from Conal Elliott on my previous post on parsing I have to admit I wasn't sure I understood him at first, I'm still not sure I do, but I think I have an idea of what he means :-)
Basically my code is very sequential in that I use the do
construct everywhere
in the parsing code. Personally I thought that makes the parser very easy to
read since the code very much mimics the structure of the maps
file. I do
realise the code isn't very "functional" though so I thought I'd take Conal's
comments to heart and see what the result would be.
Let's start with observation that every entity in a line is separated by a space. However some things are separated by other characters. So the first thing I did was write a higher-order function that first reads something, then reads a character and returns the first thing that was read:
thenChar c f = f >>= (\ r -> char c >> return r)
Since space is used as a separator so often I added a short-cut for that:
thenSpace = thenChar ' '
Then I put that to use on parseAddress
:
parseAddress = let hexStr2Int = Prelude.read . ("0x" ++) in do start <- thenChar '-' $ many1 hexDigit end <- many1 hexDigit return $ Address (hexStr2Int start) (hexStr2Int end)
Modifying the other parsing functions using =thenChar~ and thenSpace
is
straight forward.
I'm not entirely sure I understand what Conal meant with the part about liftM
in his comment. I suspect his referring to the fact that I first read characters
and then convert them in the "constructors". By using liftM
I can move the
conversion "up in the code". Here's parseAddress
after I've moved the calls to
hexStr2Int
:
parseAddress = let hexStr2Int = Prelude.read . ("0x" ++) in do start <- liftM hexStr2Int $ thenChar '-' $ many1 hexDigit end <- liftM hexStr2Int $ many1 hexDigit return $ Address start end
After modifying the other parsing functions in a similar way I ended up with this:
parsePerms = let cA a = case a of 'p' -> Private 's' -> Shared in do r <- liftM (== 'r') anyChar w <- liftM (== 'w') anyChar x <- liftM (== 'x') anyChar a <- liftM cA anyChar return $ Perms r w x a parseDevice = let hexStr2Int = Prelude.read . ("0x" ++) in do maj <- liftM hexStr2Int $ thenChar ':' $ many1 hexDigit min <- liftM hexStr2Int $ many1 hexDigit return $ Device maj min parseRegion = let hexStr2Int = Prelude.read . ("0x" ++) parsePath = (many1 $ char ' ') >> (many1 $ anyChar) in do addr <- thenSpace parseAddress perm <- thenSpace parsePerms offset <- liftM hexStr2Int $ thenSpace $ many1 hexDigit dev <- thenSpace parseDevice inode <- liftM Prelude.read $ thenSpace $ many1 digit path <- parsePath <|> string "" return $ MemRegion addr perm offset dev inode path
Is this code more "functional"? Is it easier to read? You'll have to be the judge of that…
Conal, if I got the intention of your comment completely wrong then feel free to tell me I'm an idiot ;-)
Comment by Holger:
Your story and Conal's comment inspired me to play around with liftM and I came up with this version of parseAddress:
parseAddress = let hexStr2Int = Prelude.read . ("0x" ++) in liftM3 (\ x _ y -> Address (hexStr2Int x) (hexStr2Int y)) (many1 hexDigit) (char '-') (many1 hexDigit)
You could probably rewrite all the functions in a similar way, but honestly I find your original code in do-notation much easier to read.
Response to Holger:
Yeah, the same line of thought made it a little difficult to fall asleep yesterday (yes, I know, nerdiness taken to new levels). My thoughts was something like this:
parseAddress = let hexStr2Int = ... in liftM2 Address (liftM hexStr2Int $ thenChar '-' $ many1 hexDigit) (liftM hexStr2Int $ many1 hexDigit)
I agree with you on readability. I also wonder if laziness could bite back or if `liftM2` guarantees an order of evaluation.
Comment by Holger:
I just looked at the source of liftM2/3 and it seems that it basically just resolves to an expression in do-notation. So it's just a shortcut and therefore should yield the same program.
Comment by Cale Gibbard:
liftM2
guarantees an ordering on the monadic computation, because it's defined like:
liftM2 f x y = do { u <- x; v <- y; return (f u v) }
Though, that's a little different from guaranteeing an order of evaluation – depending on the monad, the order of evaluation will vary. In any case, it shouldn't be much different from what you originally had.
Another thing you might like to play around with, at least in your head, is the fact that:
liftM2 f x y = return f `ap` x `ap` y liftM3 f x y z = return f `ap` x `ap` y `ap` z
and so on, which leads up to the style embodied by the Control.Applicative
library, though if you really want to play around with that, you'll need to
write a quick instance of Applicative for GenParser, which should just consist
of making pure = return
and (<*>) = ap
.
Comment by Twan van Laarhoven:
The best way to make parsing code readable is to use Data.Applicative. Also,
most people prefer where to let..in
if possible. This gives something like:
parseHexStr = Prelude.read . ("0x" ++) many1 hexDigit parsePath = many1 (char ' ') *> many1 anyChar parseAddress = Address hexStr *> char '-' hexStr parseRegion = MemRegion parseAddress *> char ' ' parsePerms *> char ' ' parseHexStr *> char ' ' parseDevice *> char ' ' (Prelude.read many1 digit) *> char ' ' (parsePath return "")
Basicly (liftM# f x y z)
can be written as f x <*> y <*> z
.
Comment by Nick:
I find the thenSpace
a bit difficult to read.I think something like this is
more natural, as it maintains the left-to-right relationship of the parsed data
and the following space:
aChar c r = char c >> return r aSpace = aChar ' ' ... do start <- many1 hexDigit >> aChar ....
note that I haven't tested this and my haskell-fu is not very strong, so I could be way off here.
Comment by Conal Elliot:
Yes, that's exactly the direction i had in mind. once you switch from "do" style to "liftM#" style, it's a simple step to replace the monad operators to applicative functor operators.
Response to Nick:
Nick, yes your way of writing it is easier to read. You'll need to change to
using >>=
though:
do start <- many1 hexDigit >>= aChar '-'
Then you can mix in `liftM` as well:
do start <- liftM hexStr2Int $ many1 hexDigit >>= aChar '-'
to do the conversion. However, I think my initial version is even clearer:
do start <- liftM hexStr2Int $ many1 hexDigitc char '-'
Response to Conal Elliot:
Conal, ok, you're one sneaky little b… ;-) I'll have to look at the applicative operators to see what I think of them.
Adventures in parsing
I've long wanted to dip my toes in the Parsec
water. I've made some attempts
before, but always stumbled on something that put me in the doldrums for so long
that I managed to repress all memories of ever having tried. A few files
scattered in my ~/devo/test/haskell
directory tells the story of my failed
attempts. Until now that is :-)
I picked a nice and regular task for my first real attempt: parsing
/proc/<pid>/maps
. First a look at the man-page offers a good description of
the format of a line:
address perms offset dev inode pathname 08048000-08056000 r-xp 00000000 03:0c 64593 /usr/sbin/gpm
So, I started putting together some datatypes. First off the address range:
data Address = Address { start :: Integer, end :: Integer } deriving Show
Then I decided that the 's'/'p' in the permissions should be called Access
:
data Access = Shared | Private deriving Show
The basic permissions (rwx
) are simply represented as booleans:
data Perms = Perms { read :: Bool, write :: Bool, executable :: Bool, access :: Access } deriving Show
The device is straightforward as well:
data Device = Device { major :: Integer, minor :: Integer } deriving Show
At last I tie it all together in a final datatype that represents a memory region:
data MemRegion = MemRegion { address :: Address, perms :: Perms, offset :: Integer, device :: Device, inode :: Integer, pathname :: String } deriving Show
All types derive Show
(and receive default implementations of show
, at least
when using GHC) so that they are easy to print.
Now, on to the actual "parsec-ing". Faced with the option of writing it top-down
or bottom-up I chose the latter. However, since the format of a single line in
the maps
file is so simple it's easy to imagine what the final function will
look like. I settled on bottom-up since the datatypes provide me with such an
obvious splitting of the line. First off, parsing the address range:
parseAddress = let hexStr2Int = Prelude.read . ("0x" ++) in do start <- many1 hexDigit char '-' end <- many1 hexDigit return $ Address (hexStr2Int start) (hexStr2Int end)
Since the addresses themselves are in hexadecimal and always are of at least
length 1 I use many1 hexDigit
to read them. I think it would be safe to assume
the addresses always are 8 characters (at least on a 32-bit machine) so it would
be possible to use count 8 hexDigit
but I haven't tried it. I've found two
ways of converting a string representation of a hexadecimal number into an
Integer
. Above I use the fact that Prelude.read
interprets a string
beginning with 0x
as a hexadecimal number. The other way I've found is the
slightly less readable fst . (!! 0) . readHex
. According to the man-page the
addresses are separated by a single dash so I've hardcoded that in there.
Testing the function is fairly simple. Using gchi
, first load the source file
then use parse
:
*Main> parse parseAddress "" "0-1" Right (Address {start = 0, end = 1}) *Main> parse parseAddress "hhh" "01234567-89abcdef" Right (Address {start = 19088743, end = 2309737967})
Seems to work well enough. :-)
Next up, parsing the permissions. This is so very straightforward that I don't think I need to comment on it:
parsePerms = let cA a = case a of 'p' -> Private 's' -> Shared in do r <- anyChar w <- anyChar x <- anyChar a <- anyChar return $ Perms (r == 'r') (w == 'w') (x == 'x') (cA a)
For parsing the device information I use the same strategy as for the address range above, this time however the separating charachter is a colon:
parseDevice = let hexStr2Int = Prelude.read . ("0x" ++) in do maj <- many1 digit char ':' min <- many1 digit return $ Device (hexStr2Int maj) (hexStr2Int min)
Next is to tie it all together and create a MemRegion instance:
parseRegion = let hexStr2Int = Prelude.read . ("0x" ++) parsePath = (many1 $ char ' ') >> (many1 $ anyChar) in do addr <- parseAddress char ' ' perm <- parsePerms char ' ' offset <- many1 hexDigit char ' ' dev <- parseDevice char ' ' inode <- many1 digit char ' ' path <- parsePath <|> string "" return $ MemRegion addr perm (hexStr2Int offset) dev (Prelude.read inode) path
The only little trick here is that there are lines that lack the pathname. Here's an example from the man-page:
address perms offset dev inode pathname 08058000-0805b000 rwxp 00000000 00:00 0
It should be noted that it seems there is a space after the inode entry so I
keep a char ' '
in the main function. Then I try to parse the line for a path,
if there is none that attempt will fail immediately and instead I parse for an
empty string, parsePath <|> string ""
. The pathname seems to be prefixed with
a fixed number of spaces, but I'm lazy and just consume one or more. I'm not
sure exactly what characters are allowed in the pathname itself so I'm lazy once
more and just gobble up whatever I find.
To exercise what I had so far I decided to write a function that reads the
maps
file for a specific process, based on its pid
, parses the contents and
collects all the MemRegion
instances in a list.
getMemRegions pid = let fp = "/proc" </> show pid </> "maps" doParseLine' = parse parseRegion "parseRegion" doParseLine l = case (doParseLine' l) of Left _ -> error "Failed to parse line" Right x -> x in do mapContent <- liftM lines $ readFile fp return $ map doParseLine mapContent
The only thing that really is going on here is that the lines are passed from inside an IO monad into the Parser monad and then back again. After this I can try it out by:
*Main> getMemRegions 1
This produces a lot of output so while playing with it I limited the mapping to
the four first lines by using take
. The last line then becomes:
return $ map doParseLine (take 4 mapContent)
Now it's easy to add a main
that uses the first command line argument as the
pid
:
main = do pid <- liftM (Prelude.read . (!! 0)) getArgs regs <- getMemRegions pid mapM_ (putStrLn . show) regs
Well, that concludes my first adventure in parsing :-)
Edit (27-05-2007): I received an email asking for it so here are the import statements I ended up with:
import Control.Monad import System import System.FilePath import Text.ParserCombinators.Parsec
Comment by Conal Elliot:
Congrats on your parser!
Here's an idea that for getting more functional/applicative formulations.
Replace all of the explicitly sequential (do
) parsec code with liftM
,
liftM2
, …. From a quick read-through, I think you can do it. For
parseRegion
, you could use an auxiliary function that discards a following
space, e.g. thenSpace (many1 digit)
.
Also, play with factoring out some of the repeated patterns in parseAddress
,
parsePerms
and parseDevice
.
The more I play with refactoring in my code, the more elegant it gets and the more insight I get. Have fun!
Response to Conal:
Good suggestion. At least if I understand what you mean :-)
Something along the lines of
thenChar c f = f >>= (\ r -> char c >> return r)
with a specialised one for spaces maybe
thenSpace = thenChar ' '
I suppose liftM
and friends can be employed to remove the function calling in
the creation of Address
, Device
and MemRegion
. I'll try to venture into
Parsec
territory once again soon and report on my findings. :-)