TagSoup, meet Parsec!
- Magnus Therning
Recently I began writing a tool to scrape some information off a web site for some off-line processing. After writing up the basics using TagSoup I showed what I had to a colleague. His first comment was “Can’t you use Parsec for that?” It took me a second to realise that he didn’t mean that I should write my own XML parser but rather that Parsec allows writing parsers of a list of anything. So I thought I’d see just what it’d take to create a parser for [Tag].
A look at the string parser shipped with Parsec offered a lot of inspiration.
First the basic type, TagParser:
type TagParser = GenParser Tag
The basic building block in Parsec is tokenPrim; it is what the other primitive parsers are built on. Taking a cue from the string parser implementation I defined a function called satisfy:
satisfy f = tokenPrim
    show
    (\ pos t _ -> updatePosTag pos t)
    (\ t -> if (f t) then Just t else Nothing)
The position in a list of tags is simply advanced by one column per tag, irrespective of what tag is processed:
updatePosTag s _ = incSourceColumn s 1
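To make the position handling concrete, here is a minimal sketch of satisfy and updatePosTag with explicit types. It assumes a recent TagSoup, where Tag takes a string type parameter (hence Tag String), and it hides Parsec's own character-level satisfy to avoid a name clash. The helper failureColumn and the value sampleTags are hypothetical names introduced only for this illustration.

```haskell
import Text.HTML.TagSoup (Tag (..), isTagText)
import Text.ParserCombinators.Parsec hiding (satisfy)

-- Assumption: Tag is parameterised over its string type in current TagSoup.
type TagParser st = GenParser (Tag String) st

-- Every tag advances the position by exactly one column.
updatePosTag :: SourcePos -> Tag String -> SourcePos
updatePosTag pos _ = incSourceColumn pos 1

-- Accept one tag if the predicate holds, fail without consuming otherwise.
satisfy :: (Tag String -> Bool) -> TagParser st (Tag String)
satisfy f =
  tokenPrim
    show
    (\pos t _ -> updatePosTag pos t)
    (\t -> if f t then Just t else Nothing)

-- Hypothetical helper: the column at which a parser fails, if it fails.
failureColumn :: TagParser () a -> [Tag String] -> Maybe Int
failureColumn p ts =
  either (Just . sourceColumn . errorPos) (const Nothing) (parse p "tags" ts)

-- Hypothetical sample input: two text tags followed by an opening tag.
sampleTags :: [Tag String]
sampleTags = [TagText "a", TagText "b", TagOpen "p" []]
```

With these definitions, failureColumn (count 2 (satisfy isTagText)) sampleTags should be Nothing, while asking for three text tags should give Just 3: the column in the error is simply the 1-based index of the offending tag.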
Now I have enough to create the first Tag parser, one that accepts a single instance of the specified kind:
tag t = satisfy (~== t) <?> show t
It’s important to stick the supplied tag on the right of (~==). See its documentation for why that is. The second parser is one that accepts any kind of tag:
anyTag = satisfy (const True)
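The reason the supplied tag belongs on the right of (~==) is that the operator is not symmetric: the left operand is the tag being tested and the right operand is a possibly incomplete pattern. A small sketch, assuming a recent TagSoup with Tag String; the names concrete, template and the three Boolean values are made up for this illustration.

```haskell
import Text.HTML.TagSoup (Tag (..), (~==))

-- A concrete tag, as it might come out of parseTags.
concrete :: Tag String
concrete = TagOpen "a" [("href", "http://example.org/")]

-- A pattern: an opening <a> tag with no attributes required.
template :: Tag String
template = TagOpen "a" []

-- Tag on the left, pattern on the right: extra attributes are ignored.
matchesTemplate :: Bool
matchesTemplate = concrete ~== template

-- Flipped: the "pattern" now demands an href the tested tag lacks.
flippedMatch :: Bool
flippedMatch = template ~== concrete

-- The pattern can also be given as a string.
matchesString :: Bool
matchesString = concrete ~== "<a>"
```

Here matchesTemplate and matchesString are True, while flippedMatch is False, which is why tag t is written as satisfy (~== t) rather than satisfy (t ~==).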
So far so good. The next parser to implement is one that accepts any kind of tag out of a list of tags. Here I want to make use of the convenient behaviour of (~==) so I’ll need to implement a custom version of elem:
l `elemTag` r = or $ l `elemT` r
    where
        l `elemT` [] = [False]
        l `elemT` (r:rs) = (l ~== r) : l `elemT` rs
With that in place it’s easy to implement oneOf and noneOf:
oneOf ts = satisfy (`elemTag` ts)
noneOf ts = satisfy (\ t -> not (t `elemTag` ts))
So, as an example of what this can be used for, here is a re-implementation of TagSoup’s partitions:
partitions t = liftM2 (:)
    (many $ noneOf [t])
    (many $ liftM2 (:) (tag t) (many $ noneOf [t]))
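For comparison with the library function, the following sketch runs the Parsec version next to TagSoup's own partitions. It assumes a recent TagSoup (Tag String) and hides Parsec's character parsers of the same names. To keep it short, noneOf uses any in place of elemTag, which should behave the same; runPartitions, sampleTags and h2 are hypothetical names for this illustration. One difference shows up: the parser above also returns the tags before the first match as its first element, whereas TagSoup's partitions drops them.

```haskell
import Control.Monad (liftM2)
import Text.HTML.TagSoup (Tag (..), parseTags, (~==))
import qualified Text.HTML.TagSoup as TS
import Text.ParserCombinators.Parsec hiding (noneOf, oneOf, satisfy)

-- Assumption: Tag is parameterised over its string type in current TagSoup.
type TagParser st = GenParser (Tag String) st

satisfy :: (Tag String -> Bool) -> TagParser st (Tag String)
satisfy f =
  tokenPrim
    show
    (\pos _ _ -> incSourceColumn pos 1)
    (\t -> if f t then Just t else Nothing)

tag :: Tag String -> TagParser st (Tag String)
tag t = satisfy (~== t) <?> show t

-- Stand-in for the elemTag-based version: no pattern in ts may match.
noneOf :: [Tag String] -> TagParser st (Tag String)
noneOf ts = satisfy (\t -> not (any (t ~==) ts))

partitions :: Tag String -> TagParser st [[Tag String]]
partitions t =
  liftM2 (:)
    (many $ noneOf [t])
    (many $ liftM2 (:) (tag t) (many $ noneOf [t]))

-- Hypothetical helper: run the partitions parser over a list of tags.
runPartitions :: Tag String -> [Tag String] -> Either ParseError [[Tag String]]
runPartitions t = parse (partitions t) "tags"

-- Hypothetical sample document and pattern.
sampleTags :: [Tag String]
sampleTags = parseTags "<p>intro</p><h2>A</h2><p>one</p><h2>B</h2><p>two</p>"

h2 :: Tag String
h2 = TagOpen "h2" []
```

On this input, runPartitions h2 sampleTags should succeed with three groups: the leading paragraph, then one group per h2. Dropping the first group should give the same list as TS.partitions (~== h2) sampleTags.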
Of course the big question is whether I’ll rewrite my original code using Parsec. Hmm, probably not in this case, but the next time I need to do some web page scraping it offers yet another option for doing it.