Kid's play with HTML in Haskell
- Magnus Therning
In my ever-continuing attempts to replace Python by Haskell as my language of first choice I’ve finally managed to dip a toe in the XML/HTML sea. I decided to use the Haskell XML Toolkit (HXT) even though it’s not packaged for Debian (something I might look into doing one day). HXT depends on tagsoup which also isn’t packaged for Debian. Both packages install painlessly thanks to Cabal.
As the title suggests my itch wouldn’t require anything complicated, but when I’ve previously have looked at any Haskell XML library I’ve always shied away. It all just looks so complicated. It turns out it looks worse than it is, and of course the documentation is poor when it comes to simple, concrete examples with adequate explanations. HXT would surely benefit from documentation at a level similar to what’s available for Parsec. I whish I were equipped to write it.
Anyway, this was my problem. I’ve found an interesting audio webcast. The team behind it has published around 90 episodes already and I’d like to listen to all of them. Unfortunately their RSS feed doesn’t include all episodes so I can’t simply use the trusted hpodder to get all episodes. After manually downloading about 20 of them I thought I’d better write some code to make it less labour-intensive. Here’s the complete script:
module Main where
import System.Environment
import Text.XML.HXT.Arrow
isMp3Link = (==) "3pm." . take 4 . reverse
myReadDoc = readDocument [(a_parse_html, "1"), (a_encoding, isoLatin1),
(a_issue_warnings, "0"), (a_issue_errors, "0")]
myProc src = (runX $ myReadDoc src >>> deep selectMp3Links >>> getAttrValue "href")
>>= mapM_ putStrLn
selectMp3Links = hasName "a" >>> hasAttrValue "href" isMp3Link
main = do
[src] <- getArgs
myProc src
The thing that took by far the most time was finding out that hasAttrValue
exists. I’m currently downloading episodes using the following command line:
curl -L $(for h in $(runhaskell get_mp3links.hs 'http://url.of.page/'); do \
echo '-O' $h; done)
Yet another set of itches where Haskell has displaced Python as the utensil used for scratching. :-)