Kid's play with HTML in Haskell

In my ever-continuing attempts to replace Python with Haskell as my language of first choice, I’ve finally managed to dip a toe in the XML/HTML sea. I decided to use the Haskell XML Toolkit (HXT) even though it’s not packaged for Debian (something I might look into doing one day). HXT depends on tagsoup, which isn’t packaged for Debian either. Both packages install painlessly thanks to Cabal.

As the title suggests, my itch didn’t require anything complicated, but whenever I’ve previously looked at a Haskell XML library I’ve shied away. It all just looks so complicated. It turns out it looks worse than it is, and of course the documentation is poor when it comes to simple, concrete examples with adequate explanations. HXT would surely benefit from documentation at a level similar to what’s available for Parsec. I wish I were equipped to write it.

Anyway, this was my problem. I’ve found an interesting audio webcast. The team behind it has published around 90 episodes already and I’d like to listen to all of them. Unfortunately their RSS feed doesn’t include all episodes so I can’t simply use the trusted hpodder to get all episodes. After manually downloading about 20 of them I thought I’d better write some code to make it less labour-intensive. Here’s the complete script:

module Main where

import System.Environment
import Text.XML.HXT.Arrow

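-- True when the string ends in ".mp3": the first four characters of the
-- reversed string are compared with "3pm.".
isMp3Link = (==) "3pm." . take 4 . reverse

-- Read the source as HTML in ISO Latin-1, with warnings and error messages
-- switched off.
myReadDoc = readDocument [(a_parse_html, "1"), (a_encoding, isoLatin1),
    (a_issue_warnings, "0"), (a_issue_errors, "0")]

-- Run the arrow and print the href of every matching link, one per line.
myProc src = (runX $ myReadDoc src >>> deep selectMp3Links >>> getAttrValue "href")
    >>= mapM_ putStrLn

-- Keep <a> elements whose href attribute satisfies isMp3Link.
selectMp3Links = hasName "a" >>> hasAttrValue "href" isMp3Link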

main = do
    [src] <- getArgs
    myProc src
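
The reversed-string trick in isMp3Link is compact but a little cryptic. As a rough sketch, here is the same test next to two alternatives built on isSuffixOf from Data.List. The names isMp3LinkReversed, isMp3LinkSuffix and isMp3LinkAnyCase, and the example.com URLs, are made up for illustration and aren't part of the script. It only needs base, so it runs on its own with runhaskell:

```haskell
module Main where

import Data.Char (toLower)
import Data.List (isSuffixOf)

-- The test used in the script: reverse the string and compare its first
-- four characters with "3pm.".
isMp3LinkReversed :: String -> Bool
isMp3LinkReversed = (==) "3pm." . take 4 . reverse

-- Hypothetical alternative: ask directly whether ".mp3" is a suffix.
isMp3LinkSuffix :: String -> Bool
isMp3LinkSuffix = (".mp3" `isSuffixOf`)

-- Hypothetical variant that also accepts ".MP3", ".Mp3" and so on,
-- by lower-casing the string before the suffix check.
isMp3LinkAnyCase :: String -> Bool
isMp3LinkAnyCase = isMp3LinkSuffix . map toLower

main :: IO ()
main = mapM_ report
    [ "http://example.com/ep01.mp3"    -- made-up URLs, for illustration only
    , "http://example.com/EP02.MP3"
    , "http://example.com/index.html"
    ]
  where
    -- Print each URL followed by the results of the three predicates.
    report url = putStrLn $ url ++ ": " ++
        show (isMp3LinkReversed url, isMp3LinkSuffix url, isMp3LinkAnyCase url)
```

Any of the three should work as the predicate passed to hasAttrValue, since each is a plain String -> Bool function. Only the last one would pick up links whose extension is written in upper case.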

The thing that took by far the most time was finding out that hasAttrValue exists. I’m currently downloading episodes using the following command line:

curl -L $(for h in $(runhaskell get_mp3links.hs ''); do \
    echo '-O' $h; done)

Yet another set of itches where Haskell has displaced Python as the utensil used for scratching. :-)

Neil Mitchell

Instead of using HXT, you can do this directly in TagSoup:

import System.FilePath
import System.Environment
import Text.HTML.TagSoup

main = do
   [src] <- getArgs
   txt <- readFile src
   mapM_ putStrLn [mp3 | TagOpen "a" atts <- parseTags txt, ("href",mp3) <- atts, takeExtension mp3 == ".mp3"]

I’ve also used System.FilePath to check whether the extension is mp3.
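
A small sketch of how takeExtension behaves on link strings may help when deciding between it and a plain suffix check. It assumes the filepath package (which provides System.FilePath) is available. The helper name hasMp3Extension and the example.com URLs are invented for illustration:

```haskell
module Main where

import System.FilePath (takeExtension)

-- Hypothetical helper wrapping the check used in the list comprehension above.
hasMp3Extension :: String -> Bool
hasMp3Extension link = takeExtension link == ".mp3"

main :: IO ()
main = mapM_ report
    [ "http://example.com/ep01.mp3"        -- made-up URLs, for illustration only
    , "http://example.com/EP02.MP3"        -- upper-case extension
    , "http://example.com/ep03.mp3?dl=1"   -- query string after the extension
    ]
  where
    -- Show the extension takeExtension extracts and whether it equals ".mp3".
    report link = putStrLn $ link ++ ": " ++
        show (takeExtension link, hasMp3Extension link)
```

takeExtension treats its argument as a file path, not as a URL, so everything after the last dot of the final path component counts as the extension. That means an upper-case extension or a trailing query string makes the comparison with ".mp3" fail.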



Nice. I’ve never looked at tagsoup so it’s nice to see some code using it. One nice thing with HXT is that it handles both URLs and files without any extra work by me. Though I think your code above is a bit more “intuitive”, at least for someone like me who doesn’t have a lot of previous exposure to arrows.


All this and a bonus: the code upgrades easily to retrieve broadcasts made in the late afternoon!

isMp3Link = (==) "3pm." . take 4 . reverse

Just take away that reverse … .

(Sorry, April 1st for me, couldn’t resist the “3pm” part :) ).
