Fresh & New(er)

discussion of issues around digital media and museums by Seb Chan


Powerhouse releases a Python HTML Sanitiser for developers to use (BSD license)

August 21st, 2008 by Seb Chan

As you’ve heard, we’ve been working on a whole lot of new projects. And with new projects comes new code. I can’t say a lot more about these projects right now, but we’ve been using Python and the Django framework to develop them. So here’s the first of the spinoff products that we’re putting out under a BSD license for everyone to benefit from.

Over to Dan MacKinlay, one of our Python gurus, to tell you all about the HTML Sanitiser and why it matters.

“So the idea with the Python HTML Sanitizer is that we are consuming data from a wide variety of client websites, and we need to get their HTML data in a form that’s useful to us. This means:

1) standards-compliant XHTML,
2) bereft of formatting quirks which break our site, and
3) free from cross-site scripting exploits and other browser bugs that can compromise user security.
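To make those three requirements concrete, here is a minimal whitelist-sanitiser sketch using only Python’s standard-library html.parser. The released module is built on BeautifulSoup and is considerably more capable; the tag and attribute whitelists below are illustrative assumptions, not the module’s actual defaults.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative whitelists - the real module's rules are configurable.
ALLOWED_TAGS = {"p", "a", "em", "strong", "cite", "dfn", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href", "title"}}
DROPPED_CONTENT = {"script", "style"}  # drop these tags AND their contents

class WhitelistSanitiser(HTMLParser):
    """Rebuilds HTML, keeping only whitelisted tags and attributes."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._skip = 0  # depth inside dropped-content tags

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_CONTENT:
            self._skip += 1
            return
        if self._skip or tag not in ALLOWED_TAGS:
            return  # drop unknown tags but keep their text content
        kept = [
            (k, v) for k, v in attrs
            if k in ALLOWED_ATTRS.get(tag, set())
            and not (v or "").strip().lower().startswith("javascript:")
        ]
        attr_text = "".join(f' {k}="{escape(v or "", quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in DROPPED_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif not self._skip and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data))  # re-escape so entities stay inert

def sanitise(html):
    parser = WhitelistSanitiser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Unknown tags are unwrapped rather than deleted wholesale, script and style bodies are discarded entirely, and event-handler attributes or javascript: URLs never make it through the attribute whitelist.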

Normally, you can sidestep the HTML sanitization process by writing your own content, or using a special markup language (say, Markdown) – but when you are consuming HTML from clients’ websites this is not an option. They simply aren’t written in Markdown.

Stripping out ALL HTML tags would be another common option. That’s not reasonable for us, however, since we are supposed to be extracting rich information from our clients’ sites, and some of it is genuinely useful and semantic – links, citations and definitions – things we don’t want to filter out, or punish them for using.

Rather, we’d probably like to reward them by keeping that markup and indexing on it.
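The “keep and index” side can be sketched with the same standard-library parser – here collecting link targets and their anchor text for later indexing (the class and function names are hypothetical, not part of the released module):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs - the kind of semantic
    markup worth indexing rather than stripping."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> we are inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```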

By the same token, many clients use old markup (think HTML 3), invalid or badly-formed markup, or simply types of markup which are inconvenient for us to display (br – or even td – tags instead of p). Moreover, when a site is old enough to have such ancient markup in it, it’s reasonable to think that other kinds of maintenance have lapsed too – such as security maintenance.
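The br-instead-of-p problem can be illustrated with a small normalisation sketch. This is a deliberately simplified regex approach that assumes flat markup with no nested block elements; the real module does this kind of regularisation on a parse tree instead.

```python
import re

def brs_to_paragraphs(html):
    """Turn text blocks separated by runs of <br> tags into <p> elements.
    Simplified sketch: assumes flat markup, no nested block elements."""
    chunks = re.split(r"(?:<br\s*/?>\s*)+", html)
    return "".join(f"<p>{chunk.strip()}</p>" for chunk in chunks if chunk.strip())
```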

We can’t blithely assume that every client site is free from malicious JavaScript or whatever – that’s a one-way ticket to weakest-link security hell. Already we’ve noticed that two partner sites have been hacked in the course of the project so far (these days we’d assume that a fair proportion of traffic to most dynamic websites is malicious).

Solution – the HTML Sanitiser.

This is a flexible, adaptable HTML sanitising module (both parsing and cleaning) that can be tweaked to let through rich markup from good client sites, and salvage what it can from bad client sites. This is the approach taken by the likes of PHP5’s HTML Purifier and Ruby’s HTML::Sanitizer, but since our scraping code is in Python, we’ve had to build our own, leveraging the power of the awesome BeautifulSoup HTML parser.
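One way to picture the “tweakable per site” design is a pipeline of small cleaning rules composed per client. The rule functions and the composition helper below are illustrative assumptions, not the module’s actual API:

```python
import re

def strip_font_tags(html):
    """Remove presentational <font> tags, keeping their contents."""
    return re.sub(r"</?font[^>]*>", "", html, flags=re.I)

def collapse_whitespace(html):
    """Collapse runs of spaces and tabs into a single space."""
    return re.sub(r"[ \t]+", " ", html)

def make_cleaner(*rules):
    """Compose per-site cleaning rules into one callable, so each
    idiosyncratic partner site can get its own tweaked pipeline."""
    def clean(html):
        for rule in rules:
            html = rule(html)
        return html
    return clean

# A hypothetical strict pipeline for one partner site:
site_a_clean = make_cleaner(strip_font_tags, collapse_whitespace)
```

The appeal of this shape is that a bad client site gets extra salvage rules bolted on without touching the pipelines of the well-behaved ones.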

Since a lot of people need to solve similar problems to this, and many eyes make for more secure code, we’ve open-sourced it.

Go and download it, make changes and update the codebase.”

2 Comments

  • http://www.ideum.com James Kassemi

    Great contribution! Congrats on tackling a very prickly problem.

    Have you seen the _sanitizeHTML routine in Mark Pilgrim’s feedparser.py (http://www.feedparser.org/)? I know it can pre-filter through tidy – does BeautifulSoup perform the same function in this module?

  • dan mackinlay

    @James Kassemi – yes, I’ve seen Mark Pilgrim’s feedparser, and it does indeed rock. However, it’s not in itself sufficient for what we need to do – for screen scraping, once you have cleaned your HTML, you are likely to want to do further processing on it – say, converting relative links to be absolute, or enforcing <p> tags instead of <br /> to denote paragraphs. Thus, it suits us to keep the document parse tree around and perform the various cleaning operations on that, rather than parsing and re-parsing for each one, and to provide an interface that makes this easy. (We’re currently maintaining half a dozen different sets of parse rules for as many, er, idiosyncratic partner websites.)
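[The relative-to-absolute link rewrite Dan mentions can be sketched with the standard library’s urljoin. This is a regex sketch for illustration only; as he says, the module itself performs such operations on the parse tree.]

```python
import re
from urllib.parse import urljoin

def absolutise_links(html, base_url):
    """Rewrite relative href values against a base URL.
    Regex sketch; the real module works on a parse tree instead."""
    def repl(match):
        return f'href="{urljoin(base_url, match.group(1))}"'
    return re.sub(r'href="([^"]*)"', repl, html)
```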

    Right now the version on Launchpad uses BeautifulSoup for regularising the HTML, as you picked, but we’re testing a version that adds the option to use html5lib to do the pre-filtering-and-parsing – no reason that we couldn’t also add an option to use libtidy…

    (aside: html5lib uses a whitelist that claims descendance from Universal Feed Parser, with extensions for MathML and SVG – it’s not such a large world in open source)