Fresh & New(er)

discussion of issues around digital media and museums by Seb Chan

Introducing About NSW – maps, census visualisations, cross search

September 2nd, 2009 by Seb Chan

Well here’s an alpha release of something that we’ve been working on forever (well, almost 2 years). It is called About NSW and is a bit of a Frankenstein creation of different data sets mashed together by a sophisticated backend. The project began with an open-ended brief to be a cross-sectorial experiment in producing new interfaces for existing content online. In short, we were given permission to play in the sandbox and with that terrain comes a process of trial and error, learning and revision.

We’ve had an overwhelming number of feature requests and unfortunately have not been able to accommodate them all, but this does give us an indication of the need to work on solutions to common problems such as –

  • “can we handle electoral boundaries and view particular datasets by suburb postcodes?”
  • “can we aggregate upcoming cultural events?”
  • “can we resolve historical place names on a contemporary Google Map?”

to name just a few.

There are three active voices in this blog post: my own, accompanied by those of Dan MacKinlay (developer) and Renae Mason (producer). Dan reads a lot of overly fat economics and social theory books when not coding and travelling to Indonesia to play odd music in rice paddies; while Renae reads historical fiction, magical realism and design books when not producing and is about to go tango in Buenos Aires – hola!

We figured this blog post might be a warts and all look at the project to date. So grab a nice cup of herbal tea and sit back. There are connections here to heavyweight developer stuff as well as more philosophical and practical issues relevant to Government 2.0 discussion.

So what exactly is About NSW?

Firstly it is the start of an encyclopaedia.

Our brief was never to create original content but to organise what already existed across a range of existing cultural institution websites. There’s some original content in there, but that probably is not exciting in itself.

While projects like the wonderful Encyclopaedia of New Zealand, ‘Te Ara’ are fantastic, they cost rather more than our humble budget. Knowing up front that we had scant resources for producing ‘new’ content, we have tried to build a contextual discovery service that assists in exposing existing content online. We aimed to form partnerships with content providers to maximise the use of all those fact sheets, images and other information that is already circulating on the web. We figured, why duplicate efforts? In this way, we would like to grow About NSW as a trustworthy channel that delivers cultural materials to new audiences, sharing traffic and statistics with our partners along the way. That said, there’s actually a whole lot of exciting material lurking deep in the original content of the encyclopaedia, including a slew of digitised almanacs that we are yet to expose.

We’re particularly excited to be able to bring together and automatically connect content from various sources that otherwise wouldn’t be “getting out there”. There are a lot of websites that go around and scrape other sites for content – but really getting in there and making good use of their data under a (reasonably) unrestrictive license is facilitated by having the request come from inside government. It’s not all plain sailing, mind – if you look through our site you’ll see that a few partners were afraid to display the full content of their articles and have asked that they be locked down.

But, because we work in aggregate, we can enrich source data with correlated material. A simple lucid article about a cultural figure can provide a nice centrepiece for an automatically generated mashup of related resources about said figure. We could go a lot further there in integrating third party content, pulling in material from sources like Wikipedia and Freebase rather than going through the tedious process of creating our own articles. (We certainly never intended to go into the encyclopaedia business!)

Secondly, the site is an explorer of the 2006 Australian Census data. As you might know, the Australian Bureau of Statistics does a rather excellent job of releasing statistical data under a Creative Commons license. What we have done is take this data and build a simple(r) way of navigating it by suburbs. We have also built a dynamic ‘choropleth’ map that allows easy visualising of patterns in a single data set. You can pin suburbs for comparison, and look for patterns across the State (with extra special bells and whistles built for that by some folks from the Interaction Consortium who worked on the team).
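
As a rough illustration of what sits behind a choropleth, the sketch below bins a per-suburb value into a small number of shading classes using quantile cut points. The suburb names and figures are made up, and the function names and the choice of quantile classing are assumptions for illustration rather than a description of how About NSW does it.

```python
from bisect import bisect_right


def quantile_breaks(values, classes):
    """Return ascending cut points that split values into roughly equal-sized classes."""
    ordered = sorted(values)
    # One cut point between each pair of adjacent classes.
    return [ordered[(len(ordered) * i) // classes] for i in range(1, classes)]


def classify(value, breaks):
    """Return the class index (0 = lowest) for a value, given ascending cut points."""
    return bisect_right(breaks, value)


# Hypothetical figures (not real census data): median age by suburb.
median_age = {
    "Suburb A": 29,
    "Suburb B": 34,
    "Suburb C": 41,
    "Suburb D": 38,
    "Suburb E": 52,
    "Suburb F": 31,
}

breaks = quantile_breaks(median_age.values(), classes=3)
# Each suburb gets a class index that a map layer could turn into a fill colour.
shading = {suburb: classify(age, breaks) for suburb, age in median_age.items()}
```

Quantile classes keep each shade roughly equally populated, which suits skewed data; equal-width intervals would be a reasonable alternative when the absolute values matter more than the ranking.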

Third, we’ve started the long and arduous process of using the same map tools to build a cultural collections navigator that allows the discovery of records by suburb. This remains the most interesting part of the site but also the one most fraught with difficulties. For a start, very few institutions have well geo-located collections – certainly not with any consistency of precision. We have tried some tricky correlations to try to maximise the mappable content but things haven’t (yet) turned out the way we want them to.

But, considering the huge data sets we are dealing with we reckon we’ve done pretty well given the data quality issues and the problem of historical places not being able to be reverse geocoded automatically.

Fourth, not much of this would be much chop if we weren’t also trying to build a way of getting the data out in a usable form for others to work with. That isn’t available yet, mainly because of the thicket of issues around rights and the continuing difficulty in convincing contributors that views of their content on our site can be as valuable as, and when connected to other material potentially more valuable than, views on their individual siloed sites.

Where is the data from?

About NSW has afforded a unique opportunity to work with other organisations that we don’t usually come into contact with and we’ve found that generosity and a willingness to share resources for the benefit of citizens is alive and well in many of our institutions. For example, we approached the NSW Film & Television Office with a dilemma – most of the images that we can source from the libraries and archives are circa 1900, which is fantastic if you want to see what Sydney looked like back then, but not so great if you want to get a sense of what Sydney looks like today. They kindly came to the party with hundreds of high quality, contemporary images from their locations database which isn’t public facing but currently serves a key business role in attracting film and television productions to NSW.

Continuing along with our obsession for location specific data, we also approached the NSW Heritage Branch who completely dumbfounded us by providing us with not just some of their information on heritage places but the entire NSW State Heritage Register. The same gratitude is extended to the Art Gallery of NSW who filled in a huge gap on the collection map with their collection objects so now audiences can, for the first time, see what places our most beloved artworks are associated with (and sometimes, the wonderful links with heritage places – consider the relationship with the old gold-mining heritage town of Hill End and an on-going Artist in Residency program that is hosted there and has attracted artists such as Russell Drysdale and Brett Whiteley). With our places well and truly starting to take shape we decided to add in demographic data with the most recent census from the Australian Bureau of Statistics, who noted that their core role in providing raw data leaves them little time for the presentation layer, so they were delighted that we were interested in visualising their work.

Besides our focus on places, we are pretty keen on exploring more about the people who show up in our collection records and history books. To this end, the Australian Dictionary of Biography has allowed us to display extracts of all their articles that relate to people associated with NSW.

As a slight off-shoot to this project, we even worked with NSW Births Deaths and Marriages Registry to build the 100 Years of Baby Names that lives on the central NSW Government site, but that’s a different story that’s already been blogged about here.

There are of course many other sources we’d like to explore in the future but for now we’ve opted for the low-hanging fruit and would like to thank our early collaborators for taking the leap of faith required and trusting us to re-publish content in a respectful manner.

There are many things we need to improve but what a great opportunity it has been to work on solving some of our common policy and legacy technology problems together.

Cultural challenges

Unfortunately, despite the rosy picture we are beginning to paint here, the other side is that collecting institutions are not accustomed to working across silos, nor are they well resourced to play in other domains.

Comments like “This isn’t our core business!” and “Sounds great but we don’t have time for this!” have been very common. Others have been downright resistant to the idea altogether. The latter types prefer to operate a gated-estate that charges for entrance to all the wonders inside – the most explicit being “We don’t think you should be putting that kind of information on your site because everyone should come to us!”.

But we wonder, what’s more important – expert pruning? Or a communal garden that everyone can take pride in and that improves over time?

To be fair, these are confronting ideas for institutions that have always been measured by their ‘authoritativeness’ and by the sheer numbers that can be attracted through their gates, and not the numbers who access their expertise.

Unsurprisingly these are the exact same issues being tackled by the Government 2.0 Taskforce.

It’s an unfortunately constructed competitive business and the worth of institutions online is still being measured on the basis of how many people interact with their content on their website. Once those interactions begin to take place elsewhere it becomes more difficult to report despite the fact that it is adding value – after all, how do you quantify that?

We’ve done some nifty initial work with the Google Analytics API to try to simplify data reporting back to contributors, but the resistance is more a philosophical objection than anything.

Add to that copyright and privacy and you have a recipe for trouble.

Munging data

Did we already say that this project hasn’t been without its problems?

The simplest summary is: web development in government has generally had little respect for Tim Berners-Lee’s design principle of least power.

While sites abound with complicated Java mapping widgets, visually lush table-based designs and so on, there is almost no investment in pairing that with simple and convenient access to the underlying data in a direct, simple, machine-readable way. This is particularly poignant for projects that have started out with high ideals, but have lost funding; all the meticulous work they have expended in creating their rich designs can go to waste if the site design only works in Netscape Navigator 4.

Making simple data sets available is timeless insurance against the shifting ephemeral fads of browser standards, and this season’s latest widget technology, but it’s something few have time for. That line of reasoning is particularly important for our own experimental pilot project. We have been lucky, unlike some of our partners, in that we have designed our site from the ground up to support easy data export. (You might well ask, though, if we can’t turn that functionality on for legal reasons, have we really made any real progress?)

As everyone knows, pulling together datasets from different places is just a world of pain. But it is a problem that needs to be solved for any of the future things all of us want to do to get anywhere. Whilst we could insist on standards, what we wanted to experiment with here was how far we could get without mandating standards – because in the real world, especially with government data, a lot of data is collected for a single purpose and not considered for sharing and cross-mixing.

We’d love plain structured data in a recognised format (RDF, OAI-PMH, ad hoc JSON over REST, KML – even undocumented XML with sensibly named elements will do), but it isn’t generally available. Instead, what we usually get are poorly marked up HTML sites, or databases full of inconsistent character encodings, that need to be scraped – or even data that we need to stitch together from several different sources to re-assemble the record in our partner’s database because their system won’t let them export it in one chunk. Elsewhere we’ve had nice Dublin Core available over OAI, but even once all the data is in, getting it to play nicely together is tricky, and parsing Dublin Core’s free-text fields has still been problematic.
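
To give a flavour of the Dublin Core case, here is a minimal sketch using only the Python standard library. It reads a single record in the oai_dc shape and then tries to split a free-text subject field into separate values. The sample record is invented, and the helper names and the assumption that subjects are separated by semicolons are ours for illustration; real feeds are rarely this tidy.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 simple Dublin Core elements, in ElementTree's {uri}tag form.
DC = "{http://purl.org/dc/elements/1.1/}"

# A made-up record in the oai_dc shape; real feeds vary a great deal.
SAMPLE = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example photograph</dc:title>
  <dc:coverage>Hill End, NSW</dc:coverage>
  <dc:date>circa 1900</dc:date>
  <dc:subject>gold mining; townships</dc:subject>
</oai_dc:dc>
"""


def parse_dc(xml_text):
    """Collect Dublin Core elements into a dict of lists, since elements may repeat."""
    record = {}
    for element in ET.fromstring(xml_text.strip()):
        if element.tag.startswith(DC):
            name = element.tag[len(DC):]
            record.setdefault(name, []).append((element.text or "").strip())
    return record


def split_subjects(record):
    """Split packed free-text subjects on ';' (an assumed convention, not a standard)."""
    subjects = []
    for raw in record.get("subject", []):
        subjects.extend(part.strip() for part in raw.split(";") if part.strip())
    return subjects


record = parse_dc(SAMPLE)
subjects = split_subjects(record)
```

Getting the elements out is the easy part. The date value “circa 1900” and the place string in the coverage field still need their own interpretation before they can be sorted or mapped, which is where most of the effort goes.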

In our standards-free world, there is also the problem of talking back.

Often we’re faced with the dilemma that we believe that we have in some way value-added to the data we have been given – but we have no way of easily communicating that back to its source.

Maybe we’ve found inconsistencies and errors in the data we have imported, or given “blobs” of data more structure, or our proofreaders have picked up some spelling mistakes. We can’t automatically encode our data back into the various crazy formats it comes in (well, that would be twice as much work!), and should we even invest the time in that if there is no agreed way of communicating suggested changes? Or what if the partner in question has lost funding and doesn’t have time to incorporate updates no matter how we provide them?

This is a tricky problem without an easy solution.
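
Purely as a thought experiment, and not something the site does today, suggested changes could be expressed as a plain field-by-field list that a partner could review by hand, whatever format their own system uses. The record contents and the function name below are hypothetical.

```python
def suggest_changes(original, cleaned):
    """List field-level differences between a source record and our cleaned copy."""
    changes = []
    for field in sorted(set(original) | set(cleaned)):
        before = original.get(field)
        after = cleaned.get(field)
        if before != after:
            changes.append({"field": field, "was": before, "suggested": after})
    return changes


# Hypothetical record as imported, and the same record after cleanup.
original = {"title": "Hill End  post office", "place": "Hill End"}
cleaned = {"title": "Hill End post office", "place": "Hill End", "state": "NSW"}

# Unchanged fields are left out; added fields show up with "was" set to None.
changes = suggest_changes(original, cleaned)
```

A list like this sidesteps the format problem but not the organisational one: someone at the other end still has to have the time, and the inclination, to act on it.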

What does it run on?

Behind the scenes the site is built pretty much with open source choices. It was built in Python using the Django framework, and PostgreSQL’s geographic extension PostGIS (the combination known as GeoDjango).

For the interactive mapping it uses Modest Maps – which allows us to change between tile providers as needed – plus a whole bunch of custom file-system based tile-metadata service code, and everything is pretty modular and re-purposable.

Since we have data coming from lots of different providers with very different sets of fields, we store data internally in a general format which can handle arbitrary data – the EAV pattern – although we get more mileage out of our version because of Django’s sophisticated support for data model subclassing.
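
For readers who haven’t met the EAV (entity-attribute-value) pattern, here is a minimal sketch of the idea. It uses the standard library’s sqlite3 rather than Django models so that it runs on its own; the table names, helper functions and sample records are all assumptions for illustration, not our actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, source TEXT NOT NULL);
    CREATE TABLE attribute_value (
        entity_id INTEGER NOT NULL REFERENCES entity(id),
        attribute TEXT NOT NULL,
        value TEXT NOT NULL
    );
""")


def save_record(source, fields):
    """Store one provider record; each field becomes its own attribute/value row."""
    cursor = conn.execute("INSERT INTO entity (source) VALUES (?)", (source,))
    entity_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO attribute_value (entity_id, attribute, value) VALUES (?, ?, ?)",
        [(entity_id, name, str(value)) for name, value in fields.items()],
    )
    return entity_id


def load_record(entity_id):
    """Reassemble the attribute/value rows for an entity into a dict."""
    rows = conn.execute(
        "SELECT attribute, value FROM attribute_value WHERE entity_id = ?",
        (entity_id,),
    )
    return dict(rows.fetchall())


# Two hypothetical providers with completely different fields share the same tables.
heritage_id = save_record("heritage", {"name": "Example cottage", "listing_type": "State"})
artwork_id = save_record(
    "gallery", {"title": "Example landscape", "medium": "oil on canvas", "year": 1948}
)
```

The trade-off is visible even at this size: new providers need no schema changes, but every value comes back as text and anything resembling a typed query has to be rebuilt on top.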

We have also used Reuters’ Open Calais to cross-map and relate articles initially whilst a bunch of geocoders go to work making sense of some pretty rough geo-data.

We use both the State Government supplied geocoder from the New South Wales government’s Spatial Information Exchange, and Google’s geocoder to fill the gaps.
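
The “fill the gaps” arrangement amounts to trying geocoders in a fixed order and keeping the first answer. A minimal sketch of that ordering is below; the lookup functions are stand-ins backed by dictionaries, with placeholder place names and coordinates, and make no attempt to reproduce either real service’s API.

```python
def geocode_with_fallback(place_name, geocoders):
    """Try each geocoder in order and return the first (lat, lon, source) found."""
    for source, lookup in geocoders:
        result = lookup(place_name)
        if result is not None:
            lat, lon = result
            return lat, lon, source
    return None


# Stand-in lookups; the coordinates are placeholders, not real data.
def state_lookup(name):
    return {"Place A": (-33.0, 151.0)}.get(name)


def commercial_lookup(name):
    return {"Place A": (-33.1, 151.1), "Place B": (-34.0, 150.0)}.get(name)


# Order matters: the preferred source goes first, the gap-filler second.
GEOCODERS = [("state", state_lookup), ("commercial", commercial_lookup)]
```

Keeping the source name alongside each coordinate makes it possible to re-check or replace the gap-filled points later if the preferred geocoder improves.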

And we use Google Analytics, plus the Google Analytics Data Export API, to be able to deliver contributor-specific usage data.

We use an extensive list of open-source libraries to make all this work, many of which we have committed patches to along the way.

We do our data parsing with

  • phpserialize for Python for rolling quick APIs with our PHP-using friends
  • PyPdf for reading PDFs
  • pyparsing for parsing specialised formats (e.g. broken “CSV”)
  • Beautiful Soup for page scraping
  • lxml for XML document handling
  • suds for SOAP APIs (and it is absolutely the best, easiest and most reliable Python SOAP client out there)

Our search index is based on Whoosh, with extensive bug fixes by our friendly neighbourhood search guru Andy.

We’ve also created some of our own which have been mentioned here before:

  • python-html-sanitizer takes our partners’ horrifically broken or embedded-style-riddled html and makes it something respectable. (based off the excellent html5lib as well as Beautiful Soup)
  • django-ticket is a lightweight DB-backed ticket queue optimised for efficient handling of resource-intensive tasks, like semantic analysis.
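
To show the general idea of a database-backed ticket queue, here is a small standalone sketch using sqlite3. It is not django-ticket’s actual API or schema; the table layout, status values and function names are assumptions made up for this example, and it ignores the locking a multi-worker setup would need.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE ticket (id INTEGER PRIMARY KEY, task TEXT NOT NULL, "
    "payload TEXT NOT NULL, status TEXT NOT NULL DEFAULT 'pending')"
)


def enqueue(task, payload):
    """Record a unit of work to be done later; returns the ticket id."""
    cursor = db.execute(
        "INSERT INTO ticket (task, payload) VALUES (?, ?)", (task, payload)
    )
    return cursor.lastrowid


def claim_next():
    """Mark the oldest pending ticket as running and return it, or None if idle."""
    row = db.execute(
        "SELECT id, task, payload FROM ticket "
        "WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE ticket SET status = 'running' WHERE id = ?", (row[0],))
    return row


def finish(ticket_id):
    """Mark a claimed ticket as done."""
    db.execute("UPDATE ticket SET status = 'done' WHERE id = ?", (ticket_id,))
```

Because the queue lives in the database, slow jobs such as semantic analysis can be picked up by a separate process at its own pace, and the state of every job survives a restart.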

—-

So, go and have a play.

We know there are still a few things that don’t quite work but we figure that your eyes might see things differently to ours. We’re implementing a bunch of UI fixes in the next fortnight too, so you might want to check back then and see what has improved. Things move fast on the web.

8 Comments

  • http://www.rumble.net/ Simon Rumble

    Nice work! One thing that jumps out at me is the “Established” for suburbs and towns. I’m pretty sure Marrickville predates 1998. (in fact, I have a sale poster for our street from 1891 http://rumble.smugmug.com/House/Original-sale-poster-for-our/7159879_fmdtf#459520770_9Q2tP-A-LB)

  • http://www.cityofsound.com/ Dan Hill

    Seb + team, fantastic. Great to see this out there. Well done.

  • Seb Chan

    @simon – yes this is a data quality and field mapping issue that is about to be fixed. That data comes from the Geographic Names Board and as you can see here – http://www.gnb.nsw.gov.au/name_search/extract?id=KWIOBKsETR – Marrickville’s boundaries were “last assigned in 1998”.

    We are removing that field but the origin of the name should be showing in the main record (another bug we’re fixing).

  • http://about.nsw.gov.au dan mackinlay

    @simon, yes… we really should post a “beginner’s guide to the wild ontology of australian geography”. That’s just one example of the weird definitions that have made life interesting… in fact, not only do agencies disagree about when a given geographical division began, they disagree about where a given suburb’s borders are, whether something is a suburb or a town… the Names board has slightly different definitions to the post office, which in turn disagrees with the ABS. Can’t wait until we get the electoral commission border in there to see how THEY differ. The really interesting thing is that none of these agencies necessarily agree with what people who live in a given area will call it, or where the borders are, when they might think it was founded, and so on.

  • http://greatriversnet.org and www.mnhs.org Rose Sherman

    Congratulations on this web site! It looks great. We’re working on version 2 of our endeavour of a cross institutional collections search, so I related to many of your comments. Thanks for sharing the technical details … can you share how many staff were involved in creating About NSW, how long have you been working on it, and were all the staff PowerhouseMuseum.com staff or were there staff from your partner organizations as well? Also, I’m unclear about whether or not you’re hosting the content? Do you ingest content from partner organizations, then clean it up and return it to them? I didn’t quite understand the talk back discussion.
    Extend congratulations to your team!

  • http://www.freshandnew.org/ Seb Chan

    @rose: The core team was 4 people operating out of Powerhouse – only 2 full time (Dan and Renae).

    We ingest data and clean it up where possible. There’s a whole new slew of data cleanup going on right now which should become apparent on the website shortly.

    The problem is how do we return this cleaned up data to the source organisations. They are not geared up to accept ‘fixed data’ nor are they necessarily (psychologically) ready to accept the very idea of someone else fixing their data for them.

    This is a much broader issue as we said in the (long) post. Dan has done a lot of research around ‘post normal science’ and how we deal with the inevitable basing of assumptions on flawed data.

  • Andrew Perry

    Great article – thanks for sharing all the inside goss on a great project and the cultural/technical/legal challenges.

  • rebecca pinchin

    great to read such a thorough and thoroughly readable explanation of a project that is so intriguing in its scope and vision!