API Collection databases Metadata open content Semantic Web

Things clever people do with your data #65535: Introducing ‘Free Your Metadata’

Last year Seth van Hooland at the Free University Brussels (ULB) approached us to look at how people used and navigated our online collection.

A few days ago Seth and his colleague Ruben Verborgh from Ghent University launched Free Your Metadata – a demonstrator site for showing how even irregular metadata can have value to others and how, if it is released rather than clutched tightly (until that mythical day when it is ‘perfect’), it can be cleaned up and improved using new software tools.

What’s awesome is that Seth & Ruben used the Powerhouse’s downloadable collection datafile as the test data for the project.

Here’s Seth and his team talking about the project.

F&N: What made the Powerhouse collection attractive for use as a data source?

Number one, it’s available for everyone and therefore our experiment can be repeated by others. Otherwise, the records are very representative for the sector.

F&N: Was the data dump more useful than the Collection API we have available?

This was purely due to the way Google Refine works: on large amounts of data at once. But also, it enables other views on the data, e.g., to work in a column-based way (to make clusters). We’re currently also working on a second paper which will explain the disadvantages of APIs.

F&N: What sort of problems did you find with our collection?

Sometimes really broad categories. Other inconveniences could be solved in the cleaning step (small textual variations, different units of measurement). All issues are explained in detail in the paper (which will be published shortly). But on the whole, the quality is really good.
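The ‘small textual variations’ mentioned here are the kind of thing a key-collision clustering pass is meant to catch. As a rough sketch only (this is not Google Refine’s own code, and the function names and sample values are made up for illustration), a normalising ‘fingerprint’ key could be computed like this in Python:

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalised key so near-duplicates collide."""
    # Trim, lowercase, and strip ASCII punctuation.
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, drop duplicate tokens, sort, and rejoin.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical category values, for illustration only.
sample = [
    "Glass plate negative",
    "glass plate negative.",
    "Negative, glass plate",
    "Teapot",
]
print(cluster(sample))
```

Values that end up in the same cluster are only candidates for merging; a person still decides which spelling wins.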

F&N: Why do you think museums (and other organisations) have such difficulties doing simple things like making their metadata available? Is there a confusion between metadata and ‘images’ maybe?

There is a lot of confusion about what the best way is to make metadata available. One of the goals of the Free Your Metadata initiative is to put forward best practices to do this. Institutions such as libraries and museums have a tradition of only publishing information which is 100% complete and correct, which is more or less impossible in the case of metadata.

F&N: What sorts of things can now be done with this cleaned up metadata?

We plan to clean up, reconcile, and link several other collections to the Linked Data Cloud. That way, collections are no longer islands, but become part of the interlinked Web. This enables applications that cross the boundaries of a single collection. For example: browse the collection of one museum and find related objects in others.

F&N: How do we get the cleaned up metadata back into our collection management system?

We can export the result back as TSV (like the original result) and e-mail it. Then, you can match the records with your collection management system using record IDs.
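For anyone wondering what that matching step might look like, here is a minimal Python sketch. It assumes both files are tab-separated with a shared ID column; the column names `record_id` and `category` and the two tiny exports are invented for the example and are not the real datafile layout.

```python
import csv
import io


def load_by_id(tsv_text, id_column):
    """Index the rows of a TSV export by their record ID."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_column]: row for row in reader}


def changed_fields(original, cleaned):
    """Yield (record_id, field, old, new) for every value that differs."""
    for record_id, new_row in cleaned.items():
        old_row = original.get(record_id)
        if old_row is None:
            continue  # ID not in the source system; needs manual review
        for field, new_value in new_row.items():
            if old_row.get(field) != new_value:
                yield record_id, field, old_row.get(field), new_value


# Made-up two-row exports; the column names are assumptions.
original_tsv = "record_id\tcategory\n1\tglass plate negative.\n2\tTeapot\n"
cleaned_tsv = "record_id\tcategory\n1\tGlass plate negative\n2\tTeapot\n"

changes = list(
    changed_fields(
        load_by_id(original_tsv, "record_id"),
        load_by_id(cleaned_tsv, "record_id"),
    )
)
print(changes)
```

Producing a list of changes, rather than overwriting records wholesale, leaves room for someone to review the suggestions before they go into the collection management system.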

Go and explore Free Your Metadata and play with Google Refine on your own ‘messy data’.

If you’re more nerdy you probably want to watch their ‘cleanup’ screencast where they process the Powerhouse dataset with Google Refine.

Geotagging & mapping Interactive Media Semantic Web

Introducing About NSW – maps, census visualisations, cross search

Well here’s an alpha release of something that we’ve been working on forever (well, almost 2 years). It is called About NSW and is a bit of a Frankenstein creation of different data sets mashed together by a sophisticated backend. The project began with an open-ended brief to be a cross-sectorial experiment in producing new interfaces for existing content online. In short, we were given permission to play in the sandbox and with that terrain comes a process of trial and error, learning and revision.

We’ve had an overwhelming number of feature requests and unfortunately have not been able to accommodate them all, but this does give us an indication of the need to work on solutions to common problems such as –

  • “can we handle electoral boundaries and view particular datasets by suburb postcodes?”
  • “can we aggregate upcoming cultural events?”
  • “can we resolve historical place names on a contemporary Google Map?”

to name just a few.

There are three active voices in this blog post: my own, accompanied by those of Dan MacKinlay (developer) and Renae Mason (producer). Dan reads a lot of overly fat economics and social theory books when not coding and travelling to Indonesia to play odd music in rice paddies; while Renae reads historical fiction, magical realism and design books when not producing and is about to go tango in Buenos Aires – hola!

We figured this blog post might be a warts and all look at the project to date. So grab a nice cup of herbal tea and sit back. There are connections here to heavyweight developer stuff as well as more philosophical and practical issues relevant to Government 2.0 discussion.

So what exactly is About NSW?

Firstly it is the start of an encyclopaedia.

Our brief was never to create original content but to organise what already existed across a range of existing cultural institution websites. There’s some original content in there, but that probably isn’t exciting in itself.

While projects like the wonderful Encyclopaedia of New Zealand, ‘Te Ara’ are fantastic, they cost rather more than our humble budget. Knowing up front that we had scant resources for producing ‘new’ content, we have tried to build a contextual discovery service that assists in exposing existing content online. We aimed to form partnerships with content providers to maximise the use of all those fact sheets, images and other information that is already circulating on the web. We figured, why duplicate efforts? In this way, we would like to grow About NSW as a trustworthy channel that delivers cultural materials to new audiences, sharing traffic and statistics with our partners along the way. That said, there’s actually a whole lot of exciting material lurking deep in the original content of the encyclopaedia, including a slew of digitised almanacs that we are yet to expose.

We’re particularly excited to be able to bring together and automatically connect content from various sources that otherwise wouldn’t be “getting out there”. There are a lot of websites that go around and scrape other sites for content – but really getting in there and making good use of their data under a (reasonably) unrestrictive license is facilitated by having the request come from inside government. It’s not all plain sailing, mind – if you look through our site you’ll see that a few partners were afraid to display the full content of their articles and have asked they be locked down.

But, because we work in aggregate, we can enrich source data with correlated material. A simple lucid article about a cultural figure can provide a nice centrepiece for an automatically generated mashup of related resources about said figure. We could go a lot further there in integrating third party content, pulling in material from sources like Wikipedia and Freebase rather than going through the tedious process of creating our own articles. (We certainly never intended to go into the encyclopaedia business!)

Secondly, the site is an explorer of the 2006 Australian Census data. As you might know, the Australian Bureau of Statistics does a rather excellent job of releasing statistical data under a Creative Commons license. What we have done is take this data and build a simple(r) way of navigating it by suburbs. We have also built a dynamic ‘choropleth’ map that allows easy visualising of patterns in a single data set. You can pin suburbs for comparison, and look for patterns across the State. (With extra special bells and whistles built for that by some folks from the Interaction Consortium who worked on the team.)
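For the curious, the core of any choropleth is just a rule for sorting each area’s value into a small number of colour classes. The Python below is a generic equal-interval sketch, not the code behind our map; the suburb names, values and colour ramp are all placeholders.

```python
def equal_interval_classes(values_by_suburb, n_classes=5):
    """Assign each suburb a class index from 0 to n_classes - 1."""
    low = min(values_by_suburb.values())
    high = max(values_by_suburb.values())
    width = (high - low) / n_classes or 1  # avoid dividing by zero
    classes = {}
    for suburb, value in values_by_suburb.items():
        index = int((value - low) / width)
        # The maximum value would otherwise fall just outside the last class.
        classes[suburb] = min(index, n_classes - 1)
    return classes


# Placeholder figures, not real census values.
values = {"Suburb A": 10, "Suburb B": 20, "Suburb C": 50}

# A light-to-dark ramp; the hex colours are arbitrary choices.
ramp = ["#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c"]

classes = equal_interval_classes(values, n_classes=len(ramp))
colours = {suburb: ramp[index] for suburb, index in classes.items()}
print(colours)
```

Equal intervals are the simplest choice; quantile classes tend to show more variation when the data is heavily skewed.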

Third, we’ve started the long and arduous process of using the same map tools to build a cultural collections navigator that allows the discovery of records by suburb. This remains the most interesting part of the site but also the one most fraught with difficulties. For a start, very few institutions have well geo-located collections – certainly not with any consistency of precision. We have tried some tricky correlations to try to maximise the mappable content but things haven’t (yet) turned out the way we want them to.

But, considering the huge data sets we are dealing with we reckon we’ve done pretty well given the data quality issues and the problem of historical places not being able to be reverse geocoded automatically.

Fourth, not much of this would be much chop if we weren’t also trying to build a way of getting the data out in a usable form for others to work with. That isn’t available yet, mainly because of the thicket of issues around rights and the continuing difficulty in convincing contributors that views of their content on our site can be as valuable, potentially more valuable when connected to other material, than views on their individual silo-ed sites.

Where is the data from?

About NSW has afforded a unique opportunity to work with other organisations that we don’t usually come into contact with and we’ve found that generosity and a willingness to share resources for the benefit of citizens is alive and well in many of our institutions. For example, we approached the NSW Film & Television Office with a dilemma – most of the images that we can source from the libraries and archives are circa 1900, which is fantastic if you want to see what Sydney looked like back then, but not so great if you want to get a sense of what Sydney looks like today. They kindly came to the party with hundreds of high quality, contemporary images from their locations database which isn’t public facing but currently serves a key business role in attracting film and television productions to NSW.

Continuing along with our obsession for location specific data, we also approached the NSW Heritage Branch who completely dumbfounded us by providing us with not just some of their information on heritage places but the entire NSW State Heritage Register. The same gratitude is extended to the Art Gallery of NSW who filled in a huge gap on the collection map with their collection objects so now audiences can, for the first time, see what places our most beloved artworks are associated with (and sometimes, the wonderful links with heritage places – consider the relationship with the old gold-mining heritage town of Hill End and an on-going Artist in Residency program that is hosted there and has attracted artists such as Russell Drysdale and Brett Whiteley). With our places well and truly starting to take shape we decided to add in demographic data with the most recent census from the Australian Bureau of Statistics who noted that their core role in providing raw data leaves them little time for the presentation layer so they were delighted that we were interested in visualising their work.

Besides our focus on places, we are pretty keen on exploring more about the people who show up in our collection records and history books. To this end, the Australian Dictionary of Biography has allowed us to display extracts of all their articles that relate to people associated with NSW.

As a slight off-shoot to this project, we even worked with the NSW Births Deaths and Marriages Registry to build the 100 Years of Baby Names that lives on the central NSW Government site, but that’s a different story that’s already been blogged about here.

There are of course many other sources we’d like to explore in the future but for now we’ve opted for the low-hanging fruit and would like to thank our early collaborators for taking the leap of faith required and trusting us to re-publish content in a respectful manner.

There are many things we need to improve but what a great opportunity it has been to work on solving some of our common policy and legacy technology problems together.

Cultural challenges

Unfortunately, despite the rosy picture we are beginning to paint here, the other side is that collecting institutions are not accustomed to working across silos and are not well-resourced to play in other domains.

Comments like “This isn’t our core business!” and “Sounds great but we don’t have time for this!” have been very common. Others have been downright resistant to the idea altogether. The latter types prefer to operate a gated-estate that charges for entrance to all the wonders inside – the most explicit being “We don’t think you should be putting that kind of information on your site because everyone should come to us!”.

But we wonder, what’s more important – expert pruning? Or a communal garden that everyone can take pride in and that improves over time?

To be fair, these are confronting ideas for institutions that have always been measured by their ‘authoritativeness’ and by the sheer numbers that can be attracted through their gates, and not the numbers who access their expertise.

Unsurprisingly these are the exact same issues being tackled by the Government 2.0 Taskforce.

It’s an unfortunately constructed competitive business and the worth of institutions online is still being measured on the basis of how many people interact with their content on their website. Once those interactions begin to take place elsewhere it becomes more difficult to report despite the fact that it is adding value – after all, how do you quantify that?

We’ve done some nifty initial work with the Google Analytics API to try to simplify data reporting back to contributors but it is a philosophical objection more than anything.

Add to that copyright and privacy and you have a recipe for trouble.

Munging data

Did we already say that this project hasn’t been without its problems?

The simplest summary is: web development in government has generally had little respect for Tim Berners-Lee’s design principle of least power.

While sites abound with complicated Java mapping widgets, visually lush table-based designs and so on, there is almost no investment in pairing that with simple and convenient access to the underlying data in a direct, simple, machine-readable way. This is particularly poignant for projects that have started out with high ideals, but have lost funding; all the meticulous work they have expended in creating their rich designs can go to waste if the site design only works in Netscape Navigator 4.

Making simple data sets available is timeless insurance against the shifting ephemeral fads of browser standards, and this season’s latest widget technology, but it’s something few have time for. That line of reasoning is particularly important for our own experimental pilot project. We have been lucky, unlike some of our partners, in that we have designed our site from the ground up to support easy data export. (You might well ask, though, if we can’t turn that functionality on for legal reasons, have we really made any progress?)

As everyone knows, pulling together datasets from different places is just a world of pain. But it is a problem that needs to be solved for any of the future things all of us want to do to get anywhere. Whilst we could insist on standards, what we wanted to experiment with here was how far we could get without mandating standards – because in the real world, especially with government data, a lot of data is collected for a single purpose and not considered for sharing and cross-mixing.

We’d love plain structured data in a recognised format, but it isn’t generally available. (RDF, OAI-PMH, ad hoc JSON over REST, KML – even undocumented XML with sensibly named elements will do.) Instead, what we usually get are poorly marked up HTML sites, or databases full of inconsistent character encodings, that need to be scraped – or even data that we need to stitch together from several different sources to re-assemble the record in our partner’s database because their system won’t let them export it in one chunk. Elsewhere we’ve had nice Dublin Core available over OAI, but even once all the data is in, getting it to play nicely together is tricky, and parsing Dublin Core’s free-text fields has still been problematic.
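To give a flavour of the Dublin Core problem, here is a small Python sketch using only the standard library. The sample record is invented, and `guess_year` is a deliberately naive, hypothetical helper; it is not what runs on the site, just an illustration of how getting the fields out is easy while interpreting the free text is not.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by simple Dublin Core elements.
DC = "{http://purl.org/dc/elements/1.1/}"

# An invented oai_dc record, for illustration only.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example record</dc:title>
  <dc:coverage>Hill End, NSW</dc:coverage>
  <dc:date>circa 1900</dc:date>
</oai_dc:dc>"""


def dc_fields(xml_text):
    """Collect Dublin Core elements into a dict of lists (fields can repeat)."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root:
        if element.tag.startswith(DC):
            name = element.tag[len(DC):]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


def guess_year(free_text):
    """Pull the first plausible four-digit year out of a free-text date."""
    found = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", free_text)
    return int(found.group(1)) if found else None


record = dc_fields(SAMPLE)
print(record["coverage"], guess_year(record["date"][0]))
```

A date like “circa 1900” survives this treatment; “late nineteenth century” or “1890s-1920s” would need considerably more care.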

In our standards-free world, there is also the problem of talking back.

Often we’re faced with the dilemma that we believe that we have in some way value-added to the data we have been given – but we have no way of easily communicating that back to its source.

Maybe we’ve found inconsistencies and errors in the data we have imported, or given “blobs” of data more structure, or our proofreaders have picked up some spelling mistakes. We can’t automatically encode our data back into the various crazy formats it comes in (well, that is twice as much work!), and should we even invest the time in that if there is no agreed way of communicating suggested changes? Or what if the partner in question has lost funding and doesn’t have time to incorporate updates no matter how we provide them?

This is a tricky problem without an easy solution.

What does it run on?

Behind the scenes the site is built pretty much with open source choices. It was built on Python using the Django framework, and PostgreSQL’s geographic extension PostGIS (the combination known as GeoDjango).

For the interactive mapping it uses Modest Maps, which allows us to change between tile providers as needed, along with a whole bunch of custom file-system based tile-metadata service code. Everything is pretty modular and re-purposable.

Since we have data coming from lots of different providers with very different sets of fields, we store data internally in a general format which can handle arbitrary data – the EAV pattern – although we get more mileage out of our version because of Django’s sophisticated support for data model subclassing.
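If you haven’t met the EAV (entity-attribute-value) pattern before, the idea is that every field of every record becomes one row in a single tall table. The sketch below uses Python’s built-in sqlite3 module rather than our Django models, and the table name, record IDs and fields are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attribute_value (entity_id TEXT, attribute TEXT, value TEXT)"
)


def store(entity_id, record):
    """Write each field of a record as its own (entity, attribute, value) row."""
    conn.executemany(
        "INSERT INTO attribute_value VALUES (?, ?, ?)",
        [(entity_id, key, str(val)) for key, val in record.items()],
    )


def load(entity_id):
    """Reassemble a record from its rows."""
    rows = conn.execute(
        "SELECT attribute, value FROM attribute_value WHERE entity_id = ?",
        (entity_id,),
    )
    return dict(rows.fetchall())


# Two providers with completely different fields share one table.
store("heritage-1", {"name": "Example homestead", "listing_type": "State"})
store("gallery-1", {"title": "Example painting", "artist": "A. Painter", "year": 1948})

print(load("gallery-1"))
```

The price of this flexibility is that everything comes back as text and queries across many attributes get awkward, which is where model subclassing earns its keep.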

We have also used Reuters’ Open Calais to cross-map and relate articles initially whilst a bunch of geocoders go to work making sense of some pretty rough geo-data.

We use both the State Government supplied geocoder from the New South Wales government’s Spatial Information Exchange, and Google’s geocoder to fill the gaps.
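The gap-filling logic is roughly “try the first geocoder, and only ask the next one if it comes back empty”. Here is a hedged Python sketch of that idea with stand-in functions; the two stubs and their approximate coordinates are placeholders and do not call either real service.

```python
def geocode_with_fallback(place_name, geocoders):
    """Try each (name, function) pair in order; return the first hit."""
    for name, geocoder in geocoders:
        result = geocoder(place_name)
        if result is not None:
            return name, result
    return None, None


# Stand-ins for real services: each returns (lat, lon) or None.
def state_stub(place_name):
    return {"Sydney": (-33.87, 151.21)}.get(place_name)


def google_stub(place_name):
    known = {"Sydney": (-33.87, 151.21), "Parramatta": (-33.81, 151.00)}
    return known.get(place_name)


geocoders = [("state_stub", state_stub), ("google_stub", google_stub)]

print(geocode_with_fallback("Sydney", geocoders))
print(geocode_with_fallback("Parramatta", geocoders))
print(geocode_with_fallback("Nowhere at all", geocoders))
```

Returning the name of the geocoder that answered makes it possible to record where each coordinate came from, which helps when the results later turn out to disagree.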

And we use Google Analytics, plus the Google Analytics Data Export API to be able to deliver contributor-specific usage data.

We use an extensive list of open-source libraries to make all this work, many of which we have committed patches to along the way.

We do our data parsing with

  • phpserialize for Python for rolling quick APIs with our PHP-using friends
  • PyPdf for reading PDFs
  • pyparsing for parsing specialised formats (e.g. broken “CSV”)
  • Beautiful Soup for page scraping
  • lxml for XML document handling
  • suds for SOAP APIs (and it is absolutely the best, easiest and most reliable Python SOAP client out there)

Our search index is based off Whoosh, with extensive bug fixes by our friendly neighbourhood search guru Andy.

We’ve also created some of our own which have been mentioned here before:

  • python-html-sanitizer takes our partners’ horrifically broken or embedded-style-riddled html and makes it something respectable. (based off the excellent html5lib as well as Beautiful Soup)
  • django-ticket is a lightweight DB-backed ticket queue optimised for efficient handling of resource-intensive tasks, like semantic analysis.


So, go and have a play.

We know there are still a few things that don’t quite work but we figure that your eyes might see things differently to us. We’re implementing a bunch of UI fixes in the next fortnight too so you might want to check back in a fortnight and see what has improved. Things move fast on the web.

Collection databases Metadata Semantic Web

OPAC – Connecting collections to WorldCat Identities

If you were at the National Library of Australia’s annual meeting a while back then you might have spotted Thom Hickey from OCLC mentioning that the Powerhouse Museum has started to use the WorldCat Identities to connect people in the collection to their identity records and library holdings in WorldCat.

This is now public in an early alpha form.

Here’s an example from a collection record.

Tucked away in the automatically generated metadata (using Open Calais) are some links from people to their WorldCat Identities record over at WorldCat – if such a record exists. At the moment there isn’t a lot of disambiguation between people of the same name going on, so there are quite a few false positives.

In this example, Geoffrey C Ingleton now links to his record on WorldCat Identities.

In the alpha stage all this means is that visitors can now connect from a collection record to the name authority file and thence, on WorldCat, to library holdings (mostly books) by or about the people mentioned in that collection record. Later you’ll be able to do a whole lot more . . . we’re using the WorldCat API and we’ve got a jam-packed development schedule over the next few summer months (it is cooler in the office than out of it!).

Not only does this allow visitors to find more, it also allows the Powerhouse to start to add levels of ranking to the person data identified by Open Calais – an important step in putting that auto-generated metadata to better use. Equally importantly, it opens the door to a whole new range of metadata that can be associated with an object record in our collection. Consider the possibilities for auto-generated bibliographies, or even library-generated additional classification metadata.

For those excited by the possibilities offered by combining the collective strengths of each partner in the LAM (libraries, archives, museums) nexus, this should be a good example of a first step towards mutual metadata enhancement.

We’re also very excited about the possibilities that the National Library of Australia’s People Australia project holds in this regard too.

Interactive Media Semantic Web User experience

More powerful browsers – Mozilla Labs Ubiquity

Mozilla Labs has released Aza Raskin’s Ubiquity in an early alpha form. This is a glimpse into a future world of browser technology which brings notions of the semantic web directly into the browser and connects the dots between websites – not from a provider perspective, but from a user perspective.

Metadata Semantic Web

Collaborative collective classification – BBC Labs on using Wikipedia as metadata

Chris Sizemore at the BBC’s Radio Labs demonstrates an experiment in automated metadata, much akin to Open Calais.

Sizemore has taken Wikipedia and has built a simple web application that uses Wikipedia to disambiguate entities in a block of text and suggest broad categories for the content. Because Wikipedia has broad coverage of topics and deep coverage of specific niches, it can provide, as Sizemore writes, for some areas (especially popular culture), a good enough data source for automated classification.

Here’s Sizemore’s methodology –

1. Download entire contents of the English language Wikipedia (careful, that’s a large 4GB+ xml file!)

2. Parse that compressed XML file into individual text files, one per Wikipedia article (and this makes things much bigger, to the tune of 20GB+, so make sure you’ve got the hard drive space cleared)

3. Use a Lucene indexer to create a searchable collection (inc. term vectors) of your new local Wikipedia text files, one Lucene document per Wikipedia article

4. Use Lucene’s ‘MoreLikeThis’ to compare the similarity of a chunk of your own text content to the Wikipedia documents in your new collection

5. Treat the ranked Wikipedia articles returned as suggested categories for your text

Basically what is going on here is that the text you wish to classify is compared to Wikipedia articles and the articles with the ‘closest match’ in terms of content, have their URLs thrown back as potential classification categories.
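To make the ‘closest match’ idea concrete, here is a toy Python version. It is not Lucene and not Sizemore’s code: it swaps in a plain TF-IDF weighting with cosine similarity as a stand-in for ‘MoreLikeThis’, and the three miniature ‘articles’ are made up rather than taken from Wikipedia.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_index(articles):
    """Return per-article TF-IDF vectors and the IDF table."""
    counts = {title: Counter(tokenize(body)) for title, body in articles.items()}
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(articles)
    # Smoothed IDF so that terms found in every article still carry some weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = {
        title: {t: n * idf[t] for t, n in c.items()} for title, c in counts.items()
    }
    return vectors, idf


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_categories(text, vectors, idf, top_n=3):
    """Rank article titles by similarity to the text; titles act as categories."""
    query = {t: n * idf[t] for t, n in Counter(tokenize(text)).items() if t in idf}
    ranked = sorted(vectors, key=lambda title: cosine(query, vectors[title]), reverse=True)
    return ranked[:top_n]


# Made-up stand-ins for Wikipedia articles.
articles = {
    "Steam engine": "a steam engine is a heat engine that performs mechanical work using steam",
    "Teapot": "a teapot is a vessel used for steeping tea leaves in hot water",
    "Photography": "photography is the art of creating images by recording light on film or a sensor",
}

vectors, idf = build_index(articles)
print(suggest_categories("locomotive driven by a steam engine", vectors, idf))
```

With millions of real articles the ranking is the same in principle, but an inverted index like Lucene’s is what makes it fast enough to be usable.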

Combine this with Open Calais and there will be some very interesting results across a broad range of text datasets.

As regular readers will know, we’ve been experimenting quite a bit with Open Calais at the Powerhouse with some exciting initial results. We’ve been looking at the potential of Calais in combination with other data sources including Wikipedia/dbPedia/Freebase and we’ll be watching Sizemore’s experiment with interest.

Perhaps my throwaway line in recent presentations that ‘humans should never have to create metadata’ might actually be becoming closer to a reality.

MW2008 Semantic Web

The museum APIs are coming – some thoughts on interoperability

At MW08 there were the beginnings of a push amongst the technically oriented for the development of APIs for museum data, especially collections. Driven in part by discussions and early demonstrations of semantic web applications in museums, the conceptual work of Ross Parry, and the presence of Eric Miller and Brian Sletten of Zepheira and Aaron Straup Cope and George Oates of Flickr, MW08 might well be a historic turning point for the sector in terms of data interoperability and experimentation.

Since April there has been a lot of movement, especially in the UK.

The ‘UK alpha tech team’ of Mike Ellis, Frankie Roberto, Fiona Romeo, Jeremy Ottevanger and Mia Ridge is leading the charge, all working on new ways of connecting, extracting and visualising data from the Science Museum, Museum of London and the National Maritime Museum. Together with them and a few other UK commercial sector folk, I’ve been contributing to a strategy wiki around making a case for APIs in museums.

Whilst the tech end of things is (comparatively) straightforward, the strategic case for an API is far more complex to make. As we fiddle, though, others make significant progress.

Already a community project, dbPedia, has taken the content of Wikipedia and made it available as an open database. What this means is that it is now possible to make reasonably complex semantic queries of Wikipedia – something I’m yet to see done on a museum collection. There are a whole range of examples and mini-web applications already built to demonstrate queries like “people born in Paris” or “people influenced by Nietzsche”. More than this, though, are the exciting opportunities to use Wikipedia’s data and combine it with other datasets.
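What makes these queries possible is that the data is stored as simple subject-predicate-object statements that can be pattern-matched and joined. dbPedia itself is queried with SPARQL; the Python below is only a toy in-memory imitation of the idea, with placeholder people and made-up predicate names rather than real dbPedia data.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Placeholder statements, for illustration only.
triples = [
    ("Person A", "birthPlace", "Paris"),
    ("Person B", "birthPlace", "Paris"),
    ("Person C", "birthPlace", "Sydney"),
    ("Person B", "influencedBy", "Nietzsche"),
    ("Person C", "influencedBy", "Nietzsche"),
]

born_in_paris = {s for s, _, _ in match(triples, predicate="birthPlace", obj="Paris")}
influenced = {s for s, _, _ in match(triples, predicate="influencedBy", obj="Nietzsche")}

# Joining the two patterns: born in Paris AND influenced by Nietzsche.
print(sorted(born_in_paris & influenced))
```

A museum collection expressed the same way could answer “objects made in Paris by people influenced by Nietzsche” with one more pattern.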

What should be very obvious is that, much as Wikipedia already draws audiences away from museum sites, its dataset, once made openly available for combining with other datasets and usable in other ways, will draw even more away. You might well ask why similar complex queries are so hard to make in our own collection databases? “Show me all the artwork influenced by Jackson Pollock?”

On June 19 the MCG’s Museums on the Web UK takes place at the University of Leicester with the theme of “Integrate, federate, aggregate”. There are going to be some lovely presentations there – I expect Fiona Romeo will be demoing some lovely work they’ve been doing and Frankie Roberto will be reprising his highly entertaining MW08 presentation too.

The day before, as with the MCGUK07 conference, there will be a mashup day. Last year’s mashup day produced a remarkable number of quick working prototypes drawing on data sources provided by the 24 Hour Museum (now Culture24). This year the data looks like it will be coming from the collection databases of some of the UK nationals.

Already Box UK and Mike Ellis have whipped up a really nice demonstration of data combining – done by scraping the websites of the major museums with a little bit of PHP code. Even better, the site provides XML feeds and I expect that it will be a major source of mashups at MCG UK.

I like the FAQ that goes along with the site. Especially this –

Q: Doesn’t this take traffic away from the individual sites?

We don’t think so, but not many studies have been done into how “off-site” browsing affects the “in-site” metrics. Already, users will be searching for, consuming, and embedding your images (and other content) via aggregators such as Google Images. This is nothing new.

Also, ask yourself how much of your current traffic derives from users coming to explicitly browse your online collections?

The aim is that by syndicating your content out in a re-usable manner, whilst still retaining information about its source, an increasing number of third-party applications can be built on this data, each addressing specific user needs. As these applications become widely used, they drive traffic to your site that you otherwise wouldn’t have received: “Not everyone who should be looking at collections data knows that they should be looking at collections data”.

I’ve spoken and written about this issue of metrics previously, and these and the control issues need to be sorted out if there is going to be any real traction in the sector.

Unlike the New York Times (who apparently announced an API recently), and the notable commercial examples like Flickr, the museum sector doesn’t have a working (business) model for their collections other than a) exhibitions, b) image sales and possibly c) research services.

Now back to that semantic query, wouldn’t it be useful if we could do this – “Play me all the music videos of singles that appear on albums whose record cover art was influenced by Jackson Pollock?”. This could, of course, be done by combining the datasets of, say, the Tate, Last.FM, Amazon and YouTube – the missing link being the Tate.

Collection databases Geotagging & mapping MW2008 Search Semantic Web

MW2008 – Data shanty towns, cross-search and combinatory approaches

One of the popular sessions at MW2008 in Montreal was a double header featuring Frankie Roberto and myself talking about different approaches to data combining across multiple institutions.

Data combining was a bit of a theme this year with Mike Ellis, Brian Kelly and others talking mashups; Ross Parry, Eric Miller and Brian Sletten all talking ‘semantic web’; and Terry Makewell and Carolyn Royston demonstrating the early prototype of the NMOLP cross search.