
OPAC2.0 – OpenCalais meets our museum collection / auto-tagging and semantic parsing of collection data

Today we went live with another one of the new experimental features of our collection database – auto-generation of tags based on semantic parsing.

Throughout the Museum’s collection database you will now find, in the right-hand column of the more recently acquired objects (see a quick sample list), a new cluster of content titled “Auto-generated tags”.

We have been experimenting with Reuters’ OpenCalais web service since it launched in January. Now we have made a basic implementation of it applied to records in our collection database, initially as a way of generating extra structured metadata for our objects. We can extract proper names, places (by continent, country, region, state and city), company names, technologies and specialist terms, from object records all without requiring cataloguers to catalogue in this way. Having this data extracted makes it much easier for us to connect objects by manufacturers, people, and places within our own collection as well as to external resources.

Here’s a brief description of what OpenCalais is, taken from their FAQ:

From a user perspective it’s pretty simple: You hand the web service unstructured text (like news articles, blog postings, your term paper, etc) and it returns semantic metadata in RDF format. What’s happening in the background is a little more complicated.

Using natural language processing and machine learning techniques, the Calais web service looks inside your text and locates the entities (people, places, products, etc), facts (John Doe works for Acme Corp) and events (Jane Doe was appointed as a Board member of Acme Corp) in the text. Calais then processes the entities, facts and events extracted from the text and returns them to the caller in RDF format.
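The entity extraction the FAQ describes can be sketched in a few lines. This is only an illustration of the general shape of the workflow: the namespace, type URIs and `name` literal below are simplified stand-ins I have invented for the example, not the real OpenCalais RDF vocabulary, and the sample entities are taken from this post rather than an actual API response.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Simplified stand-in for the RDF an entity-extraction service might return.
# The example.org URIs are illustrative only.
SAMPLE_RDF = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:c="http://example.org/entity/">
  <rdf:Description rdf:about="http://example.org/instance/1">
    <rdf:type rdf:resource="http://example.org/entity/Person"/>
    <c:name>Ray Oscilloscope</c:name>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/instance/2">
    <rdf:type rdf:resource="http://example.org/entity/City"/>
    <c:name>Sydney</c:name>
  </rdf:Description>
</rdf:RDF>"""

def entities_by_type(rdf_xml):
    """Group extracted entity names by the local part of their rdf:type."""
    grouped = defaultdict(list)
    for desc in ET.fromstring(rdf_xml).findall(f"{RDF}Description"):
        type_el = desc.find(f"{RDF}type")
        name_el = desc.find("{http://example.org/entity/}name")
        if type_el is None or name_el is None:
            continue
        type_uri = type_el.get(f"{RDF}resource")
        grouped[type_uri.rsplit("/", 1)[-1]].append(name_el.text)
    return dict(grouped)

print(entities_by_type(SAMPLE_RDF))
# e.g. {'Person': ['Ray Oscilloscope'], 'City': ['Sydney']}
```

Once the entities are grouped by type like this, each group can be rendered as a labelled tag cluster on the object page.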

Whilst we store the RDF triples and unique hash, we are not making use of these beyond display right now. There is a fair bit of ‘cleaning up’ we have to do first, and we’d like to enlist your help, so read on.

Obviously the type of content that we are asking OpenCalais to parse is complex. Whilst it is ideally suited to the more technical objects in our collection as well as our many examples of product design, it struggles with differentiating between content on some object records.

Here is a good example from a recent acquisition of amateur radio equipment used in the 1970s and 1980s.

The OpenCalais tags generated are as follows –

The bad:

The obvious errors which need deleting are the classification of “Ray Oscilloscope” as a person (although that might be a good name for my next avatar!); “Amateur Microprocessor Teleprinter Over Radio” as a company; the rather sinister “Terminal Unit” as an organisation; and the meaningless “metal” as an industry term.

We have included a simple ‘X’ to allow users to delete the ones that are obviously incorrect and will be tracking its use.
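One way to implement that ‘X’ while still tracking its use is a soft delete: the rejected tag is flagged rather than removed, so the rejections themselves become data about where the parser goes wrong. The sketch below is hypothetical — the table name, columns and object ID are all made up for illustration, not our actual schema.

```python
import sqlite3

# Hypothetical schema; names and the object ID are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE auto_tag (
        object_id   INTEGER NOT NULL,
        tag         TEXT    NOT NULL,
        entity_type TEXT    NOT NULL,
        deleted     INTEGER NOT NULL DEFAULT 0   -- set when a user clicks 'X'
    )
""")
conn.executemany(
    "INSERT INTO auto_tag (object_id, tag, entity_type) VALUES (?, ?, ?)",
    [(12345, "Ray Oscilloscope", "Person"),
     (12345, "Terminal Unit", "Organization")],
)

def delete_tag(object_id, tag):
    # Soft-delete: keep the row so we can study which tags users reject.
    conn.execute(
        "UPDATE auto_tag SET deleted = 1 WHERE object_id = ? AND tag = ?",
        (object_id, tag),
    )

def visible_tags(object_id):
    rows = conn.execute(
        "SELECT tag FROM auto_tag WHERE object_id = ? AND deleted = 0",
        (object_id,),
    )
    return [r[0] for r in rows]

delete_tag(12345, "Ray Oscilloscope")
print(visible_tags(12345))  # ['Terminal Unit']
```

Keeping the flagged rows means a later query over `deleted = 1` can surface the entity types the parser most often gets wrong.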

These errors, and others like them, reveal OpenCalais’ history as ClearForest in the business world. The rules it applies when parsing text, as well as the entities that it is ‘aware’ of, are rooted in the language of enterprise, finance and commerce.

The good:

On the other hand, by making all this new ‘auto-generated’ tag data available, users can now traverse our collection in new ways, discovering connections between objects that previously remained hidden deep in blocks of text.

Currently, clicking any tag returns a search result for that term across the rest of our collection. In a few hours of demonstrations to registrars and cataloguers today, many new connections between objects were discovered, and people we didn’t expect to be mentioned in our collection documentation were revealed.
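In other words, a tag click is just the tag text fed through an ordinary text search over object records — no special semantic machinery yet. A minimal sketch, with made-up records standing in for our collection documentation:

```python
# Hypothetical object records keyed by ID; a tag click simply runs the
# tag text through a plain substring search over the record text.
records = {
    12345: "Amateur radio equipment used in Sydney during the 1970s",
    67890: "Cellular telephone manufactured for the Australian market",
}

def search(term):
    """Return IDs of records whose text contains the term (case-insensitive)."""
    term = term.lower()
    return [oid for oid, text in records.items() if term in text.lower()]

print(search("Sydney"))  # [12345]
```

A production OPAC would of course use its existing full-text index rather than a substring scan, but the principle — tag text in, ordinary search results out — is the same.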

Help us:

Have a play with the auto-tags and see what you can find. Feel free to delete incorrect auto-tags.

We will be improving their operation over the coming weeks, but hope that this is a useful demonstration of some of the potential lying dormant in rich collection records, and a real-world demonstration of what the ‘semantic web’ might begin to mean for museums. It is important to remember that there is no way this structured data could be generated manually – the volume of legacy data is far too large, and the burden on curatorial and cataloguing staff would be too great.

3 replies on “OPAC2.0 – OpenCalais meets our museum collection / auto-tagging and semantic parsing of collection data”

Tom Tague from Calais here.

Wow. We love it. We had a whole crowd of people hunched over a screen playing with it this afternoon. Always a big boost to see your tools actually deployed in the real world.

Please give us updates as you learn about what works well and what doesn’t. We are continuously updating the metadata generation engine and have the opportunity to fix (some) things pretty quickly.

Wouldn’t it be interesting to see this deployed across multiple organizations’ collections and consolidated into a single view? You could search / navigate by similar objects across the whole collection universe.

Let us know if we can help.


Nice work Seb (and team). I’ve been musing on how we might use OpenCalais (and when), and whilst what you’ve done is obviously a first stab, to see what’s useful and what could be improved, it’s really exciting. Am I right in thinking that the tags from Calais aren’t necessarily terms found in that text? It looks that way, and I figured that’s how it’s useful, i.e. it doesn’t just identify terms of a given type, but returns related terms (Sydney or cellular telephone, for instance). In which case, do the links for each of those terms use the RDF you’ve stored as part of their search, or is this a straightforward OPAC search?
Good work, you’ve put down a marker (again)!

Hi Jeremy

Firstly, OpenCalais makes explicit the relationships between things, and this is what makes it really useful. These relationships do not pre-exist in an explicit way in most of our collection records, no matter how rich they are. It also makes a good stab at disambiguating some of the content.

Secondly, we use the RDF but currently don’t expose this in the front end – mainly because we’re still working on making it useful. The RDF becomes particularly useful when we start talking beyond our own PHM collection – because we have no control over external search, but do have a lot of control over internal search. And that’s what we’re working on now (and will have things to show very soon).

