API Collection databases Metadata open content Semantic Web

Things clever people do with your data #65535: Introducing ‘Free Your Metadata’

Last year Seth van Hooland at the Free University Brussels (ULB) approached us to look at how people used and navigated our online collection.

A few days ago Seth and his colleague Ruben Verborgh from Ghent University launched Free Your Metadata – a demonstrator site for showing how even irregular metadata can have value to others and how, if it is released rather than clutched tightly onto (until that mythical day when it is ‘perfect’), it can be cleaned up and improved using new software tools.

What’s awesome is that Seth & Ruben used the Powerhouse’s downloadable collection datafile as the test data for the project.

Here’s Seth and his team talking about the project.

F&N: What made the Powerhouse collection attractive for use as a data source?

Number one, it’s available for everyone and therefore our experiment can be repeated by others. Otherwise, the records are very representative for the sector.

F&N: Was the data dump more useful than the Collection API we have available?

This was purely due to the way Google Refine works: on large amounts of data at once. But also, it enables other views on the data, e.g., to work in a column-based way (to make clusters). We’re currently also working on a second paper which will explain the disadvantages of APIs.

F&N: What sort of problems did you find with our collection?

Sometimes really broad categories. Other inconveniences could be solved in the cleaning step (small textual variations, different units of measurement). All issues are explained in detail in the paper (which will be published shortly). But on the whole, the quality is really good.
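To give a feel for how clustering catches ‘small textual variations’, here is a minimal Python sketch of key-collision clustering, loosely modelled on the fingerprint idea that Google Refine offers. It is a simplification rather than Refine’s actual implementation, and the sample values are made up for illustration, not taken from the Powerhouse dataset.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalise a string so that small textual variations collide."""
    # Trim, lowercase and strip ASCII punctuation.
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate and sort the tokens.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share the same fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only keys with more than one distinct spelling are worth reviewing.
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Hypothetical category values, for illustration only.
sample = ["Glass plate negative", "glass plate negative.", "Negative, glass plate", "Photograph"]
print(cluster(sample))
```

Each cluster is only a suggestion: a person still decides which spelling becomes the preferred one.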

F&N: Why do you think museums (and other organisations) have such difficulties doing simple things like making their metadata available? Is there a confusion between metadata and ‘images’ maybe?

There is a lot of confusion about what the best way is to make metadata available. One of the goals of the Free Your Metadata initiative is to put forward best practices to do this. Institutions such as libraries and museums have a tradition of only publishing information which is 100% complete and correct, which is more or less impossible in the case of metadata.

F&N: What sorts of things can now be done with this cleaned up metadata?

We plan to clean up, reconcile, and link several other collections to the Linked Data Cloud. That way, collections are no longer islands, but become part of the interlinked Web. This enables applications that cross the boundaries of a single collection. For example: browse the collection of one museum and find related objects in others.

F&N: How do we get the cleaned up metadata back into our collection management system?

We can export the result back as TSV (like the original result) and e-mail it. Then, you can match the records with your collection management system using record IDs.
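Matching a cleaned TSV export back onto the original records by ID could look something like the Python sketch below. The column names (`record_id`, `category`) and the sample rows are assumptions for illustration; a real collection management system will have its own import route and field names.

```python
import csv
import io


def load_tsv(text, id_field):
    """Read TSV text into a dict of rows keyed by record ID."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[id_field]: row for row in reader}


def merge_cleaned(original, cleaned, fields):
    """Copy the listed fields from cleaned rows onto matching original rows.

    Returns the IDs found in the cleaned file but not in the original,
    so that they can be checked by hand.
    """
    unmatched = []
    for record_id, cleaned_row in cleaned.items():
        target = original.get(record_id)
        if target is None:
            unmatched.append(record_id)
            continue
        for field in fields:
            target[field] = cleaned_row[field]
    return unmatched


# Hypothetical column names and values, for illustration only.
original_tsv = "record_id\ttitle\tcategory\n1\tTeapot\tceramics \n2\tChair\tFurniture\n"
cleaned_tsv = "record_id\ttitle\tcategory\n1\tTeapot\tCeramics\n3\tLamp\tLighting\n"

original = load_tsv(original_tsv, "record_id")
cleaned = load_tsv(cleaned_tsv, "record_id")
unmatched = merge_cleaned(original, cleaned, ["category"])
print(original["1"]["category"], unmatched)
```

Only the fields you name are overwritten, which keeps the cleaned values from clobbering anything edited locally in the meantime.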

Go and explore Free Your Metadata and play with Google Refine on your own ‘messy data’.

If you’re more nerdy you probably want to watch their ‘cleanup’ screencast where they process the Powerhouse dataset with Google Refine.

Collection databases Digitisation Metadata Powerhouse Museum websites

Australian Dress Register – public site goes live

The first iteration of the public front end of the Australian Dress Register went live a few weeks back. This release makes visible much of the long data gathering process with regional communities that began in 2008 and continues as more garments are added to the Register over time.

The ADR is a good example of a distributed collection – brought together through regional partnerships. Many of the garments on the site are held by small regional museums or, in some cases, private collectors and families. It is only through their rigorous documentation and then aggregation that it becomes possible to tell the national stories that relate to changes in clothing over the last 200 years.

The ADR extends the standard collection metadata schema that we use for documentation at the Powerhouse with a large range of specific data fields for garment measurements and the quality of preservation. These have been added to allow costume and social history researchers to explore the data in greater detail and granularity. A good way to see the extra level of detail in the ADR is to compare a record on ADR with the same object record in the host institution’s own collection (where it is available online).

Here’s the child’s fancy dress costume from 1938 on the Powerhouse site, side by side with the same object on the ADR. (Click to view the full records)

The Resources section of the site helps volunteers and contributors without the capacity of the major capital city museums to better understand best practice methods of preserving, documenting and digitising their garments, and includes a range of simple how-to videos.

The Browse and Search uses Solr on the backend and offers extensive faceting (Here’s just the discoloured garments with buttons). There are multiple views for search results with configurable list and grid views, and relevance, recency and alphabetical result ordering.
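For the curious, a faceted Solr request of this general kind can be sketched in Python as below. The `q`, `facet`, `facet.field` and `fq` parameters are standard Solr query parameters, but the core URL and the field names (`condition`, `fastening`) are hypothetical and are not the ADR’s actual schema.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields, filters=None):
    """Build a Solr select URL with faceting and optional filter queries."""
    params = [("q", text), ("wt", "json"), ("facet", "true")]
    # Ask Solr to return value counts for each facet field.
    params += [("facet.field", field) for field in facet_fields]
    # Each filter query narrows the result set without affecting scoring.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core URL and field names, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr/garments",
    "*:*",
    ["condition", "fastening"],
    {"condition": "discoloured", "fastening": "buttons"},
)
print(url)
```

Clicking a facet value in a browse interface typically just adds another `fq` clause like these to the request.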

The Timeline is one of the visual highlights of the site, along with being rather cool from a technical perspective too. As the collection grows the Timeline and Browsing features will become more valuable to traverse the rich content.

There’s a lot more to go with this site and you’ll be seeing many more records contributed from around the country over the coming months.

API Collection databases Conceptual Interviews Metadata

Making use of the Powerhouse Museum API – interview with Jeremy Ottevanger

As part of a series of ‘things people do with APIs’ here is an interview I conducted with Jeremy Ottevanger from the Imperial War Museum in London. Jeremy was one of the first people to sign up for an API key for the Powerhouse Museum API – even though he was on the other side of the world.

He plugged the Powerhouse collection into a project he’s been doing in his spare time called Mashificator which combines several other cultural heritage APIs.

Over to Jeremy.

Q – What is Mashificator?

It’s an experiment that got out of hand. More specifically, it’s a script that takes a bit of content and pulls back “cultural” goodies from museums and the like. It does this by using a content analysis service to categorise the original text or pull out some key words, and then using some of these as search terms to query one of a number of cultural heritage APIs. The idea is to offer something interesting and in some way contextually relevant – although whether it’s really relevant or very tangential varies a lot! I rather like the serendipitous nature of some of the stuff you get back but it depends very much on the content that’s analysed and the quirks of each cultural heritage API.
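The general shape of that pipeline (analyse some text, pick out terms, turn them into API searches) might be sketched in Python like this. It is not the Mashificator’s code, which is PHP and jQuery: the word-frequency function is a crude stand-in for a real content analysis service, and the endpoints are hypothetical placeholders.

```python
import re
from collections import Counter
from urllib.parse import urlencode

# A tiny stop-word list; a real analysis service does far more than this.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "was", "is", "on", "for", "by", "with"}


def extract_keywords(text, limit=3):
    """Crude stand-in for a content analysis service: most frequent words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS and len(word) > 3)
    return [word for word, _ in counts.most_common(limit)]


def build_search_urls(keywords, endpoints):
    """Turn the keywords into one search URL per (hypothetical) collection API."""
    query = " ".join(keywords)
    return {name: f"{base}?{urlencode({'q': query})}" for name, base in endpoints.items()}


# Hypothetical endpoints; real APIs differ in their parameters and need keys.
ENDPOINTS = {
    "museum_a": "https://api.example.org/a/search",
    "museum_b": "https://api.example.org/b/items",
}

selected_text = "The locomotive was built in Sydney. The locomotive hauled wool and wheat; wool was king."
keywords = extract_keywords(selected_text)
urls = build_search_urls(keywords, ENDPOINTS)
print(keywords)
print(urls)
```

Swapping in a different analysis step or a different set of endpoints changes what comes back, which is the test-bed quality described in the interview.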

There are various outputs but my first ideas were around a bookmarklet, which I thought would be fun, and I still really like that way of using it. You could also embed it in a blog, where it will show you some content that is somehow related to the post. There’s a WordPress plugin from OpenCalais that seems to do something like this: it tags and categorises your post and pulls in images from Flickr, apparently. I should give it a go! Zemanta and Adaptive Blue also do widgets, browser extensions and so on that offer contextually relevant suggestions (which tend to be e-commerce related) but I’d never seen anything doing it with museum collections. It seemed an obvious mashup, and it evolved as I realised that it’s a good way to test-bed lots of different APIs.

What I like about the bookmarklet is that you can take it wherever you go, so whatever site you’re looking at that has content that intrigues you, you can select a bit of a page, click the bookmarklet and see what the Mashificator churns out.

Mashificator uses a couple of analysis/enrichment APIs at the moment (Zemanta and Yahoo! Terms Extractor) and several CH APIs (including the Powerhouse Museum of course!) One could go on and on but I’m not sure it’s worth while: at some point, if this is helpful to anyone, it will be done a whole lot better. It’s tempting to try to put a contextually relevant Wolfram Alpha into an overlay, but that’s not really my job, so although it would be quite trivial to do geographical entity extraction and show a map of the results, for example, it’s going too far beyond what I meant to do in the first place so I might draw the line there. On the other hand, if the telly sucks on Saturday night, as it usually does, I may just do it anyway.

Beside the bookmarklet, my favourite aspect is that I can rapidly see the characteristics of the enrichment and content web services.

Q – Why did you build it?

I built it because I’m involved with the Europeana project, and for the past few years I’ve been banging the drum for an API there. When they had an alpha API ready for testing this summer they asked people like me to come up with some pilots to show off at the Open Culture conference in October. I was a bit late with mine, but since I’d built up some momentum with it I thought I may as well see if people liked the idea. So here you go…

There’s another reason, actually, which is that since May (when I started at the Imperial War Museum) it’s been all planning and no programming so I was up for keeping my hand in a bit. Plus I’ve done very little PHP and jQuery in the past, so this project has given me a focussed intro to both. We’ll shortly be starting serious build work on our new Drupal-based websites so I need all the practice I can get! I’m still no PHP guru but at least I know how to make an array now…

Q – Most big institutions have had data feeds – OAI etc – for a long time now, so why do you think APIs are needed?

Aggregation (OAI-PMH’s raison d’etre) is great, and in many ways I prefer to see things in one place – Europeana is an example. For me as a user it means one search rather than many, similarly for me as a developer. Individual institutions offering separate OPACs and APIs doesn’t solve that problem, it just makes life complicated for human or machine users (ungrateful, aren’t I?).

But aggregation has its disadvantages too: data is resolved to the lowest common denominator (though this is not inevitable in theory); there’s the political challenge of getting institutions to give up some control over “their” IP; the loss of context as links to other content and data assets are reduced. I guess OAI doesn’t just mean aggregation: it’s a way for developers to get hold of datasets directly too. But for hobbyists and for quick development, having the entirety of a dataset (or having to set up an OAI harvester) is not nearly as useful or viable as having a simple REST service to programme against, which handles all the logic and the heavy lifting. And conversely for those cases where the data is aggregated, that doesn’t necessarily mean there’ll be an API to the aggregation itself.

For institutions, having your own API enables you to offer more to the developer community than if you just hand over your collections data to an aggregator. You can include the sort of data an aggregator couldn’t handle. You can offer the methods that you want as well as the regular “search” and “record” interfaces, maybe “show related exhibitions” or “relate two items” (I really, really want to see someone do this!) You can enrich it with the context you see fit – take Dan Pett’s web service for the Portable Antiquities Scheme in the UK, where all the enrichment he’s done with various third party services feeds back into the API. Whether it’s worthwhile doing these things just for the sake of third party developers is an open question, but really an API is just good architecture anyway, and if you build what serves your needs it shouldn’t cost that much to offer it to other developers too – financially, at least. Politically, it may be a different story.

Q – You have spent the past while working in various museums. Seeing things from the inside, do you think we are nearing a tipping point for museum content sharing and syndication?

I am an inveterate optimist, for better or worse – that’s why I got involved with Europeana despite a degree of scepticism from more seasoned heads whose judgement I respect. As that optimist I would say yes, a tipping point is near, though I’m not yet clear whether it will be at the level of individual organisations or through massive aggregations. More and more stuff is ending up in the latter, and that includes content from small museums. For these guys, the technical barriers are sometimes high but even they are overshadowed by the “what’s the point?” barriers. And frankly, what is the point for a little museum? Even the national museum behemoths struggle to encourage many developers to build with their stuff, though there are honourable exceptions and it’s early days still – the point is that the difficulty a small museum might have in setting up an API is unlikely to be rewarded with lots of developers making them free iPhone apps. But through an aggregator they can get it in with the price.

One of my big hopes for Europeana was that it would give little organisations a path to get their collections online for the first time. Unfortunately it’s not going to do that – they will still have to have their stuff online somewhere else first – but nevertheless it does give them easy access both to audiences and (through the API) to third party developers that otherwise would pay them no attention. The other thing that CHIN, Collections Australia, Digital NZ, Europeana and the like do, is offer someone big enough for Google and the like to talk to. Perhaps this in itself will end up with us settling on some de facto standards for machine-readable data so we can play in that pool and see our stuff more widely distributed.

As for individual museums, we are certainly seeing more and more APIs appearing, which is fantastic. Barriers are lowering, there’s arguably some convergence or some patterns emerging for how to “do” APIs, we’re seeing bold moves in licensing (the boldest of which will always be in advance of what aggregators can manage) and the more it happens the more it seems like normal behaviour that will hopefully give others the confidence to follow suit. I think as ever it’s a matter of doing things in a way that makes each little step have a payoff. There are gaps in the data and services out there that make it tricky to stitch together lots of the things people would like to do with CH content at the moment – for example, a paucity of easy and free to use web services for authority records, few CH thesauri, no historical gazetteers. As those gaps get filled in the use of museum APIs will gather pace.

Ever the optimist…

Q – What is needed to take ‘hobby prototypes’ like Mashificator to the next level? How can the cultural sector help this process?

Well in the case of the Mashificator, I don’t plan a next level. If anyone finds it useful I suggest they ask me for the code or do it themselves – in a couple of days most geeks would have something way better than this. It’s on my free hosting and API rate limits wouldn’t support it if it ever became popular, so it’s probably only ever going to live in my own browser toolbar and maybe my own super-low-traffic blog! But in that answer you have a couple things that we as a sector could do: firstly, make sure our rate limits are high enough to support popular applications, which may need to make several API calls per page request; secondly, it would be great to have a sandbox that a community of CH data devotees could gather around/play in. And thirdly, in our community we can spread the word and learn lessons from any mashups that are made. I think actually that we do a pretty good job of this with mailing lists, blogs, conferences and so on.

As I said before, one thing I really found interesting with this experiment was how it let me quickly compare the APIs I used. From the development point of view some were simpler than others, but some had lovely subtleties that weren’t really used by the Mashificator. At the content end, it’s plain that the V&A has lovely images and I think their crowd-sourcing has played its part there, but on the other hand if your search term is treated as a set of keywords rather than a phrase you may get unexpected results… YTE and Zemanta each have their own characters, too, which quickly become apparent through this. So that test-bed thing is really quite a nice side benefit.

Q – Are you tracking use of Mashificator? If so, how and why? Is this important?

Yes I am, with Google Analytics, just to see if anyone’s using it, and if when they come to the site they do more than just look at the pages of guff I wrote – do they actually use the bookmarklet? The answer is generally no, though there have been a few people giving it a bit of a work-out. Not much sign of people making custom bookmarklets though, so that perhaps wasn’t worthwhile! Hey, lessons learnt.

Q – I know you, like me, like interesting music. What is your favourite new music to code-by?

Damn right, nothing works without music! (at least, not me.) For working, I like to tune into WFMU, often catching up on archive shows by Irene Trudel, Brian Turner & various others. That gives me a steady stream of quality music familiar and new. As for recent discoveries I’ve been playing a lot (not necessarily new music, mind), Sharon van Etten (new), Blind Blake (very not new), Chris Connor (I was knocked out by her version of Ornette Coleman’s “Lonely Woman”, look out for her gig with Maynard Ferguson too). I discovered Sabicas (flamenco legend) a while back, and that’s a pretty good soundtrack for coding, though it can be a bit of a rollercoaster. Too much to mention really but lots of the time I’m listening to things to learn on guitar. Lots of Nic Jones… it goes on.

Go give Mashificator a try!

Collection databases Metadata Semantic Web

OPAC – Connecting collections to WorldCat Identities

If you were at the National Library of Australia’s annual meeting a while back then you might have spotted Thom Hickey from OCLC mentioning that the Powerhouse Museum has started to use the WorldCat Identities to connect people in the collection to their identity records and library holdings in WorldCat.

This is now public in an early alpha form.

Here’s an example from a collection record.

Tucked away in the automatically generated metadata (using Open Calais) are some links from people to their WorldCat Identities record over at WorldCat – if such a record exists. At the moment there isn’t a lot of disambiguation between people of the same name going on, so there are quite a few false positives.

In this example, Geoffrey C Ingleton now links to his record on WorldCat Identities.

In the alpha stage all this means is that visitors can now connect from a collection record to the name authority file and thence, on WorldCat, to library holdings (mostly books) by or about the people mentioned in that collection record. Later you’ll be able to do a whole lot more . . . we’re using the WorldCat API and we’ve got a jam-packed development schedule over the next few summer months (it is cooler in the office than out of it!).

Not only does this allow visitors to find more, it also allows the Powerhouse to start to add levels of ranking to the person data identified by Open Calais – an important step in putting that auto-generated metadata to better use. Equally importantly, it opens the door to a whole new range of metadata that can be associated with an object record in our collection. Consider the possibilities for auto-generated bibliographies, or even library-generated additional classification metadata.

For those excited by the possibilities offered by combining the collective strengths of each partner in the LAM (libraries, archives, museums) nexus, this should be a good example of a first step towards mutual metadata enhancement.

We’re also very excited about the possibilities that the National Library of Australia’s People Australia project holds in this regard too.

Metadata open content Web 2.0

DigitalNZ – API access to New Zealand collections launches

One of the best things I saw at the National Digital Forum in Auckland last week was DigitalNZ. Being a Kiwi myself, I am immensely proud that New Zealand has leapt forward and produced a federated collection product that aggregates and then allows access through a web interface and an open API. That it has brought together very disparate partners is also very impressive.

I spoke to the team behind the very choice project who are based at the National Library of New Zealand – Virginia Gow, Courtney Johnston, Gordon Paynter, Fiona Rigby, Lewis Brown, Andy Neale, Karen Rollitt – who all contributed the following interview.

Tell me about the genesis of DigitalNZ?

Digital New Zealand Ā-Tihi o Aotearoa is an ongoing programme that is in its implementation project phase. It emerged as a response to the difficulties many New Zealand public and community organisations faced in making their content visible to New Zealanders amid the swell of international content available on the Web. In 2007 it received four years of government funding as a flagship action of the New Zealand Digital Content Strategy, Creating a Digital New Zealand.

The Wave 1 implementation project has been led by National Library but is a very collaborative project. We’ve got representatives from education, culture and heritage, broadcasting, geospatial information, Te Puni Kokiri (Ministry of Māori Development) and the National Digital Forum on our Steering Committee. The project began earlier this year and then really ramped up in June 2008. The project aimed to set up the ongoing infrastructure for the programme and to deliver exemplars that demonstrate what is possible when there is concerted work to improve access and discovery of New Zealand content.

We’ve taken a test lab approach – we’ve identified and worked on potential solutions to some of the many issues that prevent access, use and discovery of New Zealand digital content. Some of these areas have included licensing, metadata quality, improving access to advice around standards, formats and protocols and the development of a draft framework to help organisations prioritise content for digitisation.

It is important to us that DigitalNZ isn’t seen as ‘just another website’.

We are working with New Zealand organisations, communities and individuals to aggregate their metadata and help make hard to find content available for discovery, use and re-use in unique ways.  We’ve developed three innovative tools that are ‘powered by Digital New Zealand’ and fuelled by the metadata and content from the many different content providers that we’re working with.

DigitalNZ is made up of:

1) A customisable search builder lets people design their own search widget to find the type of New Zealand content they are interested in – be it antique cars, pavlova or moustaches! People can flavour it and embed it on their own blogs and websites. We developed this to show new ways for people to discover and interact with New Zealand content and we especially wanted people to use the tools how and where they wanted. 

2) New Zealanders can craft their own commemoration of the 90th Anniversary of the Armistice using the Memory Maker – a tool that lets people remix film, photographs, objects and audio clips into a short video that can then be saved, shared, and embedded.  This example is helping us show what is possible when you can free the licensing of publicly available content for reuse and remixing. 

3) We’re using ‘API Magic’. We’ve developed an open API that enables developers to connect the metadata that fuels DigitalNZ with other data sources, enabling new digital experiences, knowledge, and understanding.

How did you manage to get each of the content owners to buy in to the project?

By lots and lots of talking, visiting, sharing and blogging!

We started by identifying and contacting a wide range of New Zealand content providers, building also on our existing professional networks and contacts as far as possible because time wasn’t a luxury on this project.

It was hard work because DigitalNZ was a completely abstract concept for many content providers until a few weeks ago. We didn’t even have that snazzy diagram explaining how it all fits together until we had gone live!

[That’s a cool magic hat!]

So we basically just committed ourselves to communicating (and communicating and communicating), being open with our information and honest about what we did and didn’t know each step of the way, and helping people out so it was as easy as possible for them to participate.

Content providers took different amounts of time to reach an ‘ah ha’ moment with us and to realise what this could potentially mean for them – “OK, so you’re like a giant free referral engine to my content” or “So I could basically use your tools to make my own search box for my site”. At the end of the day we aren’t doing this for us!

Face-to-face meetings were the most effective, as it meant we could just chat with people about the issues and problems we are all trying to solve. It was a great way for us to learn about people’s content too.

But we also glued ourselves to our inboxes and set up a private DigitalNZ content blog so content providers could talk directly to each other. The discussion of issues around licensing, for example, was great because it meant we didn’t have to do all the thinking and talking!

The private blog also allowed us to share sneak previews of wireframes and functionality that helped us build a better picture of what we were doing.

In the end we actually got more content providers to take a leap of faith with us than we were able to process in time for launch. There is a real commitment out there to increasing access to and use of New Zealand content. We just convinced people to take it a step further and try something new.

What technologies are you using behind the scenes?

The DigitalNZ Harvesting model is best described by this diagram that our Technical Architect Gordon Paynter has whipped up.

We started out hoping that OAI-PMH would be the best way to harvest. However, very few organisations are set up to do this and it was clear that we had to work on something easier. So we then worked on setting up harvesters using XML Sitemaps and also RSS feeds. The majority of our content contributors are using the XML sitemaps option.
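As an illustration of why sitemaps are the easier option, reading the record URLs out of a sitemap takes only a few lines of Python. This is a sketch rather than DigitalNZ’s harvester (which is written in Java and PHP); the namespace comes from the sitemaps.org protocol, while the sample URLs are made up.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (url, lastmod) pairs from a sitemap document."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        # lastmod is optional in the protocol, so it may be None.
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod))
    return entries


# Hypothetical sitemap content, for illustration only.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/records/1</loc><lastmod>2008-11-01</lastmod></url>
  <url><loc>https://example.org/records/2</loc></url>
</urlset>"""

entries = parse_sitemap(sample)
print(entries)
```

A harvester would then fetch each listed page and extract whatever metadata it exposes, using `lastmod` where present to skip pages that have not changed.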

The DigitalNZ system is in 3 parts: a backend, a metadata store and a front end. The backend harvester is written in Java and some legacy PHP (based on the open-source PKP Harvester project). The metadata store is in MySQL, using Solr to power the search. The front end, including the API, the website, widgets and so on, is in Ruby on Rails. It also uses the open source Radiant CMS.

We’ve also set up a DigitalNZ Kete for organisations to upload any content that doesn’t have a home elsewhere on the web. Kete (basket in Māori) is an open source community repository system (built on Ruby on Rails) that we can harvest using OAI-PMH.
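For comparison with the sitemap route, an OAI-PMH harvest is a loop of `ListRecords` requests that follows the resumption token until none is returned. The Python sketch below only builds the request URLs and reads the token; the verb, the `oai_dc` metadata prefix and the XML namespace come from the OAI-PMH 2.0 specification, but the base URL is hypothetical and is not a real Kete endpoint.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # A resumption token is sent on its own, without other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml):
    """Return the resumption token from a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return token.strip() or None


# Hypothetical base URL and response, for illustration only.
BASE_URL = "https://kete.example.org/oai"
sample_response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url(BASE_URL)
token = next_token(sample_response)
second_url = list_records_url(BASE_URL, resumption_token=token)
print(first_url)
print(second_url)
```

The contributor has to run a repository that speaks this protocol, which is the setup cost that made sitemaps and RSS the more popular choices.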

One of the great things about Kete is the built-in Creative Commons licensing structure – our ‘technology’ (in the sense of tools) for dealing with the issue of uncertainty around what people can and can’t do with content.

We extended this by adding in the “No known copyright restrictions” statement as well – taking a leaf out of the wonderful Flickr Commons book. A number of Aotearoa People’s Network Libraries are using Kete to gather community treasures and we are including that content in DigitalNZ as it comes online.

The Memory Maker uses the Ideum Editor One which we have sitting on our server in Christchurch, New Zealand.

We’ve worked with three vendors (Boost New Media, 3Months and Codec) and have taken an agile development approach using Scrum. This was a very successful way of working and it enabled us to complete our development within 16 weeks from go to whoa. It was fast, furious and an absolute blast!

The search widget is really great – how are you expecting this to be used?

We think that it is going to be really useful in education for teachers to use to define project resources or for kids to build into their own online projects. We also see application for libraries, museums and other organisations to use for setting up ‘find more’ options relating to specific exhibitions, topics or events. We’ve also had feedback from some content providers that they are considering it as their primary website search. We’re pretty delighted with that! We also really hope to see some unexpected uses as well.

Tell me something about the Memory Maker?

We think that these guys can tell you about the Memory Maker much better than us!

We ran the ‘Coming Home’ Memory Maker campaign to demonstrate what is possible when content providers ‘free up’ selected public cultural content for people to remix with permission; and used the remix editor to deliver the content to users. We filled the Memory Maker with content relating to celebrations for the 90th anniversary of Armistice Day on 11 November 2008. A number of National Digital Forum partners provided the content and the Auckland War Memorial Museum has been the wonderful host.

We’ve been delighted to watch as schools and other web users make their own commemorative videos out of New Zealand digital content – not by stealing it, but because they know they are allowed to and we made it easy for them.

Our detailed case study of the Memory Maker project describes all of the details and issues we worked with.
We’re hoping to work with others on new remix campaigns in the future.

What sorts of mashups are you hoping other developers will build using the API?

We’ve got a couple already – check out the Widget Gallery for Yahoo Pipes mashups of the DigitalNZ search combined with Flickr and also a headlines search of StuffNZ (NZ website of newspapers and online news) over the DigitalNZ metadata.

We don’t have any specific expectations – just excitement about what is possible. We want to be surprised by what people come up with. The whole point of putting the open API out there is to drive others to make new, exciting things with the content that we’ve made available. DigitalNZ wants to share the love!

Go ahead and make us and our content providers happy.

We’re hoping that when people develop new things they’ll let us know so that we can share them with others through the widget gallery.

What other kind of work is DigitalNZ doing?

Another very important aspect of DigitalNZ is that we want to work with NZ organisations to improve understanding and knowledge about how to make their content easier to find, share and use. One of the issues that we’ve come up against was metadata quality. The search tool has shown that search results can only be as good as the quality of the metadata that goes in. Working with people to improve their metadata will make the API stronger and also improve the discovery experiences for people.

The Contributors’ section of the site provides guidance on how to participate in DigitalNZ as well as good practice advice on content creation and management. The good practice guides are being developed across all aspects of the digital content lifecycle, from selecting and creating through to managing, discovering, using and reusing (including licensing), as well as preservation. We’re interested in hearing from people that might be able to share expertise and perhaps help build up the material on the site.

We’re also working on an advisory service that will provide support and guidance across the spectrum of content creation and management issues that organisations are facing. This will be developed further over 2009 and we hope to include information, training and development, peer support, discussion forums, as well as draw on the knowledge of collective experience and wisdom out there.

Go and take a look at DigitalNZ!

Metadata Tools

A web citation tool – dealing with impermanent references

We’re all working hard to ensure that our own content is identified with persistent URLs – a referrer that will stand the test of time – but often when we are writing a paper we need to refer to someone else’s URL, most of which are not designed to be permanent.

Traditionally when we reference something on a website we put ‘accessed on X date’ but that is of little use to a reader who follows up a reference only to find the original has moved or gone.

That’s where WebCite comes in. WebCite is a bit like TinyUrl or any number of URL shortening services, or a social bookmarking tool, combined with a snapshotting tool. It provides a ‘shorter’ URL and it also keeps a copy of the entire page you have cited in its archive. This means that readers can read the exact same page, as it was when you were referencing it, at any time into the future – even if that page changes regularly (like the front page of a newspaper website).

You can also add custom DC metadata.
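The idea of a short citation link that points at a frozen copy can be sketched in a few lines of Python. This is only an illustration of the concept, not how WebCite itself is built: the `SnapshotArchive` class, its method names and the ten-character ID are all assumptions made up for this example.

```python
import hashlib
from datetime import datetime, timezone


class SnapshotArchive:
    """Toy in-memory archive (hypothetical, not WebCite's real design)."""

    def __init__(self):
        self._snapshots = {}

    def cite(self, url, page_content, cited_at=None):
        """Store a copy of the page as it is now and return a short ID."""
        cited_at = cited_at or datetime.now(timezone.utc)
        # The ID depends on both the URL and the citation time, so citing
        # the same URL on different days yields different snapshots.
        key = f"{url}|{cited_at.isoformat()}".encode("utf-8")
        short_id = hashlib.sha256(key).hexdigest()[:10]
        self._snapshots[short_id] = {
            "url": url,
            "cited_at": cited_at,
            "content": page_content,
        }
        return short_id

    def retrieve(self, short_id):
        """Return the page exactly as it was when it was cited."""
        return self._snapshots[short_id]["content"]


archive = SnapshotArchive()
monday = archive.cite(
    "http://example.org/front-page",
    "Monday's headlines",
    datetime(2008, 1, 7, tzinfo=timezone.utc),
)
# The live page changes, and someone else cites it a day later.
tuesday = archive.cite(
    "http://example.org/front-page",
    "Tuesday's headlines",
    datetime(2008, 1, 8, tzinfo=timezone.utc),
)
print(archive.retrieve(monday))
```

A reader following the Monday citation still gets Monday's page, whatever the live site has done since.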

Here’s a WebCite capture of the Sydney Morning Herald’s front page as it was at the time of this post.

As you can see there are some problems in that it has been unable to capture the CSS to lay out the page properly, but for references to the text contained in a page it does a pretty good job.

Here’s a capture of an article from an online journal, D-Lib, which being predominantly text, works better.

There’s even a bookmarklet to add to your browser toolbar to make capturing even easier. Otherwise use the service manually via their archiving submission page. A submission takes about 20 seconds to capture.

Metadata Semantic Web

Collaborative collective classification – BBC Labs on using Wikipedia as metadata

Chris Sizemore at the BBC’s Radio Labs demonstrates an experiment in automated metadata, much akin to Open Calais.

Sizemore has taken Wikipedia and has built a simple web application that uses Wikipedia to disambiguate entities in a block of text and suggest broad categories for the content. Because Wikipedia has broad coverage of topics and deep coverage of specific niches, it can provide, as Sizemore writes, for some areas (especially popular culture), a good enough data source for automated classification.

Here’s Sizemore’s methodology –

1. Download entire contents of the English language Wikipedia (careful, that’s a large 4GB+ xml file!)

2. Parse that compressed XML file into individual text files, one per Wikipedia article (and this makes things much bigger, to the tune of 20GB+, so make sure you’ve got the hard drive space cleared)

3. Use a Lucene indexer to create a searchable collection (inc. term vectors) of your new local Wikipedia text files, one Lucene document per Wikipedia article

4. Use Lucene’s ‘MoreLikeThis’ to compare the similarity of a chunk of your own text content to the Wikipedia documents in your new collection

5. Treat the ranked Wikipedia articles returned as suggested categories for your text

Basically what is going on here is that the text you wish to classify is compared to Wikipedia articles and the articles with the ‘closest match’ in terms of content, have their URLs thrown back as potential classification categories.
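The matching step can be sketched without Lucene at all. The Python below is a rough stand-in, assuming plain TF-IDF cosine similarity in place of Lucene's 'MoreLikeThis' and three made-up miniature 'articles' in place of the full Wikipedia dump; the function names and sample data are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(articles):
    """Return TF-IDF vectors (title -> {term: weight}) and the IDF table."""
    doc_counts = {title: Counter(tokenize(body)) for title, body in articles.items()}
    doc_freq = Counter()
    for counts in doc_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(articles)
    # Smoothed inverse document frequency: rarer terms weigh more.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        title: {t: tf * idf[t] for t, tf in counts.items()}
        for title, counts in doc_counts.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_categories(text, vectors, idf, top_n=3):
    """Rank article titles by similarity to the text; best match first."""
    counts = Counter(tokenize(text))
    # Words that never appear in the indexed articles are ignored.
    query = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = sorted(((cosine(query, vec), title) for title, vec in vectors.items()), reverse=True)
    return [title for score, title in scored[:top_n] if score > 0]


# Made-up stand-ins for Wikipedia articles (hypothetical sample data).
ARTICLES = {
    "Amateur radio": "amateur radio operators use radio transmitters and receivers to communicate on shortwave bands",
    "Steam locomotive": "a steam locomotive is a railway engine that burns coal to boil water and drive pistons",
    "Photography": "photography records images with a camera lens and light sensitive film or sensor",
}

VECTORS, IDF = build_index(ARTICLES)
suggestions = suggest_categories(
    "A home-built radio transmitter used by an amateur operator", VECTORS, IDF
)
print(suggestions)
```

The titles that come back play the role of the Wikipedia URLs in Sizemore's version: the best matching article is offered as the most likely category.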

Combine this with Open Calais and there will be some very interesting results across a broad range of text datasets.

As regular readers will know, we’ve been experimenting quite a bit with Open Calais at the Powerhouse with some exciting initial results. We’ve been looking at the potential of Calais in combination with other data sources including Wikipedia/dbPedia/Freebase and we’ll be watching Sizemore’s experiment with interest.

Perhaps my throwaway line in recent presentations that ‘humans should never have to create metadata’ might actually be becoming closer to a reality.

Folksonomies Metadata Web 2.0

24 hours later – Powerhouse on the Commons on Flickr

The first 24 hours of our presence on Commons on Flickr have been fascinating. I wrote about the launch yesterday but now let’s take a look at what has happened overnight.

In short, we’ve been excited by the response. Here are some quick figures.

Plenty of views (4777), and stacks of tags (175) – in such a short time. That’s more views in one day than the entire Tyrrell Collection would have previously gotten in a month. I’ve been really excited by the types of tags and the diversity of tags that have been added. One user has even added postcodes as tags. And, although we’ve had tagging available on our site for those same Tyrrell records, these tags far exceed those added on our own site in quantity and, arguably, quality. Obviously this has a lot to do with context.

Collection databases Developer tools Metadata

OPAC2.0 – OpenCalais meets our museum collection / auto-tagging and semantic parsing of collection data

Today we went live with another one of the new experimental features of our collection database – auto-generation of tags based on semantic parsing.

Throughout the Museum’s collection database you will now find, in the right hand column of the more recently acquired objects (see a quick sample list), a new cluster of content titled “Auto-generated tags”.

We have been experimenting with Reuters’ OpenCalais web service since it launched in January. Now we have made a basic implementation of it applied to records in our collection database, initially as a way of generating extra structured metadata for our objects. We can extract proper names, places (by continent, country, region, state and city), company names, technologies and specialist terms, from object records all without requiring cataloguers to catalogue in this way. Having this data extracted makes it much easier for us to connect objects by manufacturers, people, and places within our own collection as well as to external resources.

Here’s a brief description of OpenCalais, in a nutshell, from their FAQ:

From a user perspective it’s pretty simple: You hand the web service unstructured text (like news articles, blog postings, your term paper, etc) and it returns semantic metadata in RDF format. What’s happening in the background is a little more complicated.

Using natural language processing and machine learning techniques, the Calais web service looks inside your text and locates the entities (people, places, products, etc), facts (John Doe works for Acme Corp) and events (Jane Doe was appointed as a Board member of Acme Corp) in the text. Calais then processes the entities, facts and events extracted from the text and returns them to the caller in RDF format.

Whilst we store the RDF triples and unique hash, we are not making use of these beyond display right now. There is a fair bit of ‘cleaning up’ we have to do first, and we’d like to enlist your help so read on.

Obviously the type of content that we are asking OpenCalais to parse is complex. Whilst it is ideally suited to the more technical objects in our collection as well as our many examples of product design, it struggles with differentiating between content on some object records.

Here is a good example from a recent acquisition of amateur radio equipment used in the 1970s and 1980s.

The OpenCalais tags generated are as follows –

The bad:

The obvious errors which need deleting are the classification of “Ray Oscilloscope” as a person (although that might be a good name for my next avatar!); “Amateur Microprocessor Teleprinter Over Radio” as a company; the rather sinister “Terminal Unit” as an organisation; and the meaningless “metal” as an industry term.

We have included a simple ‘X’ to allow users to delete the ones that are obviously incorrect and will be tracking its use.

These errors and others like them reveal OpenCalais’ history as Clearforest in the business world. The rules it applies when parsing text as well as the entities that it is ‘aware’ of are rooted in the language of enterprise, finance and commerce.

The good:

On the other hand, by making all this new ‘auto-generated’ tag data available, users can now traverse our collection in new ways, discovering connections between objects that previously remained hidden deep in blocks of text.

Currently, clicking any tag will return a search result for that term in the rest of our collection. In a few hours of demonstrations to registrars and cataloguers today, many new connections between objects were discovered, and people we didn’t expect to be mentioned in our collection documentation were revealed.
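To make the 'connections between objects' idea concrete, here is a minimal Python sketch of a tag index. It is not our production code: the object IDs, the tag values and the function names are all invented for illustration, and real OpenCalais output is RDF rather than simple pairs.

```python
from collections import defaultdict

# Hypothetical auto-generated tags: object ID -> set of (type, value) pairs.
AUTO_TAGS = {
    "obj-001": {("Company", "Acme Radio Co"), ("City", "Sydney")},
    "obj-002": {("Company", "Acme Radio Co"), ("Person", "Jane Doe")},
    "obj-003": {("City", "Sydney"), ("Person", "Jane Doe")},
    "obj-004": {("City", "Melbourne")},
}


def build_tag_index(auto_tags):
    """Invert the records: each tag maps to the set of objects carrying it."""
    index = defaultdict(set)
    for object_id, tags in auto_tags.items():
        for tag in tags:
            index[tag].add(object_id)
    return index


def objects_for_tag(index, tag):
    """What 'clicking a tag' would return: every object with that tag."""
    return sorted(index.get(tag, set()))


def related_objects(index, auto_tags, object_id):
    """Map each other object to the tags it shares with object_id."""
    shared = defaultdict(set)
    for tag in auto_tags[object_id]:
        for other in index[tag]:
            if other != object_id:
                shared[other].add(tag)
    return dict(shared)


TAG_INDEX = build_tag_index(AUTO_TAGS)
print(objects_for_tag(TAG_INDEX, ("City", "Sydney")))
print(related_objects(TAG_INDEX, AUTO_TAGS, "obj-001"))
```

Once the tags exist as structured data rather than words buried in a description, two objects that share a manufacturer or a person become linked with no extra cataloguing effort.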

Help us:

Have a play with the auto-tags and see what you can find. Feel free to delete incorrect auto-tags.

We will be improving their operation over the coming weeks, but hope that this is a useful demonstration of some of the potential lying dormant in rich collection records and a real world demonstration of what the ‘semantic web’ might begin to mean for museums. It is important to remember that there is no way that this structured data could be generated manually – the volume of legacy data is too large and the burden on curatorial and cataloguing staff would be too great.

Digitisation Metadata Web 2.0

Flickr Commons – mass exposure of historical images

As a lot of museums (and libraries) have been using Flickr in lightweight ways for various purposes from image storage to building community engagement for quite a while, it is exciting to see a new formal collaborative project between Flickr and a major institution launch.

Flickr Commons is a project between Flickr and the US Library of Congress. It provides a secondary point of access to some of the out-of-Copyright historical photo collections of the LoC. Whilst these photos have all existed on the LoC’s own website, they have been, like most image collections, only known to those audiences who are familiar with the work of the LoC already and are undertaking (‘serious’) research.

The project is beginning somewhat modestly, but we hope to learn a lot from it. Out of some 14 million prints, photographs and other visual materials at the Library of Congress, more than 3,000 photos from two of our most popular collections are being made available on our new Flickr page, to include only images for which no copyright restrictions are known to exist.

Placing these images on Flickr allows the images to reach a much broader audience and be connected with images of similar people, places and things in contemporary photography. Importantly, this audience’s labour can be engaged to assist in tagging and geolocating the images – work that the LoC is unable to do as efficiently or, presumably, as quickly.

As George Oates from Flickr writes

There are about 20 million unique tags on Flickr today. 20 million! They are the bread and butter of what makes our search work so beautifully. Simply by association, tags create emergent collections of words that reinforce meaning. You can see this in our clusters around words like tiger, sea, jump, or even turkey.

What if we could lend this wonderful power to some of the huge reference collections around the world? What if you could contribute your own description of a certain photo in, say, the Library of Congress’ vast photographic archive, knowing that it might make the photo you’ve touched a little easier to find for the next person?

This isn’t the first formal engagement between a library and Flickr. The National Library of Australia’s Picture Australia project set up a relationship to allow the community to upload contemporary images to Flickr and have them catalogued inside the NLA’s resource as well.

What is interesting about the Commons project is that it reverses this and releases back to the community a wealth of historical imagery that previously was hard to find. Flickr is a good match for the collections of images already available under this project – the pro-am photographic community is well represented in Flickr which bodes well for higher quality tagging and user generated content, Flickr already has a lot of ‘similar’ contemporary content with which these historical images can be linked, and of course Flickr’s API opens up some interesting possibilities for recombinatory projects.

No doubt many other organisations will be watching this closely to see what impact this has on the LoC’s reputation and image sales revenue. Those who hold similar collections will also want to see how their own image sales revenue is affected. Likewise, others will ask whether these public domain resources should now also be replicated out to other image services, and when more public domain collections will be uploaded in a similar fashion.

More over at the Flickr blog and the LoC’s own blog.