
More on museum datasets, un-comprehensive-ness, data mining

(Another short response post)

Thus far we’ve not had much luck with museum datasets.

Sure, some of us have made our own internal lives easier by developing APIs for our collection datasets, or generated some good PR by releasing them without restrictions. In a few cases enthusiasts have made mobile apps for us, or made some quirky web mashups. These are fine and good.

But the truth is that our data sucks. And by ‘our’ I mean the whole sector.

Earlier in the year when Cooper-Hewitt released their collection data on GitHub under a Creative Commons Zero license, we were the first in the Smithsonian family to do so. But as PhD researcher Mia Ridge found after spending a week in our offices trying to wrangle it, the data itself was not very good.

As I said at the time of release,

Philosophically, too, the public release of collection metadata asserts, clearly, that such metadata is the raw material on which interpretation through exhibitions, catalogues, public programmes, and experiences are built. On its own, unrefined, it is of minimal ‘value’ except as a tool for discovery. It also helps remind us that collection metadata is not the collection itself.

One of the reasons for releasing the metadata was simply to get past the idea that it was somehow magically ‘valuable’ in its own right. Curators and researchers know this already – they’d never ‘just rely on metadata’, they always insist on ‘seeing the real thing’.

Last week Jasper Visser pointed to one of the recent SIGGRAPH 2012 presentations which had developed an algorithm to look at similarities in millions of Google Street View images to determine ‘what architectural elements of a city made it unique’. I and many others (see Suse Cairns) loved the idea and immediately started to think about how this might work with museum collections – surely something must be hidden amongst those enormous collections that might be revealed with mass digitisation and documentation?

I was interested a little more than most because one of our curators at Cooper-Hewitt had just blogged about a piece of balcony grille in the collection from Paris. In the blogpost the curator wrote about the grille but, as one commenter quickly pointed out, didn’t provide a photo of the piece in its original location. Funnily enough, a quick Google search for the street address in Paris from which the grille had been obtained quickly revealed not only Google Street View of the building but also a number of photos on Flickr of the building specifically discussing the same architectural features that our curator had written about. Whilst Cooper-Hewitt had the ‘object’ and the ‘metadata’, the ‘amateur web’ held all the most interesting context (and discussion).

So then I began thinking about the possibilities for matching all the architectural features from our collections to those in the Google Street View corpus . . .

But the problem with museum collections is that they aren’t comprehensive – even if their data quality was better and everything was digitised.

As far as ‘memory institutions’ go, they are certainly no match for library holdings or archival collections. Museums don’t try to be comprehensive, and at least historically they haven’t been able to even consider being so. Or, as I’ve remarked before, it is telling that the memory institution that ‘acquired’ the Twitter archive was the Library of Congress and not a social history museum.

11 replies on “More on museum datasets, un-comprehensive-ness, data mining”

A lot to think about here. The question of “What is curation?” comes up often, but is usually framed as, “How much of what goes on outside of our institutions is curation?” This would seem to reframe that question as, “What else should we be curating that we currently are not?” What outside resources should we be drawing from and tying into to better perform our missions?

And how? Like you said, our data is a mess. How do we clean it up to afford better interactions with outside datasets? Who is going to do this? We need data architects and curators and probably some people who can do a little bit of both.

Perhaps what stands out most for me though is, how do we do this as a field and not just as individual institutions? This represents a non-trivial technical problem. Some museums can hire some more technical staff to fill these gaps and start the transition to a more connected museum (which is still a difficult problem in itself). Most cannot. And even those who can will only solve their own problems in the process. They are unlikely to establish any kind of real standard, de facto or otherwise, that can be deployed across multiple institutions. Can we count on our contractors and vendors to help us solve this problem? (I tend to think we cannot.)

We clearly need an adjustment to our institutional perspective. But what else do we need? Do we need a consortium? A standards body? An advisory council? An open source project? All or none of the above?

Seb, can you expand further on the problems that come up with non-comprehensive collections? Are you suggesting that museum data is not useful as representative of history because the collection is not comprehensive? If so, I suppose the information we could potentially garner from the collection(s) is chiefly about the collection itself, and not necessarily broader trends (although I’m not sure that’s true). Is this what you are thinking? There are some interesting ideas here, but I’d like to tease out the problem with more detail.

Like Matt, I do think this whole issue raises interesting questions about curation, even beyond just what else we should be curating. Where do selection and preservation come into it when the network almost does these things itself, in a way that is far more comprehensive than what museums can do?

Before getting to the question of how we start to solve these problems from a sector level, we might need to ask if we should. Realistically, is it actually worth anyone’s investment (time/money) to try to make museum cultural data usable when we do have so much other data available through which to make new knowledge? Does museum data need to be part of this?

I would say, yes. Museum data needs to be shared/shareable. I’ve mentioned this before, but our data is such a mess that it’s preventing us from making the best use of it internally. If nothing else, making our data presentable/usable to the outside world will force us to clean it up which will also make it more useful to us as well. I think sharing it is philosophically in line with our missions. And I think cleaning it up to make it shareable is a necessity to make it more usable for us even if we don’t want to share it. The mess is a problem regardless of whether or not we want to share it.

Given the ad hoc nature of previous collecting, much of what we can gather from the collections data is more about how and what we once collected – the assertions of power, the power of the ‘choosing’, the power of ‘representation’. This didn’t matter much when we couldn’t see what we didn’t collect. But now we can see, at least to some extent, because we can finally begin to see what others did.

This isn’t a bad problem to have – we knew that museums weren’t collecting ‘everything’. But perhaps we had more faith in the decision making skills of those who ‘did the collecting for us’. (I’m talking about ‘national’ collections here, not those of wealthy philanthropists that ended up in museums – we knew they weren’t ‘representative’).

Should they be representative? It depends on your institutional mission.

Museums have spoken a lot about using social media and ‘wisdom of the crowds’ to connect with visitors, but perhaps the bigger challenge is to use those same tools to better ‘collect the zeitgeist’. But then again our warehouses are all full already.

Yep – I thought that’s what you were getting at, but just wanted to clarify. I think this is one reason that so much of what I have imagined as the purposes or possible uses for museum datasets is to tell us more about the collection and how it relates to other collections/datasets, and less about history in general per se.

Matt, you make a good point about the mess being a problem regardless of sharing. Are there any other sectors that have faced this kind of issue from a field-wide perspective and come up with a usable solution?

Hi Seb,
I am troubled by your troubles, and I share an abiding interest in these troubles as you know. There is much trouble in this troubled space. Honestly, I think you’re asking too much of museum collection data and may benefit from seeing the museum description as just a format. From where I’m sitting at the moment it is just another cultural form (format). I’m seeing it as a distant reading and recording (thanks Franco Moretti) of data we see embedded in the object, in the maker’s notes or recordings of their thinking, in stories and in curatorial files and museum worker minds in museums.
Really, the value in museum collection data is in what it offers as a taxonomical classification or as an information resource (a source of derived and encoded knowledge about the collection) to bring together richer information objects. These descriptions are just another set of “texts” to be mined. In real terms that might mean attempting to use museum collection data (as a mineable resource) to power the connections between museum stories. Maybe it is possible to turn multiple sets of museum collection data (as a common pool of information to generate tags from) into something like what Calais offers, and you all run with that?
I recall the audible “horror” expressed by the audience of digital librarians at the VALA conference in 2010 when Luke Dearnley talked about the use of this resource to provide intellectual access to the Powerhouse Museum’s online collection search. I could have predicted this reaction, but knew oh so well the value of thinking laterally. I spent years as a research librarian trying (with considerable effort) to retrieve useful information for researchers where the information was not neatly aggregated in specialised resources; it was in other resources, and it certainly wasn’t obviously coded with domain-specific information I could use to whip it out. I mined whatever information resource I could get my hands on to rip out information on Australian design history, using whatever search terms I deemed were necessary to shake useful information out of any related boxes of information [read databases]. Maybe you can turn museum collection data into a resource and then use it to connect museum stories that are more human readable and interesting in narrative form, such as an exhibition label or essay, which have a more user-friendly and common form.
But, I digress. To come back to my main point, unionising and shimmying data into a common ontology isn’t for wimps (well it’s making my eyes water at the moment). We’re going to do this with all the scholarly cultural datasets in the HuNI data (looking at CIDOC-CRM and FRBR-OO) and that takes some doing (and trust), plus we will be looking at enabling data-out in diverse schema, which I think will be a more important interpretive process. Do you want your people and organisation data in FOAF or EAC-CPF? But squashing museum data into a shared schema means either being fairly blunt (I’m looking at you Dublin Core) or shoe-horning (I’m thinking MARC here and the use of it for archival description by comparison with using EAD). If people want exactitude for data aggregation and searching, then they share description standards and the implementation of them for specified services etc. If people want to encode what they have in their own way, ideally you take useful learning from others (if you share the same meaning of something) but you also describe in ways that seem meaningful and useful to you. I think this is where we all get horribly blinded and confused about the role of information standards (description). Use description standards where they are meaningful, necessary and useful; where they aren’t, make your own. I won’t mention museum information management, which seems to be in the too-hard basket for most, but I still think it’s a critical house-keeping job to do.
In the end all that I can offer you for your troubles is to share my latest discovery with you. Why not have a read of Discursive Navigation of Online News by Oliver et al and think about museum collection data as “text”? See the next generation test cases for enabling rich intellectual access to cultural heritage potentially coming out of all that museum collection data. I’d put $10 on that. I’ll send you chocolate fish if I’m wrong.


[…] Museum Datasets (Seb Chan) — collections metadata aren’t generally in good quality (often materials are indexed at the “box level”, ie this item number is a BOX and it contains photos of these things), and aren’t all that useful. The story about the Parisian balcony grille is an excellent reminder that the institution’s collections aren’t a be-all and end-all for researchers. […]

Hi Seb
As a curator I’ve been upgrading content for the museum collections almost exclusively over the last 5 years and I agree archives have a greater degree of granularity in the way they part number and sub-part number objects.

But they – and libraries – have a basic philosophy which has centered around findability rather than storytelling. Their granularity often makes them seem more attractive as datasets for the technologists (large numbers always look better on those KPIs) but from my experience these same institutions have historically created workflows which do not drive the development of content which captures the story part of their collections, significance, production, etc.

Although this is changing, and libraries and archives are starting to develop more story-rich content, they are often building this on top of a system whose roots as a service rather than an interpreter go way back in time (thinking here about the role of libraries in universities), rather than changing the underlying workflows of the system.
Unfortunately this does not mean museums are faring much better when it comes to data-depth. While museums have far more opportunities to develop this content they have created workflows which seem to bypass the ingestion of the stories they invest so much of their resources to create. I have come across many, many examples where poor data relating to subjects and objects in the database is in stark contrast to the amount of information the museum has created external to the database for these same objects when researching, writing, photographing, and editing for publications and exhibitions.
At the moment I am going through the last 3 years of blogs posted by curators, conservators and registrars and moving them from WordPress back into our KE EMu database. This involves attaching the objects back with their stories and allowing them to be harvested to be online and linked to the rest of the collection. Why? Because when we started the blog our focus was on what people create, and where to place it, rather than how people create and how to build this into capturing it for the database.
I think the Museum’s philosophical attitude is at the core of these workflow issues and, as I have said elsewhere, this needs to change, and changing it is as important as developing new data systems or a new exhibition, or delivering content to the next new social media platform.

Museums, as a part of their mandate, are here for the long haul, and we need to be thinking about this when we create and capture our content-rich collection data as much as we do when physically storing and conserving our collection objects.
