Categories
Digitisation Web 2.0

How to do low cost transcription of hand written and difficult documents

So your museum has already done the easy part of digitisation – taking digital photos of your objects – but now you have complex hand-written materials you need to digitise . . . what can you do?

This is a question that has popped up in several meetings over recent months.

Our Curator of Information Technology, Matthew Connell, came up with a brilliantly simple solution – and there is no need for the original material to leave your organisation.

With the low cost of MP3 recorders it is now very easy to record large amounts of audio into a single, already compressed file. Take one of these MP3 recorders and ask the expert who is familiar with the document or material requiring digitisation to read the document clearly into the recorder. This may be done over an extended period of time – there is no need to do it all in one go.

When completed, upload the MP3 of clearly spoken audio to a web server. Then use one of several online audio transcription services to transcribe the audio. We have been using such services to get quick, low cost transcriptions of public lectures and podcasts, and have been impressed with their timeliness and accuracy.

Even factoring in the cost of reading time, this will almost certainly be cheaper and more error free than scanning and transcribing directly from the written original. It also provides significantly more flexibility in terms of pricing as there is a high level of competitiveness amongst audio transcription services at the moment – a level of competition that may not exist amongst specialist written services.

Categories
Conceptual Digitisation

Filtering memory – SEO, newspaper archives, museum collections

When Bad News Follows You in the New York Times (via Nick Carr) is a fascinating article about what can happen when ‘everything’ is put online.

The article looks at the new array of problems that have come about as a by-product of the NYT optimising their site and archives for Google with SEO techniques. Suddenly stories that were either of minor significance, or were corrected in later editions, are appearing toward the top of Google searches for names, places and events.

Most people who complain want the articles removed from the archive.

Until recently, The Times’s response has always been the same: There’s nothing we can do. Removing anything from the historical record would be, in the words of Craig Whitney, the assistant managing editor in charge of maintaining Times standards, “like airbrushing Trotsky out of the Kremlin picture.”

Whitney and other editors say they recognize that because the Internet has opened to the world material once available only from microfilm or musty clippings in the newspaper’s library, they have a new obligation to minimize harm.

But what can they do? The choices all seem fraught with pitfalls. You can’t accept someone’s word that an old article was wrong. What if that person who was charged with abusing a child really was guilty? Re-report every story challenged by someone? Impossible, said Jonathan Landman, the deputy managing editor in charge of the newsroom’s online operation: there’d be time for nothing else.

(snip)

Viktor Mayer-Schönberger, an associate professor of public policy at Harvard’s John F. Kennedy School of Government, has a different answer to the problem: He thinks newspapers, including The Times, should program their archives to “forget” some information, just as humans do. Through the ages, humans have generally remembered the important stuff and forgotten the trivial, he said. The computer age has turned that upside down. Now, everything lasts forever, whether it is insignificant or important, ancient or recent, complete or overtaken by events.

Following Mayer-Schönberger’s logic, The Times could program some items, like news briefs, which generate a surprising number of the complaints, to expire, at least for wide public access, in a relatively short time. Articles of larger significance could be assigned longer lives, or last forever.

Mayer-Schönberger said his proposal is no different from what The Times used to do when it culled its clipping files of old items that no longer seemed useful. But what if something was thrown away that later turned out to be important? Meyer Berger, a legendary Times reporter, complained in the 1940s that files of Victorian-era murder cases had been tossed.

“That’s a risk you run,” Mayer-Schönberger said. “But we’ve dealt with that risk for eons.”

There are interesting parallels with our experience in making our online collection more usable and accessible. Public enquiries have skyrocketed and now range from the scholarly to the trivial – the greatest increase being in the latter category. Whilst a significant amount of extremely valuable object-related information is sent in by members of the public, there are also false leads and material that cannot be adequately verified, and more still that the Museum already knows but has not yet made available online. Managing public expectations and internal workflow is a difficult balancing act, and a continuing challenge for the many museums that not only put their collections online but also make them highly accessible.

Categories
Imaging

Content aware image resizing from Siggraph 07

A common bugbear encountered when working with diverse collections and images is the inability to gracefully create resized versions. We have never found a suitable solution to creating thumbnails of our collection for the OPAC and Design Hub – the current solution is to take the existing large image, resize it to be 500 pixels on the longest side, then take a square from the middle 400 pixels, and resize the square to 80×80 pixels, including any white space borders. This is run as a batch process. Whilst this works for most rectangular images it still has the unintended side effect of lopping off heads and feet, and on rare irregular shapes such as very long artworks, the thumbnail is virtually useless even for quick object recognition tasks.
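To see why heads and feet go missing, it helps to work through the arithmetic. The following is only a rough sketch of the geometry, not our actual batch script: the function name `thumbnail_geometry` and its return fields are made up for illustration, and it assumes "the middle 400 pixels" means a 400×400 square centred on the resized image, with anything outside the image filled with white.

```python
def thumbnail_geometry(width, height, longest=500, crop=400, thumb=80):
    """Work out what the resize / centre-crop / shrink steps do to an image.

    Assumption: the crop is a crop x crop square centred on the resized
    image, and any part of the square outside the image is white padding.
    """
    longest_side = max(width, height)
    # Step 1: resize so the longest side is `longest` pixels.
    scaled_w = round(width * longest / longest_side)
    scaled_h = round(height * longest / longest_side)

    # Step 2: centre a square crop box on the resized image.
    # Negative coordinates mean the box extends past the image edge.
    left = (scaled_w - crop) // 2
    top = (scaled_h - crop) // 2
    crop_box = (left, top, left + crop, top + crop)

    # Pixels of the resized image that fall outside the crop box.
    lost_pixels = (max(scaled_w - crop, 0), max(scaled_h - crop, 0))

    # Step 3: shrink the square to thumb x thumb and see how much of it
    # is actual image rather than white border.
    visible_w = round(min(scaled_w, crop) * thumb / crop)
    visible_h = round(min(scaled_h, crop) * thumb / crop)

    return {
        "scaled_size": (scaled_w, scaled_h),
        "crop_box": crop_box,
        "lost_pixels": lost_pixels,
        "visible_in_thumb": (visible_w, visible_h),
    }


# A tall portrait loses 50 pixels from the top and 50 from the bottom.
portrait = thumbnail_geometry(1000, 2000)
# A very long artwork ends up as a thin strip inside the 80x80 square.
panorama = thumbnail_geometry(5000, 500)
```

Under these assumptions the portrait keeps its full width but loses its top and bottom, and the long artwork occupies only a 10 pixel high band of the finished thumbnail.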

Here are some examples –

But here, in a presentation from Siggraph 07, is a fascinating potential solution. It is quite amazing, and reducing or expanding an image based on ‘content’ has very interesting implications for intellectual property legislation. In many ways it does what MP3 compression does (poorly) for audio, intelligently removing the bits of the image that are least recognised by the viewer. In so doing it makes assumptions about the overall image – and how we ‘see’ images.
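For the curious, here is a toy illustration of how this kind of resizing can work. One published content-aware technique is seam carving, and this sketch assumes, without confirming, that the presentation works along similar lines. It treats a greyscale image as a list of rows, scores each pixel by how much it differs from its neighbours, and removes the connected top-to-bottom path of pixels with the lowest total score. All function names are illustrative only.

```python
def energy(gray):
    """Score each pixel by the brightness change across its neighbours."""
    h, w = len(gray), len(gray[0])
    scores = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
            dy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
            scores[y][x] = abs(dx) + abs(dy)
    return scores


def lowest_energy_seam(gray):
    """Return one x position per row forming the least noticeable vertical path."""
    scores = energy(gray)
    h, w = len(scores), len(scores[0])
    # cost[y][x] = cheapest total score of any path from the top row to (x, y),
    # where each step moves down one row and at most one column sideways.
    cost = [row[:] for row in scores]
    for y in range(1, h):
        for x in range(w):
            cost[y][x] += min(cost[y - 1][max(x - 1, 0):min(x + 2, w)])
    # Walk back up from the cheapest pixel in the bottom row.
    x = min(range(w), key=lambda i: cost[h - 1][i])
    seam = [x]
    for y in range(h - 1, 0, -1):
        x = min(range(max(x - 1, 0), min(x + 2, w)), key=lambda i: cost[y - 1][i])
        seam.append(x)
    seam.reverse()
    return seam


def remove_seam(gray, seam):
    """Drop one pixel from each row, making the image one pixel narrower."""
    return [row[:x] + row[x + 1:] for row, x in zip(gray, seam)]


# A flat background with one bright vertical stripe.
image = [[0, 0, 0, 9, 0] for _ in range(3)]
seam = lowest_energy_seam(image)
narrower = remove_seam(image, seam)
```

Repeating the last two steps narrows the image one pixel at a time while tending to leave the busiest areas alone, which is the opposite of a blind centre crop.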

Categories
Interactive Media Web 2.0

The new Google Maps, Google Earth and Google Sky

Everyone is buzzing about the new features that have popped up with the easily embeddable Google Maps today. This is a big step towards making map mashups completely mainstream – increasing the popular acceptance of the map as a user interface.

For a look at how things might work for the museum and cultural sector take a look at this query. Scroll to the bottom and you will see a map showing all the places mentioned in the book, together with pop up page references! There’s obviously been a lot of parsing of OCRed text to pull out the place names but the result is pretty incredible.

Something a few have missed is the set of astronomy features now available in Google Earth, called Google Sky.

Download the new version of Google Earth and you will find a new toolbar icon that toggles between Earth and Sky. Once in Sky mode you can find galaxies, constellations and planets – all of which link to data from NASA and other sources including Hubble telescope pictures. It is very impressive and lots of fun.

Next task is to look into making KML files to accompany our monthly night sky guide podcasts at the Sydney Observatory . . .

Categories
Museum blogging Web 2.0

Blogs as a ‘community strategy’

New Matilda has a short but interesting piece by Kevin Anderson, blogs editor at The Guardian. In the article he stresses that blogging is about generating and engaging the community, not just a new means of publishing. Rather than see blogging as a threat to traditional publishing, it should be viewed as a new strategy for engaging audiences and readers.

This has strong resonances with experiences of museum blogging. Blogs aren’t replacing traditional forms of official communication, but they are engaging audiences in new and effective ways.

Neil McIntosh and Jack Schofield launched The Guardian’s first blog in 2001, realising it was better to be part of the conversation than listen to it from a lofty perch. The Guardian now has blogs covering everything from current affairs — on ‘Comment is Free’ — to sport, arts and culture, and most recently food and gardening.

But blogging is not a publishing strategy, it’s a community strategy. Being one of the world’s bloggiest newspapers has led to bloggers linking to our stories, helping us grow a grass-roots following in the United States, so that The Guardian now has more online visitors outside of the UK than inside.

One of The Guardian’s stated goals is to become the world’s leading liberal voice. And our website’s ‘Head of Communities and User Experience,’ Meg Pickard, has said that we also need to enable the world’s liberal voices.

The art of blogging is about building a community and coaxing people out from behind their keyboards.

Categories
Web 2.0 Wikis

Wikipedia, Wikiscanner, revealing the hidden power struggles over knowledge production

Last week featured a rather robust debate in the office about whether museums should encourage the use of Wikipedia and perhaps participate in adding and editing entries themselves. Now most Fresh + New readers will be familiar with the arguments – they’ve been around since Wikipedia began.

Of course what most anti-Wikipedians, if they don’t dismiss it outright, claim is that ‘Wikipedia is only as good as its last edit’. But to me that is missing the point. Wikis, and Wikipedia as an example of a wiki, are interesting because they reveal the history of edits, changes, revisions and re-versions. They reveal the collaborative and argumentative nature of knowledge production.

Well, almost as if to prove my point, along comes Virgil Griffith’s Wikiscanner which has gotten coverage in Wired and is struggling under the burden of the resultant high traffic load.

Wikiscanner basically matches the IP addresses of those doing edits with information about their network provider – known IP address ranges of government departments, corporations and the like. By doing this Wikiscanner is beginning to reveal the complex web of individuals, and increasingly, corporations that are using Wikipedia to argue and dispute versions of the ‘truth’. You can start to get an idea of the otherwise hidden agendas and power struggles over knowledge and information quite quickly . . . .
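The basic matching step is simple enough to sketch. This is not Wikiscanner's code, just a minimal illustration of the idea using Python's standard `ipaddress` module. The organisation names and address ranges below are invented (they are blocks reserved for documentation), and the function names are hypothetical.

```python
import ipaddress

# Hypothetical lookup table: organisation -> address ranges it is known to use.
# These are documentation-only ranges, not real allocations.
KNOWN_RANGES = {
    "Example Corp": ["192.0.2.0/24"],
    "Example Government Department": ["198.51.100.0/24", "203.0.113.0/25"],
}


def owner_of(ip, known_ranges=KNOWN_RANGES):
    """Return the organisation whose range contains this address, or None."""
    address = ipaddress.ip_address(ip)
    for owner, cidrs in known_ranges.items():
        for cidr in cidrs:
            if address in ipaddress.ip_network(cidr):
                return owner
    return None


def attribute_edits(edits, known_ranges=KNOWN_RANGES):
    """Group (ip, article) pairs from anonymous edits by matching organisation."""
    by_owner = {}
    for ip, article in edits:
        owner = owner_of(ip, known_ranges)
        if owner is not None:
            by_owner.setdefault(owner, []).append(article)
    return by_owner


# Made-up anonymous edits: (editor IP address, article title).
edits = [
    ("192.0.2.17", "Example Corp"),
    ("198.51.100.4", "Some policy"),
    ("203.0.113.200", "Unrelated article"),  # outside every known range
]
report = attribute_edits(edits)
```

The hard part in practice is building a trustworthy table of who owns which ranges; the lookup itself is the easy bit.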

Griffith says he launched the project hoping to find scandals, particularly at obvious targets such as companies like Halliburton. But there’s a more practical goal, too: By exposing the anonymous edits that companies such as drugs and big pharmaceutical companies make in entries that affect their businesses, it could help experts check up on the changes and make sure they’re accurate, he says.

Categories
Conceptual Web 2.0 Web metrics

Valuing different audiences differently – usability, threshold fear and audience segmentation

It is important to realise that to deliver more effective websites we need to move away from a one-size-fits-all approach not only when designing sites but also when evaluating and measuring their success. We know that some online projects are specifically intended to target specialist audiences – a site telling the histories of recent migrants might require translation tools, and a site aimed at teenagers might, by design, specifically discourage older and younger audiences in order to better attract teenage usage.

Remembering, too, that some key museum audiences (regional, remote, socially disadvantaged) may have no representation in online visit figures, and others may have limited and sporadic online interactions because of unequal internet access, it is important to look at the overall picture of museum service delivery. Some audiences cannot be effectively engaged online. Others still may only feel confident engaging in online conversations about the museum using non-museum services – as I’ve written before – on their own blogs, websites, and social media sites.

If we acknowledge ‘threshold fear’ in our physical institutions, then we need to realise this applies online as well. The difference is that in the online world there are many, many more, less ‘fearful’ options to which potential visitors and users can easily flee. The ‘back’ button is just a click away.

The measure of the ‘value’ of visitors therefore needs to differ across parts of the same website. We may need to form different measures for a user in the ‘visiting the museum’ part of the website to the ‘tell us your story’ section, even though in one visit they might explore both areas. Likewise, a museum visitor who blogs about their positive experience of a real world visit on their own family blog might be considered valuable. Or a regionally-oriented microsite that gets discussed on a specialist forum might be more valuable – to that particular project – than a posting on a more diffused national discussion list.

Visit-oriented parts of the website should be designed and created with known target audiences in mind, understanding that not everyone can visit the museum, and their success measured accordingly. It might be sensible to attempt to address ‘threshold fear’ by using images of the museum that are more people-oriented rather than object-oriented in order to promote the notion that the museum is explicitly a place for people.

When we were building our children’s website we specifically decided against creating a resource for ‘all’ children – that would have resulted in a too generic site – and targeted the pre- and post- visit needs of a known subset of visitors with children. We don’t actively exclude other visitors (other than through language choice, visual design, and bandwidth requirements), but we have actively attempted to better meet the needs of a subset of visitors. This subset will necessarily diversify over time, but we also understand that out on the internet there are plenty of other options for children.

The problem with traditional measurements is that every visitor to our online resources is homogenised into single figures – visits, time spent, pages viewed. Not only does this reduce the value of the web analytics, it does the visitor a great disservice. Instead, good analytics is about segmentation. This can be segmentation based on task completion and conversions, and understanding visit intentions.

So who is a ‘valuable’ visitor?

It depends on context.

For our children’s site we place a greater internal value on those who complete one of two main site conversions – first, those who spend a particular amount of time on the visit information areas; and second, those who browse, find and, most critically, download an offsite activity. Focussing in on these subsets of users allows us to implement evaluation and tracking. For those who complete the visit-related tasks we might offer discount coupons for visiting and track virtual to real-world conversions. What proportion of online visitors who look at visit information actually convert their online interest to a real world action? And in what time frame (today, this week, this month)? Of the second group we may conduct an evaluation of downloader satisfaction – did they make the craft activity they downloaded? Was it too hard, too easy? Did they enjoy the experience?

What of the others who visit the children’s site? They are a potential audience who have shown an interest but for many reasons haven’t ‘converted’ their online visit. We can segment this group by geography and origin – drill down deeper and really begin to examine the potential for them to ever ‘convert’.
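As a rough sketch of what this segmentation looks like in practice, the snippet below sorts visits into the two conversion groups and the remaining "not yet converted" group, then reports each group's share. The field names, the 60 second threshold and the sample data are all invented for illustration and are not taken from our analytics setup.

```python
from collections import Counter

# Hypothetical threshold: how long on the visit information pages counts
# as a "visit planning" conversion.
VISIT_INFO_SECONDS = 60


def segment(visit):
    """Assign one visit (a dict of made-up fields) to a segment."""
    if visit.get("downloaded_activity"):
        return "activity_download"
    if visit.get("seconds_on_visit_info", 0) >= VISIT_INFO_SECONDS:
        return "visit_planning"
    return "not_converted"


def segment_report(visits):
    """Return {segment: (count, share of all visits)}."""
    counts = Counter(segment(v) for v in visits)
    total = sum(counts.values())
    return {name: (n, n / total) for name, n in counts.items()}


# Made-up sample of four visits to the children's site.
visits = [
    {"downloaded_activity": True},
    {"seconds_on_visit_info": 90},
    {"seconds_on_visit_info": 10},
    {},
]
report = segment_report(visits)
```

Each segment can then be tracked and evaluated on its own terms, rather than being folded back into a single visits figure.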

In other parts of our website – say our SoundHouse VectorLab pages – we may see as valuable those users who simply use and link back to our ‘tip of the day’ resources. Despite these pages being primarily an advertisement for onsite courses run in the teaching labs, we do see a great value in having our ‘tip of the day’ resources widely read, the RSS feed subscribed to, and articles linked back to. However this has to be a secondary objective to actually taking online bookings for courses.

Postscript – I’d also suggest reading the 2004 Demos report ‘Capturing Cultural Value’ for some important philosophical and practical caveats.

Categories
Collection databases Web 2.0

OPAC2.0 – Latest features update

We’ve added a whole range of new features to our OPAC that we think further enhance its usability.

Tooltips

Each ‘feature’ on the search results and object view pages now has an explanatory tooltip. Given the OPAC has become quite complex and there is a lot going on on the screen now, we felt CSS tooltips offered a more practical solution than a ‘help’ screen or more text in the form of user documentation. More tooltips will be added this week to explain museum-centric language like ‘statement of significance’.

Failed search suggestions

Now when a search term is misspelled or returns no results our system generates a series of possible ‘alternatives’. This is generated on the fly using a calculation called Levenshtein distance. This cycles through each letter of the misspelt word and then queries our table of successful searches for possible matches. These are then ranked and the top 8 variants are presented to the user. In order to make this reasonably quick we have had to rebuild quite a bit of our search technology.
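For anyone wanting to try something similar, here is a minimal sketch of the idea rather than our production code. `levenshtein` is the standard edit distance calculation; `suggest` and its `max_distance` cut-off are illustrative assumptions, and the list of successful searches is made up. Only the limit of 8 suggestions comes from the description above.

```python
def levenshtein(a, b):
    """Count the single-letter insertions, deletions or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1,      # delete char_a
                               current[j - 1] + 1,   # insert char_b
                               substitution))
        previous = current
    return previous[-1]


def suggest(term, successful_searches, limit=8, max_distance=2):
    """Rank previously successful searches by closeness to a failed term."""
    term = term.lower()
    scored = []
    for candidate in successful_searches:
        distance = levenshtein(term, candidate.lower())
        # Skip exact matches and anything too different to be a likely typo.
        if 0 < distance <= max_distance:
            scored.append((distance, candidate))
    scored.sort()  # closest first, ties broken alphabetically
    return [candidate for _, candidate in scored[:limit]]


# Made-up table of searches that have returned results before.
successful = ["telescope", "telephone", "teapot", "microscope"]
alternatives = suggest("telescpe", successful)
```

Comparing the failed term against every stored search is slow on a large table, which is presumably why a real implementation needs more care to stay quick.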

Opensearch RSS with thumbnails

About two months ago our Opensearch feed was updated to include thumbnails in search results. We added the thumbnails to ensure that our feed delivered optimal results to the National Library of Australia’s Libraries Australia search. We also use this modified RSS to drive search results of Design Hub.

Categories
Conceptual Interactive Media Museum blogging Web metrics

Authority in social media – Why We Twitter: Understanding Microblogging Usage and Communities

From Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng comes an interesting academic paper titled Why We Twitter: Understanding Microblogging Usage and Communities.

Following my recent post looking at diffused brand identity in social media, this paper is a useful examination of the emergent ‘authority’ and ‘connectedness’ of users amongst a dataset of 75,000 users and 1.3 million ‘posts’.

Twitter is something that I’ve seen limited potential for in most museum applications so far, but increasingly Twitter-style communication is replacing email – see the frequent updates that your friends do on Facebook’s ‘what I am doing/feeling now’ mood monitor for example.

Abstract:

Microblogging is a new form of communication in which users can describe their current status in short posts distributed by instant messages, mobile phones, email or the Web. Twitter, a popular microblogging tool has seen a lot of growth since it launched in October, 2006. In this paper, we present our observations of the microblogging phenomena by studying the topological and geographical properties of Twitter’s social network. We find that people use microblogging to talk about their daily activities and to seek or share information. Finally, we analyze the user intentions associated at a community level and show how users with similar intentions connect with each other.

Categories
Social networking Web 2.0 Web metrics

Social media measurement – brand awareness and trust in the cultural sector

There has been a flurry of activity amongst web analytics companies and in the marketing world to devise complex ways of measuring social media activity. As much of this interest in devising a way of measuring and comparing social media ‘success’ comes down to monetising social media activity through the sale of advertising, these measures don’t easily translate to the cultural sector. Advertisers are after a ‘ratings’ system to compare the different ‘value’ of websites but as we know from old media (TV and radio), ratings don’t work well for public and community broadcasters who don’t sell advertising and have other charters and social obligations to meet.

We know that visits, page views and time spent aren’t the best ways of understanding our audiences or their levels of engagement with our content, and with social media it is all about engagement. If we aren’t selling advertising space to all those eyeballs focussing their attention on our rich and engaging content, then what are we trying to do?

I’d argue that it is about brand awareness. Not just brand awareness in terms of being top of mind when geographically close audiences are thinking of a cultural activity to do in their leisure time, but about linking the perceived authenticity of the information contained on your website to your brand. More and more there is ongoing research into how museums are perceived as ‘trusted’ information sources, and importantly politically impartial sources. But this perception relies upon an awareness on the part of the online visitor that they are indeed on a museum website.

This user awareness is, I argue, not a given, especially now that such a large proportion of our online traffic comes via search. Looking into the future, search will be an even greater determinant of traffic, even if your real-world marketing prominently displays your URL (as it should be doing by now!). Looking at your real world marketing campaigns around your URL you will probably find a spike in direct traffic but a similarly sized spike in brand name searches – we are finding this with the Sydney Design festival at the moment. The whole of Sydney is covered with street advertising from bus shelter posters to street banners, all promoting the URL. The resulting traffic is a mix of direct and brand name search based.

The problem now is that the brand is no longer represented online only on our own websites.

One of the first things I talk about in my workshops and presentations is that even if your organisation is not producing social media about yourself, then your audiences almost certainly are. If you aren’t aware of what your audiences are saying about you, what they are taking photos of, or recording on their camera phones, then you are missing a unique opportunity to understand this generally highly engaged tip of your audience.

It is possible those who blog about their experience in your organisation, upload their photos and videos, are going to be those who are potentially your most (commercially) ‘valuable’ customers – high disposable income, high levels of interest and a desire to participate and communicate/advocate to others about your organisation.

They are probably the most likely to climb the ‘ladder of engagement’ from potential visitors through regular visitors to members and finally donors/sponsors. They may not always have positive things to say, but by hearing their gripes and grizzles, you are able to understand and address issues that impact how your organisation is going to be promoted through word-of-mouth. And word-of-mouth is going to almost always be the most ‘trusted’ type of marketing recommendation.

So how do we track these conversations that occur publicly but not on your organisation’s website?

Mia Ridge recently pointed to a great summary of the easiest to use ‘ego search’ tools and methods by which you can easily keep track of your audience conversations. Another favourite of mine for small scale tracking is EgoSurf.

Sixty Second View has compiled an ‘index’ of how these kinds of ego search results might be compiled to generate a figure to compare with competitors and other organisations. Their methodology, whilst very complex, focusses on assessing how connected the people who are talking about you actually are – this allows for a determination of effective reach, and the trust that may be attributed to those in the conversation.

(top level summary mine only)

a) Blogs that are talking about you – what are their Technorati rankings, how high are their Google PageRanks, how many BlogLines subscribers do they have etc

b) Multi-Format conversations – how popular/connected are the Facebook and MySpace people who are talking about your organisation

c) Mini-Updates – frequency and reach of Twitters

d) Business Cards – LinkedIn connectedness

e) Visual – Flickr influence and popularity can be used to determine how connected and visible posters’ images of your organisation are. This can be applied to YouTube as well.

f) Favourites – Digg, Del.icio.us connectedness

This approach is useful as it provides a detailed analysis of the spectrum of social media that your organisation is probably already represented in. It can reveal areas where your users aren’t talking about you, and it can illuminate areas of your own site that receive unexpected user attention. Not only that, it focuses on who is talking about you. On the downside, it is a lot of work – but undertaking even a cut down version of this methodology will force you to examine the different impacts of types of social media.

For example, are all blog posts about your organisation equal? When you check the Technorati rankings of the commenting blogs you will find that some have greater reach and authority than others. The real world equivalent here is the different weightings your marketing team probably already gives to print media mentions in national broadsheets versus local weeklies; or the difference between a TV editorial and a local radio mention.

Is this really the job of the web team?

Unless your organisation has a marketing team that is expert in online marketing then the answer must be yes. Web analytics in five years’ time will be all about measuring offsite activity.