Categories
Developer tools Digitisation Web 2.0

Stop spam and help correct OCR errors – at the same time!

reCAPTCHA is a nifty project that uses the now familiar ‘CAPTCHA’ web-form spam-prevention technique to help fix OCR errors in global digitisation projects.

Currently this great example of socially responsible crowdsourcing is helping fix digitisation errors and inconsistencies in books scanned for the Internet Archive – books that will be reproduced in the developing world through projects like the Million Book Project.

If you are considering a CAPTCHA tool for your website or blog (or already use one), you might consider swapping over to reCAPTCHA so that your users, when submitting comments, aren’t just keeping your site free of spam – they are also helping fix digitisation errors for others.

There are downloadable plugins for WordPress, MediaWiki and phpBB, as well as a general PHP class and a range of APIs, so it is easy to implement in your own projects.

Here’s the project blurb –

About 60 million CAPTCHAs are solved by humans around the world every day. In each case, roughly ten seconds of human time are being spent. Individually, that’s not a lot of time, but in aggregate these little puzzles consume more than 150,000 hours of work each day. What if we could make positive use of this human effort? reCAPTCHA does exactly that by channeling the effort spent solving CAPTCHAs online into “reading” books.

To archive human knowledge and to make information more accessible to the world, multiple projects are currently digitizing physical books that were written before the computer age. The book pages are being photographically scanned, and then, to make them searchable, transformed into text using “Optical Character Recognition” (OCR). The transformation into text is useful because scanning a book produces images, which are difficult to store on small devices, expensive to download, and cannot be searched. The problem is that OCR is not perfect.

reCAPTCHA improves the process of digitizing books by sending words that cannot be read by computers to the Web in the form of CAPTCHAs for humans to decipher. More specifically, each word that cannot be read correctly by OCR is placed on an image and used as a CAPTCHA. This is possible because most OCR programs alert you when a word cannot be read correctly.

But if a computer can’t read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here’s how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
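To make that last step concrete, here is a minimal Python sketch of the known-word/unknown-word logic described above. The function names, data structures and agreement threshold are my own illustrative assumptions; this is not code from the reCAPTCHA service or its APIs.

```python
# A minimal sketch of the two-word verification logic described above.
# Names, the agreement threshold and data structures are illustrative
# assumptions, not part of the actual reCAPTCHA service.

AGREEMENT_THRESHOLD = 3  # assumed number of matching answers before a word is trusted


def check_answer(known_answer, unknown_word, votes, user_input):
    """Verify the user against the known (control) word and, if they pass,
    record their reading of the unknown (OCR-failed) word."""
    try:
        guess_known, guess_unknown = user_input.split()
    except ValueError:
        return False  # expected exactly two words

    # The user must get the control word right to pass the CAPTCHA.
    if guess_known.lower() != known_answer.lower():
        return False

    # Their reading of the unknown word is kept as a plausible answer.
    votes.setdefault(unknown_word, {})
    votes[unknown_word][guess_unknown] = votes[unknown_word].get(guess_unknown, 0) + 1
    return True


def resolved_text(unknown_word, votes):
    """Return the deciphered word once enough users agree, else None."""
    counts = votes.get(unknown_word, {})
    if counts:
        best, n = max(counts.items(), key=lambda kv: kv[1])
        if n >= AGREEMENT_THRESHOLD:
            return best
    return None
```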

Categories
Copyright/OCL Digitisation MW2007 Web 2.0

M&W07 – Day two: Brewster Kahle

Museums & the Web is very big this year. There must be nearly 1000 people here and there is a good buzz in between sessions.

Today opened with an entertaining and motivational plenary from Brewster Kahle, founder of the Internet Archive. Kahle talked about the Internet Archive, discussing the various types of media it is digitising and making openly accessible, for free, using open standards. The big stumbling block is rights.

Starting with books, he gave some interesting figures on digitisation costs. The archive is scanning 12,000 books per month across three locations (USA, Canada and the UK). It costs about $0.10 per page to do the scanning, OCR, PDF creation, file transfer and permanent storage (forever). Distribution problems are being solved by print on demand, which costs as little as $0.01 per page and is being rolled out through mobile digital book buses in Uganda, India and China. Kahle handed around some samples of the print-on-demand titles and they were of acceptable quality, with proper covers. He also handed around one of the 300 prototype $100 laptops from MIT, which was pretty cool, with a great hi-res screen that makes the concept of a low-cost, developing-world-friendly e-book reader viable.
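For a rough sense of what those per-page figures add up to, here is a quick back-of-the-envelope calculation in Python. The average page count per book is my own assumption, not a figure Kahle quoted.

```python
# Back-of-the-envelope sums from the figures Kahle quoted. The average
# page count per book is my own assumption, for illustration only.

books_per_month = 12_000       # across the three scanning centres
cost_per_page_scan = 0.10      # USD: scan, OCR, PDF, transfer, storage
cost_per_page_print = 0.01     # USD: print on demand
avg_pages_per_book = 300       # assumed, not from the talk

scan_cost_per_book = avg_pages_per_book * cost_per_page_scan      # ~$30
monthly_scan_cost = books_per_month * scan_cost_per_book          # ~$360,000
print_cost_per_book = avg_pages_per_book * cost_per_page_print    # ~$3

print(f"Scanning a ~{avg_pages_per_book}-page book: about ${scan_cost_per_book:.0f}")
print(f"12,000 books per month: about ${monthly_scan_cost:,.0f}")
print(f"Printing the same book on demand: about ${print_cost_per_book:.0f}")
```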

Audio recordings are costing $10 per CD or roughly $10 per hour of recording. Internet Archive will host forever, and for free. Video recordings are slightly more at $15 per hour. They have also been recording broadcast television, 20 channels worldwide, 24/7. Only one week is available online so far – that of 9/11. They have also started on software archiving but are stymied by the DMCA.

The Wayback Machine (web archive) is snapshotting every two months at 100 terabytes of storage per snapshot. Interestingly, he quoted that the average webpage changes or is deleted every 100 days, making regular archiving critical.

Kahle emphasised the importance of public institutions doing digitisation in open formats rather than the exclusivity of Google Books deals. His catch-all warning for museums, “public or perish”, was a great start to the conference.

Categories
Collection databases Digitisation Web 2.0 Young people & museums

Dempsey on ‘getting with the flow’, Morville on ‘findability’

OCLC’s Lorcan Dempsey’s idea of libraries “getting with the flow” (from 2005) is something that has resonated well beyond the library world.

The importance of flow underlines recurrent themes:

– the library needs to be in the user environment and not expect the user to find their way to the library environment

– integration of library resources should not be seen as an end in itself but as a means to better integration with the user environment, with workflow.

Increasingly, the user environment will be organized around various workflows. In fact, in a growing number of cases, a workflow application may be the consumer of library services.

As evidenced in the discussions by Holly Witchey at Musematic, who has been covering the IMLS WebWise conference with regular session reports, and in the follow-up commentary from Guenter Waibel of RLG, libraries are at a far pointier end of changes in customer/user behaviour than most museums. Waibel raises the very hefty 290-page OCLC report titled Perceptions, in which the survey suggests 84% of general users begin an information search with a search engine, and only 1% with a library website (PDF page 35/1-17). If conducted again now I would expect Wikipedia to rate highly.

Libraries are seen as more trustworthy/credible and as providing more accurate information than search engines. Search engines are seen as more reliable, cost-effective, easy to use, convenient and fast. (PDF page 70/2-18)

Where are museums in this? Is your content in the “flow”? Do users need to come to your site and use your onsite search to be able to find it? If so, they are probably going to look elsewhere first, if they haven’t already.

Over at the University of Minnesota they have just held the CLC Library Conference titled “Getting In The Flow”, with Dempsey as one of the speakers. There are some great summaries of the presentations, including slides, over on their conference blog.

Alongside Dempsey, another of the speakers was Peter Morville, whom some readers may remember from his first O’Reilly book Information Architecture for the World Wide Web, or the less technically oriented Ambient Findability (which has been doing the rounds of the office for the past 9 months).

Morville’s presentation slides are an excellent introduction to his work and, given their tweaking for the library/information-seeking context, are very useful for those in museums too. Ellysa Cahoy has also posted notes taken during the presentation on the CLC blog, which help with the slides that aren’t immediately self-explanatory.

Categories
AV Related Digitisation Web 2.0

Testing podcast transcription – Casting Words

Audio transcription is an essential part of digitisation. Our curatorial researchers are recording thousands of hours of interviews with subjects onto a mix of analogue (tape) and digital (MP3/WAV) media. These oral histories are filed away for preservation purposes but will remain almost unusable in any serious way until they are digitised – that is, transcribed into a searchable, machine-readable format.

Likewise, we record many events at the museum and in the last few years have begun offering them as podcasts on our websites.

Last week we tried out a service called Casting Words. Casting Words is a transcription service that offers to send back a transcription of any podcast or audio file, quickly and cheaply.

Generally transcribing podcasts, especially those of live talks and events, has been an arduous task, one that even with the best of intentions often doesn’t happen. Transcription has tended to be expensive and time consuming. It has also been typically inaccurate.

Yet without a transcription the contents of the podcast are rendered invisible to all but the most dedicated internet user – the one who already knows of the podcast’s existence. This is because a transcript not only serves the interests of vision-impaired users and those wanting to skim-read before downloading, it also exposes the content of the podcast to search engines, thus aiding discoverability.

Here are the results from our test of Casting Words.

TEST 1 – The Sydney Observatory February 2007 night sky guide – this recording runs for about 14 minutes and has one speaker talking throughout. Whilst not explicitly technical, and aimed at a general audience, it is about constellations and uses common astronomical terms. It is recorded in a quiet room with no background noise and is edited in post-production to a script.

Original MP3 – listen at the Sydney Observatory blog
Transcript – view online
Time taken to transcribe – 24 hours (from submission to receipt of finished product)
Cost of transcription – US$10.50

TEST 2 – The live recording of a D-Factory public talk titled “Pop-ups, fold-outs and other design adventures” – this recording runs for 61 minutes and consists of four individually microphoned speakers recorded into a single stationary video camera via a live mixing-desk feed. The same audio feed is used as the signal to the PA system in the room. As a result of the room acoustics, mixdown and speaker behaviour, each speaker’s voice is inconsistently recorded and the recording itself fluctuates in volume. The talk is set up in the manner of a studio interview – a little like a TV chat show – it is totally unscripted and the recordings have no post-production editing. There is background noise present throughout.

Original MP3 – listen or view video online (3rd item, right column)
Transcript – read online at Design Hub
Time taken to transcribe – 48 hours
Cost of transcription – US$45.50

How accurate are the transcriptions?

Other than the American spellings, the transcripts are very accurate. In the Sydney Observatory transcript there was one numerical error, which has been corrected. The D-Factory transcript is a little more difficult to check but there do not seem to be any significant errors – which, given the original recording quality, is surprising. There is one instance where the transcriber has noted that all the speakers were talking at once and thus no transcript was available for those few seconds.

How does Casting Words work and why is it so cheap?

Casting Words uses Amazon’s Mechanical Turk to divide complex work into small tasks, which are advertised for freelancers (known as ‘turkers’) to perform – anywhere, anytime in the world. There are some tasks that humans perform better than machines, and Amazon’s Mechanical Turk uses its machines to allocate these human tasks more efficiently. The name ‘Mechanical Turk’ comes from an (in)famous chess-playing automaton hoax by the Hungarian baron Wolfgang von Kempelen in the late 18th century.

Turkers who undertake Casting Words transcription tasks are not unqualified. Each has to complete a small qualification task, and their rate of payment depends upon their qualification level. Also, each transcription is edited and checked as separate tasks. There seem to be about 10,000 qualified transcribers and 3,500 qualified editors.

Salon.com did a report on Mechanical Turk in the middle of last year which interviewed the founders of Casting Words, who explain how it works.

With a little code, plus the turkers, it has succeeded in basically automating the process. The company charges its customers from 42 cents a minute for podcast transcription to 75 cents a minute for other audio. CastingWords pays Mechanical Turk workers as little as 19 cents a minute for transcription. If a transcription job is posted on Mechanical Turk for a couple of hours at the rate of 19 cents a minute, and no worker has taken on the project, the software simply assumes the price is too low and starts raising it.

After a transcription assignment is accepted by a worker, and completed, it goes back out on Mturk.com for quality assurance, where another worker is paid a few cents to verify that it’s a faithful transcript of the audio. Then, the transcript goes back on Mturk.com a third time for editing, and even a fourth time for a quality assurance check. “It’s been terribly useful for us,” says Nathan McFarland of Seattle, one of the co-founders of CastingWords. Transcription is the type of relatively steady task that keeps turkers with good ears who are fast typists coming back. “There are people who have been with us for months, and they’re not leaving,” says McFarland.
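Out of curiosity, here is a minimal Python sketch of the pricing and multi-pass workflow the article describes. The function names, price increment and waiting period are my own assumptions; this is neither CastingWords’ actual code nor the Mechanical Turk API.

```python
# A rough sketch of the pricing/workflow logic the Salon article describes.
# Function names, the bump increment and the waiting period are illustrative
# assumptions; this is not CastingWords' code or the Mechanical Turk API.

import time

START_RATE = 0.19        # USD per audio minute, as quoted in the article
BUMP = 0.02              # assumed increment when nobody accepts the job
WAIT_SECONDS = 2 * 3600  # "a couple of hours" before raising the price


def post_until_accepted(post_hit, hit_accepted, audio_minutes):
    """Post a transcription task and keep raising the per-minute rate
    until some turker accepts it."""
    rate = START_RATE
    while True:
        hit_id = post_hit(reward=round(rate * audio_minutes, 2))
        time.sleep(WAIT_SECONDS)
        if hit_accepted(hit_id):
            return hit_id, rate
        rate += BUMP  # assume the price was too low and try again


# The multi-pass pipeline the article describes: transcription, a paid
# verification pass, an editing pass, then a final quality-assurance check.
PIPELINE = ["transcribe", "verify", "edit", "final_qa"]
```

For what it’s worth, both of our tests came out at roughly the quoted 75 cents per minute for ‘other audio’: US$10.50 over 14 minutes is US$0.75 per minute, and US$45.50 over 61 minutes is about US$0.75 per minute.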

The article is essential reading as it also explores the criticisms of Mechanical Turk – the nature of labour allocated under this system, the pay rates and worker agreements, and the question raised by many people who do the work: do they actually consider it ‘work’? Many of the other tasks done by turkers are micro-tasks – very short, quick jobs such as image tagging or trivia quiz answering.

The demand for transcription is only going to increase. Each month we are recording more and more in digital form, and the demand for it to be made searchable (which is one of the reasons we digitise in the first place) gets stronger and stronger. What other services have other museums tried for dealing with this media overload?

Categories
Digitisation Interactive Media

Computer game history

This is a fascinating archive project from the venerable US gaming magazine Computer Gaming World, which puts its archives from issue 1 (1981) through to 1992 online as PDFs. It has obviously been an enormous scanning and digitisation project.

This is a great trip down memory lane and gives an insight not only into how games have developed, but also into how computer game audiences, advertising, criticism and reviews have changed.

Issue 1 has an amusing piece on the future of gaming – will 16K of memory be enough?

I hope they continue to release the back issues from 1992-2006 at a later date.

Categories
Digitisation Interactive Media Web 2.0

Collections Council Australia – Digital Collections Summit presentation (17/8/06)

Yesterday at the Digital Collections Summit in Adelaide I presented a short five-minute overview of our OPAC2.0 and Design Hub projects, followed by Dr Fiona Cameron introducing the upcoming theoretical research into the impacts of Design Hub.

Quite a few people have asked for a copy of the presentation – so here it is. Unfortunately it doesn’t have the witty banter and arm-waving/finger-pointing that accompanied the ‘real life’ version.

If you would like more information on these projects then please get in touch.

There are several other posts here that cover some of the current and emerging trends in usage of our OPAC2.0, which provide some extra reading.

Download PowerPoint show

Categories
Digitisation Interactive Media Web 2.0

GoogleMaps gaming

GoogleMaps plus gaming –

Goggles : a flight simulator using GoogleMaps as the terrain!
Endgame : real-time strategy wargaming using GoogleMaps

Categories
Digitisation General Metadata

Meta-Media

There’s a really interesting article here from ctheory.net, written by our old mate Lev Manovich, that looks at ‘understanding meta-media’ and examines “what new media does to old media?”, focusing particularly on the idea of simulation.
The article references some great new media works that explore the concept of ‘mapping’ as a key framework for understanding this intersection.

“This is not accidental. The logic of meta-media fits well with other key aesthetic paradigms of today — the remixing of previous cultural forms of a given media (most visible in music, architecture, design, and fashion), and a second type of remixing — that of national cultural traditions now submerged into the medium of globalization. (the terms “postmodernism” and “globalization” can be used as aliases for these two remix paradigms.) Meta-media then can be thought alongside these two types of remixing as a third type: the remixing of interfaces of various cultural forms and of new software techniques — in short, the remix of culture and computers”

Categories
AV Related Digitisation

Cylinder Audio Archive

Cylinder recordings, the first commercially produced sound recordings, are a snapshot of musical and popular culture in the decades around the turn of the 20th century. They have long held the fascination of collectors and have presented challenges for playback and preservation by archives and collectors alike.

With funding from the Institute of Museum and Library Services, the UCSB Libraries have created a digital collection of over 5,000 cylinder recordings held by the Department of Special Collections. In an effort to bring these recordings to a wider audience, they can be freely downloaded or streamed online.

We need more public domain digitisation projects like this.

Categories
Copyright/OCL Digitisation

Google Print Podcast from OpenSource Radio

If you have a long journey home then download this great one-hour podcast on the Google book project. It is a good discussion and covers some essential areas around the privatisation of knowledge and the copyright conflict over Google’s activities.