Testing podcast transcription – Casting Words

Audio transcription is an essential part of digitisation. Our curatorial researchers are recording thousands of hours of interviews with subjects onto a mix of analogue (tapes) and digital (MP3/WAV) media. These oral histories are filed away for preservation purposes but will remain almost unusable in any serious way until they are digitised – that is, transcribed into a searchable machine readable format.

Likewise, we record many events at the museum and in the last few years have begun offering them as podcasts on our websites.

Last week we tried out a service called Casting Words. Casting Words is a transcription service that offers to send back a transcription of any podcast or audio file, quickly and cheaply.

Generally transcribing podcasts, especially those of live talks and events, has been an arduous task, one that even with the best of intentions often doesn’t happen. Transcription has tended to be expensive and time consuming. It has also been typically inaccurate.

Yet without a transcription the contents of the podcast are rendered invisible to all but the most dedicated internet user – who already knows of the podcast’s existence. This is because a transcript not only serves the interests of vision-impaired users and those wanting to skim read before downloading, it also exposes the content of the podcast to search engines thus aiding discoverability.

Here are the results from our test of Casting Words.

TEST 1 – The Sydney Observatory February 2007 night sky guide – this recording runs for about 14 minutes and has one speaker talking throughout. Whilst not explicitly technical and aimed at a general audience it is about constellations and uses common astronomical terms. It is recorded in a quiet room with no background noise and is edited in post-production to a script.

Time taken to transcribe – 24 hours (from submission to receipt of finished product)
Cost of transcription – US$10.50

TEST 2 – The live recording of a D-Factory public talk titled “Pop-ups, fold-outs and other design adventures” – this recording runs for 61 minutes and consists of four individually microphoned speakers recorded into a single stationary video camera via a live mixing desk feed. The same audio feed is used as the signal to the PA system in the room. As a result of the room acoustics, mixdown and speaker behaviour each speaker’s voice is inconsistently recorded and the recording itself fluctuates in volume. The talk is set up in the manner of a studio interview – a little like a TV chat show – it is totally unscripted and the recordings have no post production editing. There is background noise present throughout.

Time taken to transcribe – 48 hours
Cost of transcription – US$45.50
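As a rough check on the pricing, the two charges can be converted into a per-minute rate. The following is a minimal Python sketch that uses only the durations and costs reported for the two tests (the 14-minute duration is approximate, and the function name is just for illustration).

```python
def cost_per_minute(total_usd, minutes):
    """Return the effective rate in US dollars per minute of audio."""
    return total_usd / minutes


# (cost in US$, duration in minutes) as reported for each test
tests = {
    "Sydney Observatory night sky guide": (10.50, 14),  # about 14 minutes
    "D-Factory public talk": (45.50, 61),
}

for name, (cost, minutes) in tests.items():
    print(f"{name}: US${cost_per_minute(cost, minutes):.2f} per minute")
```

Both tests work out to roughly 75 US cents per minute of audio.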

How accurate are the transcriptions?

Other than American spellings, the transcripts are very accurate. In the Sydney Observatory transcript there was one numerical error that has been corrected. The D-Factory transcript is a little more difficult to check but there do not seem to be any significant errors – which, given the original recording quality, is surprising. There is one instance where the transcriber has noted that all the speakers were speaking at once and thus no transcript was available for those few seconds.

How does Casting Words work and why is it so cheap?

Casting Words uses Amazon’s Mechanical Turk to divide complex work into small tasks which are advertised for freelancers (turkers, as they are known) to perform – anywhere, anytime in the world. There are some tasks that humans perform better than machines and Amazon’s Mechanical Turk uses its machines to allocate these tasks more efficiently. The name ‘mechanical turk’ comes from an (in)famous hoax by Hungarian baron Wolfgang von Kempelen in the late 18th century.

Turkers who undertake Casting Words transcription tasks are not unqualified. Each has to undertake a small qualification task, and their rate of payment depends upon their qualification level. Also, each transcription is edited and checked as separate tasks. There seem to be about 10,000 qualified transcribers and 3500 qualified editors. A report on Mechanical Turk from mid last year interviewed Casting Words, who explain how it works.

With a little code, plus the turkers, it has succeeded in basically automating the process. The company charges its customers from 42 cents a minute for podcast transcription to 75 cents a minute for other audio. CastingWords pays Mechanical Turk workers as little as 19 cents a minute for transcription. If a transcription job is posted on Mechanical Turk for a couple of hours at the rate of 19 cents a minute, and no worker has taken on the project, the software simply assumes the price is too low and starts raising it.

After a transcription assignment is accepted by a worker, and completed, it goes back out on Mechanical Turk for quality assurance, where another worker is paid a few cents to verify that it’s a faithful transcript of the audio. Then, the transcript goes back on Mechanical Turk a third time for editing, and even a fourth time for a quality assurance check. “It’s been terribly useful for us,” says Nathan McFarland of Seattle, one of the co-founders of CastingWords. Transcription is the type of relatively steady task that keeps turkers with good ears who are fast typists coming back. “There are people who have been with us for months, and they’re not leaving,” says McFarland.
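The price-raising behaviour described in the excerpt can be sketched in a few lines of Python. This is only an illustration, not Casting Words’ actual software: the 19 cent starting rate comes from the excerpt, while the step size, the price cap and the function and parameter names are all assumptions made for the example.

```python
def escalate_rate(start_cents, step_cents, cap_cents, accepted_at_round):
    """Simulate re-posting a job at a higher rate until a worker accepts.

    accepted_at_round is a hypothetical stand-in for worker behaviour:
    the 0-based posting round at which someone takes the job, or None
    if nobody ever does. Returns the rates offered, in cents per minute.
    """
    offered = []
    rate = start_cents
    round_no = 0
    while rate <= cap_cents:  # assumed cap so the loop always ends
        offered.append(rate)
        if accepted_at_round is not None and round_no == accepted_at_round:
            break  # a worker accepted the job at this rate
        rate += step_cents  # no taker: assume the price was too low
        round_no += 1
    return offered


# Start at 19 cents, raise by an assumed 5 cents per round, cap at 40 cents.
print(escalate_rate(19, 5, 40, accepted_at_round=2))  # [19, 24, 29]
```

In this sketch each round stands for the “couple of hours” the job sits unclaimed before the rate goes up.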

The article is essential reading as it also explores the criticisms of Mechanical Turk – the nature of labour allocated under this system, the pay rates and worker agreements, and the question raised by many people who do the work, “do they actually consider it as ‘work’?”. Many of the other tasks done by turkers are micro-tasks – very short, quick tasks such as image tagging, or trivia quiz answering.

The demand for transcription is only going to increase. Each month we are recording more and more in digital form, and the demand for it to be made searchable (which is one of the reasons we digitise in the first place) gets stronger and stronger. What other services have other museums tried to deal with this media overload?