Technical transcription for ASR: helping machines learn

For many of us, voice-powered technology has become part of our daily lives. But making sure Automatic Speech Recognition (ASR) is as accurate as possible presents an ongoing technical challenge. Appen’s technical transcripts are playing a key role in providing machines with the data they need to understand the human voice better.

Work first began in the 1950s to develop a way of enabling machines to recognise the different patterns and sounds of the human voice. However, it was not until the 1990s that real technical advances were made; even then, the practical applications of ASR were not clear. Today, ASR is becoming increasingly sophisticated, and we’re discovering more and more ways to use it, from telling a machine to carry out a specific action to authenticating the identity of a speaker.

The biggest challenge is to ensure ASR systems are as accurate as possible, and that even when confronted with an unusual accent or a rare dialect they can still ‘understand’ and respond to what is being said. In practice this means machines must recognise and process an almost immeasurably vast amount of data. And to do this, just like humans, they need to be trained. That’s where Appen’s technical transcripts come in.

Technical transcripts are very different from transcripts for official records (the verbatim written accounts Appen provides for the police, coroner or courts), from the way they are created to the way they are used.

Simon McCartin, one of Appen’s Linguistic Project Managers, explains. “Just as with an official record transcript, we start with an audio recording,” he says. “But with a technical transcript the first step is to clean up the audio, removing any sounds that are not usable. The audio is then broken up into short sections of between five and nine seconds, called ‘utterances’.” These ‘utterances’ can then be annotated to further identify, eliminate or group together sounds the client does or doesn’t need.
They can also be time-stamped, to make it easier to find specific types or patterns of speech, for example a Cockney accent or the Fife pronunciation of the letter ‘j’.

“The final stage is to produce a text file, checked for quality and spell-checked for consistency, that is then converted into whatever format the client requires,” says Simon. “The resulting transcript looks quite unlike the kind of verbatim official transcript that might be used in a courtroom.”

Technical transcribers need specific skills: official records transcribers, for example, don’t normally edit audio files. But close attention to detail, an expert knowledge of grammar and strict adherence to information security protocols are requirements for both disciplines. At Appen, some transcribers work across the two.

As the technology underpinning ASR develops and the practical applications become clearer, Appen is providing technical transcription services to an ever-expanding range of clients. It is also working with partners all over the world to collect the raw data (audio recordings) in languages other than English.

Whilst the ideas behind ASR might sound complex, put simply, the more data a machine processes, the better it gets at recognising the quirks and anomalies of human speech. So, next time you ask your smart home device to play a tune or answer a question, remember: an awful lot of human skill went into helping the machine get it right.

Learn more about Appen’s technical transcription services here.
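
For readers who like to see the idea in code, the segmentation, time-stamping and annotation steps Simon describes could be sketched roughly as follows. This is only an illustration, not Appen’s actual tooling: the `Utterance` class, the `split_at_pauses` and `find_by_tag` functions, the list of pause positions and the `"cockney"` tag are all hypothetical. The only figures taken from the article are the five and nine second bounds.

```python
from dataclasses import dataclass, field

MIN_LEN = 5.0  # seconds, lower bound quoted in the article
MAX_LEN = 9.0  # seconds, upper bound quoted in the article


@dataclass
class Utterance:
    """One time-stamped slice of a cleaned recording (hypothetical structure)."""
    start: float                            # offset into the recording, in seconds
    end: float
    text: str = ""                          # transcript, filled in by a transcriber
    tags: set = field(default_factory=set)  # illustrative annotation labels

    @property
    def duration(self) -> float:
        return self.end - self.start


def split_at_pauses(total, pauses):
    """Cut a recording of `total` seconds into utterances.

    Each cut is placed at the latest known pause that keeps the piece
    between MIN_LEN and MAX_LEN; if no pause fits, the piece is cut at
    MAX_LEN. The final piece may be shorter than MIN_LEN.
    """
    utterances = []
    start = 0.0
    while total - start > MAX_LEN:
        candidates = [p for p in pauses if start + MIN_LEN <= p <= start + MAX_LEN]
        end = max(candidates) if candidates else start + MAX_LEN
        utterances.append(Utterance(start, end))
        start = end
    if total > start:
        utterances.append(Utterance(start, total))
    return utterances


def find_by_tag(utterances, tag):
    """Return the (start, end) time stamps of every utterance carrying `tag`."""
    return [(u.start, u.end) for u in utterances if tag in u.tags]


# Example: a 30-second recording with assumed pause positions.
segments = split_at_pauses(30.0, [4.0, 7.5, 12.0, 15.5, 21.0, 26.0])
segments[1].tags.add("cockney")  # an annotator labels the second utterance
for u in segments:
    print(f"{u.start:5.1f} to {u.end:5.1f} s  tags={sorted(u.tags)}")
print(find_by_tag(segments, "cockney"))
```

Cutting at pauses rather than at fixed intervals is a design choice made here for illustration; the article does not say how the split points are chosen in practice.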
