How Automatic Speech Recognition Works
ASR has come a long way in the last few decades thanks to the power of AI and machine learning algorithms. More basic ASR programs today still use directed dialogue, while advanced versions leverage the AI subdomain of natural language processing (NLP).

Directed Dialogue ASR
You may have experienced directed dialogue when you’ve called your bank. For larger banks, you’ll usually have to interact with a computer before you reach a person. The computer may ask you to confirm your identity with simple “yes” or “no” statements, or to read out the digits in your card number. In either case, you’re interacting with directed dialogue ASR. These programs recognize only short, simple verbal inputs and in turn have a limited lexicon of responses. They’re useful for brief, straightforward customer interactions but not for more complex conversations.

Natural Language Processing-based ASR
As mentioned, NLP is a subdomain of AI. It’s the method for teaching computers to understand human speech, or natural language. In the simplest of terms, here’s a general overview of how a speech recognition program leveraging NLP can work:
- You speak a command or ask a question to the ASR program.
- The program converts your speech into a spectrogram, which is a machine-readable representation of the audio of your words (the first sketch after this list illustrates this step).
- A noise-suppression step cleans up your audio by filtering out background sounds (for instance, a dog barking or static).
- An acoustic model breaks the cleaned-up audio down into phonemes, the basic building blocks of speech sounds. In English, for example, “ch” and “t” are phonemes.
- The algorithm analyzes the sequence of phonemes and uses statistical probability to determine the most likely words and sentences.
- An NLP model applies context to the sentences, determining whether you meant to say “write” or “right”, for example (the second sketch after this list illustrates this step).
- Once the ASR program understands what you’re trying to say, it can then develop an appropriate response and use text-to-speech conversion to reply to you.
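To make the spectrogram step concrete, here is a minimal sketch using SciPy. The file name command.wav, the sample rate, and the window sizes are assumptions, and production front ends typically compute log-mel features, but the idea is the same: turn raw audio into a time-frequency grid that a model can read.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Load the recorded command (assumed: a 16 kHz, mono WAV file).
rate, samples = wavfile.read("command.wav")
samples = samples.astype(np.float64)

# Compute a spectrogram: time on one axis, frequency on the other,
# with each cell holding the energy of that frequency at that moment.
# nperseg=400 is a 25 ms window at 16 kHz; noverlap=240 gives a 10 ms hop.
freqs, times, power = spectrogram(samples, fs=rate, nperseg=400, noverlap=240)

# Log-compress the energies so quiet and loud sounds are comparable,
# which is roughly the kind of feature an acoustic model consumes.
log_power = 10 * np.log10(power + 1e-10)

print(log_power.shape)  # (n_frequency_bins, n_time_frames)
```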
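To illustrate how statistical probability and context resolve words that sound alike (the “write”/“right” example above), here is a toy, self-contained sketch. The bigram probabilities below are invented purely for illustration and are not output from any real model.

```python
# Toy disambiguation of a homophone using a tiny bigram "language model".
# All probabilities below are made up for illustration only.

# Both candidates sound identical, so acoustics alone cannot separate them.
candidates = ["write", "right"]

# Invented bigram probabilities: P(next_word | previous_word).
bigram = {
    ("please", "write"): 0.08,
    ("please", "right"): 0.001,
    ("turn", "right"):   0.12,
    ("turn", "write"):   0.0005,
}

def pick_word(previous_word: str) -> str:
    """Choose the candidate that is most probable given the previous word."""
    return max(candidates, key=lambda w: bigram.get((previous_word, w), 1e-6))

print(pick_word("please"))  # -> "write"  ("please write that down")
print(pick_word("turn"))    # -> "right"  ("turn right at the light")
```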
Automatic Speech Recognition Applications
The possibilities for ASR applications are virtually limitless. So far, many industries have picked up this technology to enhance the customer experience. Here are a few applications that stand out:
- Voice-enabled Virtual Assistants: There are numerous popular examples of virtual assistants: Google Assistant, Apple’s Siri, Amazon Alexa, and Microsoft’s Cortana. These applications are becoming increasingly pervasive in our daily lives due to the speed and efficiency they offer for obtaining information. Expect the virtual assistant market to continue its upward trajectory.
- Transcription and Dictation: Many industries rely on speech transcription services. It’s useful for transcribing company meetings, customer phone calls in sales, investigative interviews in government, and even medical notes for a patient (a minimal transcription sketch appears at the end of this section).
- Education: ASR provides a useful tool for education purposes. For instance, it can help people learn second languages.
- In-car Infotainment: In the automotive industry, ASR is already widely used to provide an improved in-car experience. Recent car models let drivers issue commands such as “turn up the temperature two degrees.” The goal of these systems is to increase safety by letting the driver manage the car’s environment hands-free.
- Security: ASR can provide enhanced security by requiring voice recognition to access certain areas.
- Accessibility: ASR also serves as a promising tool for advancing accessibility. For instance, individuals who have trouble using technology can now issue voice commands on their smartphones; “Call Jane”, for example.
Many of the above applications can easily be used across industries, so it’s unsurprising that the market for ASR technology has been expanding rapidly in recent years.
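As a quick illustration of the transcription use case above, here is a minimal sketch using the open-source openai-whisper package (one of many possible toolkits, chosen here only as an example). The model size and the file name meeting.wav are placeholders.

```python
# pip install openai-whisper   (also requires ffmpeg on the system)
import whisper

# Load a small pretrained speech-to-text model.
model = whisper.load_model("base")

# Transcribe a recorded meeting; "meeting.wav" is a placeholder file name.
result = model.transcribe("meeting.wav")

print(result["text"])
```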
How to Overcome Challenges in Automatic Speech Recognition

As mentioned above, ASR usually operates in live environments where conditions are less than ideal, which affects the technology’s accuracy rate. Many common issues contribute to these conditions and create challenges for teams implementing ASR. Luckily, there are steps you can take to overcome these barriers.

Challenges with ASR
A few common factors create challenges in the field of ASR:

Noisy Data

Noisy data is typically understood to mean meaningless data, but in the context of ASR it also has a literal meaning. In a perfect world, audio files would contain crisp, clear speech with no background noise, but the reality is often the opposite. Audio data can pick up irrelevant noises, such as someone coughing in the background, a second person speaking over the primary speaker, construction noise, and even static. A quality ASR system needs to isolate the useful portions of the audio and discard the meaningless ones (a minimal sketch of this idea follows the list below).

Speaker Variabilities

ASR systems frequently need to understand people of different genders, from different parts of the world, and with different backgrounds. Here are the many ways in which speech can vary from person to person:
- Language
- Dialect
- Accent
- Pitch
- Volume
- Speed
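Returning to the noisy-data challenge: one simple, illustrative way to isolate the useful portions of audio is an energy-based voice activity detector. This is a rough sketch, assuming a WAV input file with an arbitrary file name and threshold; real systems rely on far more robust noise suppression and learned VAD models.

```python
import numpy as np
from scipy.io import wavfile

def energy_vad(path: str, frame_ms: int = 30, threshold_ratio: float = 0.5) -> np.ndarray:
    """Keep only the frames whose short-term energy clears a simple threshold."""
    rate, samples = wavfile.read(path)          # "path" points at a WAV file
    if samples.ndim > 1:                        # mix stereo down to mono
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float64)

    frame_len = int(rate * frame_ms / 1000)     # e.g. 30 ms frames
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)

    energy = (frames ** 2).mean(axis=1)         # short-term energy per frame
    voiced = energy > threshold_ratio * energy.mean()
    return frames[voiced].reshape(-1)           # concatenated "useful" audio

# Usage (placeholder file name):
# speech_only = energy_vad("call_recording.wav")
```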