How Automatic Speech Recognition Works
ASR has come a long way in the last few decades thanks to the power of AI and machine learning algorithms. More basic ASR programs today still use directed dialogue, while advanced versions leverage the AI subdomain of natural language processing (NLP).

Directed Dialogue ASR
You may have experienced directed dialogue when you’ve called your bank. For larger banks, you’ll usually have to interact with a computer before you reach a person. The computer may ask you to confirm your identity with simple “yes” or “no” statements, or to read out the digits in your card number. In either case, you’re interacting with directed dialogue ASR. These programs recognize only short, simple verbal inputs and in turn have a limited lexicon of responses. They’re useful for brief, straightforward customer interactions but not for more complex conversations.

Natural Language Processing-based ASR
As mentioned, NLP is a subdomain of AI. It’s the method for teaching computers to understand human speech, or natural language. In the simplest of terms, here’s a general overview of how a speech recognition program leveraging NLP can work:
- You speak a command or ask a question to the ASR program.
- The program converts your speech into a spectrogram, which is a machine-readable representation of the audio of your words (the first sketch after this list illustrates this step).
- A noise-suppression step cleans up your audio by filtering out background sounds (for instance, a dog barking or static).
- An acoustic model breaks the cleaned-up audio down into phonemes, the basic building blocks of speech sounds. In English, for example, “ch” and “t” are phonemes.
- The algorithm analyzes the sequence of phonemes and uses statistical probability to determine the most likely words and sentences.
- An NLP model applies context to the sentences, determining whether you meant to say “write” or “right”, for example (the second sketch after this list illustrates this step).
- Once the ASR program understands what you’re trying to say, it can then develop an appropriate response and use text-to-speech conversion to reply to you.
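To make the spectrogram step concrete, here is a minimal sketch using SciPy. The file name command.wav, the sample rate, and the window sizes are assumptions, and production front ends typically compute log-mel features, but the idea is the same: turn raw audio into a time-frequency grid that a model can read.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Load the recorded command (assumed: a 16 kHz, mono WAV file).
rate, samples = wavfile.read("command.wav")
samples = samples.astype(np.float64)

# Compute a spectrogram: time on one axis, frequency on the other,
# with each cell holding the energy of that frequency at that moment.
# nperseg=400 is a 25 ms window at 16 kHz; noverlap=240 gives a 10 ms hop.
freqs, times, power = spectrogram(samples, fs=rate, nperseg=400, noverlap=240)

# Log-compress the energies so quiet and loud sounds are comparable,
# which is roughly the kind of feature an acoustic model consumes.
log_power = 10 * np.log10(power + 1e-10)

print(log_power.shape)  # (n_frequency_bins, n_time_frames)
```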
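To illustrate how statistical probability and context resolve words that sound alike (the “write”/“right” example above), here is a toy, self-contained sketch. The bigram probabilities below are invented purely for illustration and are not output from any real model.

```python
# Toy disambiguation of a homophone using a tiny bigram "language model".
# All probabilities below are made up for illustration only.

# Both candidates sound identical, so acoustics alone cannot separate them.
candidates = ["write", "right"]

# Invented bigram probabilities: P(next_word | previous_word).
bigram = {
    ("please", "write"): 0.08,
    ("please", "right"): 0.001,
    ("turn", "right"):   0.12,
    ("turn", "write"):   0.0005,
}

def pick_word(previous_word: str) -> str:
    """Choose the candidate that is most probable given the previous word."""
    return max(candidates, key=lambda w: bigram.get((previous_word, w), 1e-6))

print(pick_word("please"))  # -> "write"  ("please write that down")
print(pick_word("turn"))    # -> "right"  ("turn right at the light")
```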
Automatic Speech Recognition Applications
The possibilities for ASR applications are virtually limitless. So far, many industries have picked up this technology to enhance the customer experience. Here are a few applications that stand out:
- Voice-enabled Virtual Assistants: There are numerous popular examples of virtual assistants: Google Assistant, Apple’s Siri, Amazon Alexa, and Microsoft’s Cortana. These applications are becoming increasingly pervasive in our daily lives due to the speed and efficiency they offer for obtaining information. Expect the virtual assistant market to continue its upward trajectory.
- Transcription and Dictation: Many industries rely on speech transcription services. It’s useful for transcribing company meetings, customer phone calls in sales, investigative interviews in government, and even medical notes for a patient (a minimal transcription sketch appears at the end of this section).
- Education: ASR provides a useful tool for education purposes. For instance, it can help people learn second languages.
- In-car Infotainment: In the automotive industry, ASR is already widely used to provide an improved in-car experience. Recent car models let drivers issue commands such as “turn up the temperature two degrees.” The goal of these systems is to increase safety by letting the driver manage the car’s environment hands-free.
- Security: ASR can provide enhanced security by requiring voice recognition to access certain areas.
- Accessibility: ASR also serves as a promising tool for advancing accessibility. For instance, individuals who have trouble using technology can now issue voice commands on their smartphones; “Call Jane”, for example.
Many of the above applications can easily be used across industries, so it’s unsurprising that the market for ASR technology has been expanding rapidly in recent years.
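As a quick illustration of the transcription use case above, here is a minimal sketch using the open-source openai-whisper package (one of many possible toolkits, chosen here only as an example). The model size and the file name meeting.wav are placeholders.

```python
# pip install openai-whisper   (also requires ffmpeg on the system)
import whisper

# Load a small pretrained speech-to-text model.
model = whisper.load_model("base")

# Transcribe a recorded meeting; "meeting.wav" is a placeholder file name.
result = model.transcribe("meeting.wav")

print(result["text"])
```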
How to Overcome Challenges in Automatic Speech Recognition

As mentioned above, ASR usually operates in live environments where conditions are less than ideal, which affects the technology’s accuracy rate. Many common issues contribute to these conditions and create challenges for teams implementing ASR. Luckily, there are steps you can take to overcome these barriers.

Challenges with ASR
A few common factors create challenges in the field of ASR:

Noisy Data

Noisy data is typically understood to mean meaningless data, but in the context of ASR it also has a literal meaning. In a perfect world, audio files would contain crisp, clear speech with no background noise, but the reality is often the opposite. Audio data can pick up irrelevant noises, such as someone coughing in the background, a second person speaking over the primary speaker, construction noise, and even static. A quality ASR system needs to isolate the useful portions of the audio and discard the meaningless ones (a minimal sketch of this idea follows the list below).

Speaker Variabilities

ASR systems frequently need to understand people of different genders, from different parts of the world, and with different backgrounds. Here are the many ways in which speech can vary from person to person:
- Language
- Dialect
- Accent
- Pitch
- Volume
- Speed
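Returning to the noisy-data challenge: one simple, illustrative way to isolate the useful portions of audio is an energy-based voice activity detector. This is a rough sketch, assuming a WAV input file with an arbitrary file name and threshold; real systems rely on far more robust noise suppression and learned VAD models.

```python
import numpy as np
from scipy.io import wavfile

def energy_vad(path: str, frame_ms: int = 30, threshold_ratio: float = 0.5) -> np.ndarray:
    """Keep only the frames whose short-term energy clears a simple threshold."""
    rate, samples = wavfile.read(path)          # "path" points at a WAV file
    if samples.ndim > 1:                        # mix stereo down to mono
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float64)

    frame_len = int(rate * frame_ms / 1000)     # e.g. 30 ms frames
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)

    energy = (frames ** 2).mean(axis=1)         # short-term energy per frame
    voiced = energy > threshold_ratio * energy.mean()
    return frames[voiced].reshape(-1)           # concatenated "useful" audio

# Usage (placeholder file name):
# speech_only = energy_vad("call_recording.wav")
```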