The Hunt for Human Speech Data

With voice-activated devices launching weekly, one might think that we’re reaching a tipping point in the use of speech recognition technologies. However, a recent Bloomberg article argues that while speech recognition has made great strides in recent years, the way speech data is collected has kept the technology from replacing how most consumers currently interact with their devices. Consumers have embraced the concept of voice-activated devices with enthusiasm, but the actual experience has room for improvement. What is holding the technology back?

More data = better performance

According to the authors, devices need terabytes of human speech data, representing multiple languages, accents and dialects, to better understand and communicate with users. Recent advances in speech engines are the result of a form of artificial intelligence called neural networks, which learn and change over time without precise programming. Loosely modeled after the human brain, these software systems can train themselves to make sense of the human world, and they perform better as the amount of data increases. Andrew Ng, Baidu’s chief scientist, says, “The more data we shove in our systems the better it performs. This is why speech is such a capital-intensive exercise; not a lot of organizations have this much data.” Tech giants including Amazon, Apple, Baidu and Microsoft are now racing to collect natural language data across the globe to improve accuracy. As Adam Coates from Baidu’s AI lab in Sunnyvale, CA, states, “Our goal is to push the error rate down to 1 percent. That’s where you can really trust the device to understand what you’re saying, and that will be transformative.” How can these firms scale their data collection in a cost-effective way while ensuring the human speech data accurately captures the nuance of human language?

It’s about quantity AND quality

While the quantity of data is important, the quality is also critical to optimizing machine learning algorithms. ‘Quality’ in this context includes how well the data fits the use case. For example, if a speech recognition engine is being developed for use in a car, the data needs to be collected in a car for best results, so that it captures all of the typical background noises that the engine would ‘hear’. While it’s tempting to use ‘off the shelf’ data or to collect data using ad-hoc methods, it’s more effective in the long run to collect data specifically for its end use. This principle also applies when building global speech recognition products. Human language data is nuanced, accented and full of cultural bias. Data collection must be undertaken across a multitude of languages, geographies, locations and accents to reduce error rates and improve performance.

Partner with Appen

At Appen, we continue to play a key role in the evolution of natural language processing and conversational understanding. We have spent the last 20 years working with the top global technology companies to build speech and natural language interfaces for the leading virtual assistants on the market today. We have years of experience in language data collection in a wide range of environments, from in-studio to outdoors, using a variety of modalities. Contact us to discuss your specific data collection needs and how we can help you meet your goals.