Make your Speech Recognition System Sing

What You Can Gain From Using Voice Recognition Datasets for Machine Learning

Your data can be the difference between an efficient, cost-effective voice recognition system and one that doesn’t work very well. When it comes to machine learning, data is one of the most important components of a successful launch and a return on investment. If you’re planning to build a voice recognition system or conversational AI, you’ll need a big speech recognition dataset. One of the struggles that many companies face today is how to get the data they need and how to make sure that data is high quality enough to build a successful machine learning model. Pre-labeled datasets could be the solution.

How Speech Recognition Datasets Can Benefit Your Organization

The importance of pre-labeled datasets is in how they can benefit your company or organization. Pre-labeled datasets allow organizations to get to the deployment phase faster and for less money. When you opt for a pre-labeled dataset instead of building your own or purchasing a custom dataset, you spend less effort on collecting and labeling data, so the majority of your team’s time and money can go toward building and training your speech recognition model. That focus results in a higher-quality model, and a better model gives you a higher return on your investment, with better results and better insights. No matter where you are in the world, your organization can benefit from pre-labeled data. Pre-labeled datasets offer better data at a more affordable cost, allowing more organizations to effectively build and launch speech recognition machine learning models.

Pre-labeled Datasets in Practice

An example of a pre-labeled dataset in practice comes from MediaInterface. While MediaInterface has been working with healthcare-related institutions and collecting data for over 20 years, the vast majority of their data is in German, which is the language spoken in their primary markets. When MediaInterface wanted to expand to France, they needed data. Another hurdle they faced is that much of the place name data was redacted due to GDPR protections and guidelines. That’s when MediaInterface came to Appen. Using one of Appen’s pre-labeled datasets, MediaInterface was able to get 21,000 French names and 14,000 place names in their dataset. This data helped them to launch efficiently in a new market.

Through the use of a pre-labeled dataset, MediaInterface was able to efficiently launch in a new market without incurring large costs.

Pre-Labeled Speech Recognition Datasets

Pre-labeled datasets are a newer option for companies that don’t have the time or resources to build their own custom dataset. A pre-labeled speech recognition dataset is a set of audio files that have been labeled and compiled for use as training data for a machine learning model, for use cases such as conversational AI. The beauty of pre-labeled datasets is that they’re built and ready to go. Before pre-labeled datasets, companies had to either build their own dataset from scratch, collecting and labeling each data point, or hire a company to build the dataset for them. Both building your own and buying a custom dataset are hard on company resources, costing money or time.

Now, there is a wealth of options out there for pre-labeled speech recognition datasets. When it comes to pre-labeled datasets, you’ll find two options: for purchase or open source. Both options have their place; you’ll just have to find the right one for your company. Across the internet, you’ll find a dozen or more resources for finding and purchasing pre-labeled speech recognition datasets. At Appen, we have over 250 datasets, which include audio datasets with over 11,000 hours of audio and 8.7 million words across 80 different languages and multiple dialects.

Examples of Pre-Labeled Datasets Available for Purchase

Pre-labeled datasets, whether you’re getting them from us or another vendor, are a great resource for jumpstarting an AI or machine learning project. Because a pre-labeled dataset is already built, you can jump directly to training your model with no delays. Using a pre-labeled dataset is cost-effective and speeds up your time to deployment. While building or buying your dataset would take an average of eight to twelve weeks from start to finish, you can purchase and receive a pre-labeled dataset in days to a week. There are a number of online resources for finding pre-labeled speech recognition datasets. You can start on our website and filter for audio datasets, or check out any of the other paid or open-source dataset resources we suggest below. Each of the datasets below includes speech audio files and text transcriptions that you can use to build up your speech corpora with utterances from a variety of speakers in a number of different acoustic conditions, making for high-quality, varied data.

Appen: Arabic From Around the World

Our repository of pre-labeled speech recognition datasets includes a number of different sets for Arabic being spoken around the world. We have datasets of Arabic speakers in Egypt, Saudi Arabia, and the UAE.

Appen: Baby Crying

One of our newest pre-labeled audio datasets is of pre-recorded and annotated baby sounds. In these audio files, you’ll hear different baby cries and sounds. This dataset would be great for training AI models to recognize different infant sounds and types of cries, which would then be able to alert parents.

Appen: Less Common Languages

One of the major issues with the pre-labeled datasets you’ll find on the market is that they focus on European languages or English. Our repository of pre-labeled datasets includes less common languages, such as:
  • Bahasa Indonesia
  • Bengali (Bangladesh)
  • Bulgarian (Bulgaria)
  • Central Khmer (Cambodia)
  • Croatian
  • Dari (Afghanistan)
  • Dongbei (China)
  • Greek
  • Hungarian
  • Pashto
  • Polish
  • Turkish
  • Uygur (China)
  • Wuhan Dialect (Chinese)
This is just a small selection of the languages and dialects that you’ll find in our more than 100 pre-labeled speech recognition datasets.

Appen: Non-Native Chinese speakers

Another dataset in our repository of pre-labeled speech recognition datasets features non-native speakers speaking Chinese. This type of dataset can be great for adding a wider variety of speakers and accents to your training data, which will result in a better-performing machine learning model. This dataset includes 200 hours of non-native speakers speaking Chinese. Speakers come from countries and regions such as:
  • Argentina
  • Australia
  • Canada
  • Egypt
  • Hong Kong
  • India
  • Indonesia
  • Japan
  • Kazakhstan
  • Kenya
  • Korea
  • Kuala Lumpur
  • Kyrgyzstan
  • Laos
  • Malaysia
  • Mauritius
  • Mongolia
  • Philippines
  • Russia
  • Singapore
  • South Africa
  • Tajikistan
  • Thailand
  • Turkey
  • United States
  • Vietnam
While this dataset is quite inclusive, it doesn’t include data from South Korea or Brazil. There’s also no data recorded by minors. To protect privacy, all sensitive and personal information has been scrubbed.

Appen: Languages Spoken Across the Globe

Another unique feature of our pre-labeled datasets is that you can get datasets for one language as spoken in different regional dialects. For example, German isn’t only spoken in Germany. If you’re creating a machine learning model for German speakers, your data will be incomplete if your dataset features only German speakers from Germany. These around-the-world datasets include:
  • English
  • French
  • Spanish
  • German
  • Italian
Our pre-labeled datasets cover not only a wide range of languages but also a variety of dialects.


LibriSpeech

A non-Appen pre-labeled dataset that we highly recommend is LibriSpeech. This dataset is drawn from the LibriVox project, a collection of recordings of people reading audiobooks. The dataset includes about a thousand hours of speech data that’s been segmented and labeled.

M-AI Labs Speech Dataset

Another common issue with speech recognition datasets is that they’re not representative of gender. They often feature male voices heavily and have few female voices, which can cause gender biases in the abilities of voice assistants and other machine learning models. That’s why we recommend the M-AI Labs Speech Dataset in our list of pre-labeled datasets. It has almost 1,000 hours of audio paired with transcriptions and represents male and female voices across several languages.

There are a number of different sources where you can find high-quality, pre-labeled datasets to train your machine learning model and get to the deployment stage efficiently.

Open Source Speech Recognition Datasets

Using a pre-labeled dataset to train your speech recognition machine learning model is an efficient and cost-effective way to get to deployment. But, if you’re on a really tight budget for development, there’s another, even less expensive option out there for you. Open source speech recognition datasets are available and free to use. These open datasets include audio files and text transcriptions that have been put together by various groups or people. You can find open-source datasets from a variety of different sources online. You may have to spend a little extra time researching to find an open-source dataset and verifying its quality, but the extra time can save you quite a bit of money. Here are a few open-source speech recognition datasets we recommend trying.


Kaggle

A great place to find open-source speech recognition datasets is Kaggle. Kaggle is an online community where data scientists and machine learning engineers gather to share data, ideas, and tips for building machine learning models. On Kaggle you can find over 50,000 open-source datasets for a wide variety of use cases.

Common Voice

Another great open-source speech recognition dataset comes from Common Voice. This dataset consists of over 7,000 hours of speech in over 60 different languages. What sets this dataset apart from others is that it includes metadata tags for age, sex, and accent, which can help you train your machine learning model and produce more accurate results.
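Metadata tags like these make it possible to check how balanced a dataset is before you train on it. The following is a minimal sketch, assuming the metadata arrives as a tab-separated file with one row per clip. The column names (`age`, `gender`, `accent`), the helper `metadata_shares`, and the sample rows are illustrative assumptions, not a description of any particular release, so check the headers of the file you actually download.

```python
import csv
import io
from collections import Counter


def metadata_shares(tsv_file, column):
    """Return each value's share of clips for one metadata column.

    Rows where the column is empty or missing are counted as "unspecified".
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter((row.get(column) or "unspecified") for row in reader)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


# Made-up rows standing in for a real metadata file (hypothetical columns).
sample_tsv = io.StringIO(
    "path\tsentence\tage\tgender\taccent\n"
    "clip1.mp3\thello there\ttwenties\tfemale\tus\n"
    "clip2.mp3\tgood morning\tthirties\tmale\tengland\n"
    "clip3.mp3\tsee you soon\t\tmale\tus\n"
    "clip4.mp3\tthank you\tforties\tmale\t\n"
)

shares = metadata_shares(sample_tsv, "gender")
for value, share in sorted(shares.items()):
    print(f"{value}: {share:.0%}")
```

A heavily skewed share for any one value is a signal to look for additional data before training, rather than proof that the resulting model will be biased.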


homink

Coming from the National Institute of Korean Language, homink is a speech corpus that includes 120 hours of people speaking Korean. This specialized open-source dataset is a great resource for those working on machine learning projects and wanting to include the Korean language.


siddiquelatif

Another unique open-source dataset is siddiquelatif. This dataset includes 400 utterances in Urdu, which have been collected from Urdu talk shows. The utterances represent both male and female speakers and a variety of emotions.

Open-source datasets can sometimes lack in size and quality when compared to pre-labeled datasets that are available for purchase, but they’re a great option if you’re looking to launch your machine learning project on a tight budget. With a little research and digging, you can find high-quality open-source speech recognition datasets.

Potential Problems with Speech Recognition Data

One of the critical elements of machine learning model training data is quality. If you put high-quality training data into your machine learning model, you’ll get high-quality results out. If you’re not using high-quality data, your results won’t be as good. While high-quality data may seem like a nebulous concern, there are a few big problems to watch out for when examining and choosing a pre-labeled dataset.

Overlooking Less Common Languages

Many pre-labeled datasets aren’t representative of all languages or even of the most commonly spoken languages. When looking through pre-labeled datasets online, you’ll notice that datasets are harder to find for certain languages. This language bias can make creating and training a representative machine learning model a struggle. While this bias exists, you can also find a number of programs working to correct it. For example, the open-source datasets homink and siddiquelatif represent Korean and Urdu, respectively. Another database for under-represented languages comes from the Computer Research Institute of Montreal. This database makes it easier to access recordings of Indigenous languages being spoken and to create reliable transcriptions. The Indigenous languages included in this database are:
  • Inuktitut
  • East Cree
  • Innu
  • Dénésuline
While you might be able to find other datasets of Indigenous languages being spoken, what makes this dataset unique is its annotations and indexing. The database can be searched by keyword and comes with tools for speech segmentation and language labeling. This type of high-quality dataset makes it possible to create automatic speech recognition for Indigenous languages.

When looking for pre-labeled datasets and building speech recognition machine learning models, it’s important to be aware of potential bias. Look for bias in your datasets and try to avoid building it into your model.

Using Biased Data

Another major problem with pre-labeled datasets is biased data. When it comes to data and speech recognition machine learning models, there are a number of different forms of bias. The two most common forms are gender bias and racial bias. In general, machine learning models on the market are less capable of recognizing speech from women and people of color. And while speech recognition software has made progress in recent years, it’s not enough.

A 2020 Stanford University study looked at speech-to-text transcriptions of 2,000 voice samples from services run by Amazon, IBM, Google, Microsoft, and Apple. It found that those speech-to-text services misidentified words from Black speakers at nearly double the rate at which they misidentified words from white speakers. This gap points to a lack of diversity, and a bias, in the training data. To deploy a successful machine learning model, it’s critical that your data be representative of the whole population, not just a portion of it.

Racial bias isn’t the only bias that speech recognition machine learning models are facing. Research has also found gender bias in speech recognition models. Research done by Dr. Tatman and published by the North American Chapter of the Association for Computational Linguistics found that Google’s speech recognition software was 13% more accurate for men than for women. This difference may seem small, but it’s important to note that Google had the least gender bias when compared to Bing, AT&T, WIT, and IBM Watson.

Like any machine learning model, speech recognition models learn by being trained on a large amount of data. This is why the quality of your training dataset is so critical to deploying a successful machine learning model. If you use biased, low-quality data, your model will produce biased, low-quality results. The system will mimic the biases found in the data. Even when these biases are unintentional, they can still be harmful to users and to the company’s bottom line.

The more diverse your data, the less biased your machine learning model.
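
Gaps like the ones these studies report are typically measured with word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the system’s transcript into the reference transcript, divided by the number of reference words. The sketch below shows one way you might compare WER across speaker groups in your own test set. The function names, group labels, and transcripts are made up for illustration, and this is not the methodology of either study.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming Levenshtein distance computed over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Pool errors per group; samples are (group, reference, hypothesis)."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        n_errors, n_words = word_errors(reference, hypothesis)
        errors[group] += n_errors
        words[group] += n_words
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Made-up transcripts, for illustration only.
samples = [
    ("group_a", "turn on the kitchen lights", "turn on the kitchen lights"),
    ("group_a", "what is the weather today", "what is the weather to day"),
    ("group_b", "turn on the kitchen lights", "turn on the chicken light"),
    ("group_b", "what is the weather today", "what is whether today"),
]

rates = wer_by_group(samples)
for group, rate in sorted(rates.items()):
    print(f"{group}: WER {rate:.2f}")
```

In this toy data the error rate for group_b comes out at twice that of group_a. A gap of that kind in a real evaluation would suggest the training data under-represents the second group.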

How to Avoid Bias in Speech Recognition Data

When building a machine learning model, it’s critical to use unbiased training data to ensure the success of your model and a high return on your investment. Eliminating and avoiding bias in your machine learning model isn’t a one-and-done step. Getting rid of bias requires attention to detail, planning, and thoughtfulness. A few small examples of how you can lower bias in your machine learning models include:
  • Provide implicit bias training to improve bias awareness. Resources such as Harvard’s Project Implicit and Equal AI provide programs and workshops.
  • Search for less biased data and don’t settle for the first pre-labeled dataset you find.
  • Investigate data providers and review their writing on bias in AI.
  • Use a diverse group of testers to catch bias before you launch your machine learning model.
  • Acknowledge that bias is part of our world and part of our data.
As machine learning models become a bigger part of our everyday lives, it’s critical that everyone be able to use the technology equally.

Create AI That Learns and Adapts

A big shift that can help eliminate bias is building machine learning models that learn and adapt as they’re used. Models that learn as they go can adjust to different groups of people and environments, which makes them less biased. An example of this in action comes from Verbit, an in-house AI that gets smarter with each use. Users can upload a glossary of terms, including speaker names and complex words, so that the machine learning tool can recognize those words more easily and create more accurate transcriptions. The model can also learn from corrections made later, when humans review the transcription. This back-and-forth between human and model allows the model to keep learning, changing, and adapting, which makes for a less biased model that can be used by everyone. As this example shows, AI should adapt to the user, not the other way around. There’s no need to settle for mediocre results when machine learning models can continuously learn and improve as more people interact with them.

Diversity in Hiring

When it comes to bias, you can’t just play the short game. Bias is a part of our culture, and to eliminate it in our technology, we have to lessen it in our communities. This means making changes to hiring practices. When your team is more representative, your machine learning model and data will be more representative. The more diversity you have sitting at the table reviewing projects, decisions, and data, the less likely you are to build implicit bias into your machine learning models. We naturally, and understandably, build for our own. But that doesn’t make for the best products or models. To build the best products that work for everyone, it’s critical to involve more diverse people in the process. This starts with your hiring practices.

How Appen Can Help

If you’re looking for a high-quality, pre-labeled dataset to help train your speech recognition model, Appen has what you need. We have a wide variety of pre-labeled datasets that can be used for various use cases. With datasets representing over 80 different languages and dialects, you’ll be able to find just the right data that you need. At Appen, we also strive to provide representative, unbiased data. No matter what you’re looking for, we have the resources to help you. Choose from our pre-labeled datasets for speech recognition, purchase a custom-made speech recognition dataset from us, or, if we don’t have it, let us help you find the right pre-labeled dataset for your use case. From start to finish, we have the tools you need to deploy your speech recognition machine learning model. Learn how a pre-labeled dataset could save you time and money.