Ethics at Every Stage of the AI Lifecycle: Data Preparation

How to reduce bias in your training data

As AI receives wider market adoption and is implemented as a tool across use cases, more challenges appear. These projects run into concerns that weren’t apparent in AI’s infancy. One critical and persistent issue is ethical AI and how to manage bias in data.

Data bias is an error in which certain elements of a dataset are overrepresented or underrepresented. When biased data is used to train an AI or machine learning model, it results in skewed outcomes, prejudice, and inaccuracy. Appen is taking a deep dive into what ethical AI data looks like at each stage of the AI lifecycle.

At each step of the data journey, there are common mistakes that could result in biased data. Thankfully, there are ways to avoid these pitfalls. In this article series, we are exploring data bias in the AI lifecycle at each of the four stages:

  • Data sourcing
  • Data preparation
  • Model training and deployment
  • Model evaluation by humans

Not all datasets are created equal, but we want to help you navigate the complicated ethics of data for the AI lifecycle so you can create the best, most useful and responsible dataset for your AI model. 

Bias in Data Preparation

Before data can be used to train an AI model, it must be readable and in a usable format. Data preparation is the second step in the AI data lifecycle, and it’s all about taking a raw set of data and sorting, labeling, cleaning, and reviewing it. Appen provides customers with data preparation services including human annotation and automated AI data annotation. Combining the two delivers the highest quality data with the lowest possible bias.

Data preparation starts off with annotators reviewing each piece of data and providing it with a label or annotation. Depending on the type of data, this can look like:

  • Placing bounding boxes around objects in an image
  • Transcribing audio files
  • Translating written text from one language into another
  • Labeling text or image files

Once data has been labeled by our global crowd of human annotators, it moves into the next phase of data prep: quality assurance. The QA process requires human annotators and machine learning models to review the data for accuracy. If data isn’t a good fit for the project or is mislabeled, it’s removed from the dataset. 

At the end of the data preparation stage, a dataset is moved on to model training. Before a dataset moves to this phase, it’s imperative that it’s consistent, complete, and clean. The highest quality data creates a high-quality AI model. 

There are several ways that bias can slip into the data preparation process and create ethical concerns that will then feed into the AI model. Some of the most common types of bias in data preparation include:

  • Gaps in data
  • Improperly trained data annotators
  • Inconsistent labeling
  • Personal bias
  • Too much or too little data

You’ve Got a Gap in Your Data

One of the most common ways that bias sneaks into AI datasets is through data gaps and underrepresentation. If certain groups or types of data are missing from your dataset, both the data and the resulting AI model output will be biased. Common data gaps include the underrepresentation of minority populations. A gap can also be the underrepresentation of a particular type of data or a rare use case.

Data gaps are often unintentional, so it’s imperative to review your data during the preparation stage and look for them. If a gap isn’t addressed by adding more representative data, it will become part of the data used to train your AI model, and the model will produce less accurate results.
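
One simple way to look for gaps during preparation is to count how often each group appears and flag any group whose share falls below a threshold. The sketch below is only an illustration, not a prescribed method: the `group` field, the sample records, the list of expected groups, and the 10% threshold are all assumptions made for the example.

```python
from collections import Counter

def find_underrepresented(records, key, expected_groups, min_share=0.10):
    """Return {group: share} for expected groups whose share is below min_share.

    Groups that are expected but completely missing show up with a share of 0.0.
    """
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("records must not be empty")
    shares = {g: counts.get(g, 0) / total for g in expected_groups}
    return {g: s for g, s in shares.items() if s < min_share}

# Hypothetical sample: 100 records with an assumed "group" field.
records = (
    [{"group": "A"}] * 60
    + [{"group": "B"}] * 35
    + [{"group": "C"}] * 5
)

gaps = find_underrepresented(records, key="group", expected_groups=["A", "B", "C", "D"])
print(gaps)  # {'C': 0.05, 'D': 0.0}
```

Listing the expected groups up front matters because a group that is missing entirely would never appear in a plain count of what is already in the dataset.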

Your Data Annotators Aren’t Well Trained

Another common way that bias is introduced during the data preparation stage is when untrained data annotators are used to label data. If data annotators aren’t trained to understand the importance of their job, you’re more likely to see labeling mistakes or shortcuts taken during the labeling process.

Providing data annotators with thorough training and supportive oversight can limit the number of mistakes made during the data preparation process. There are a few ways in which untrained data annotators can introduce bias during the labeling process, including inconsistent labeling and personal bias.

Inconsistent Labeling

If multiple annotators are labeling a single dataset, it’s critical to train all of them on how to label each type of data point so that labels stay consistent. When similar types of data are labeled inconsistently, it creates recall bias, which results in a less accurate AI model.
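
Consistency between annotators can be measured rather than assumed. One widely used statistic is Cohen's kappa, which compares how often two annotators agree with how often they would agree by chance. The sketch below is a minimal hand-written version for two annotators; the labels are made-up sample data, and the score at which you retrain annotators is a judgment call for your project.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Fraction of items where the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on the same eight images.
annotator_1 = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
annotator_2 = ["cat", "cat", "dog", "cat", "cat", "dog", "dog", "dog"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))  # 0.5
```

A kappa of 1 means perfect agreement and a kappa near 0 means agreement no better than chance, so a low score on a shared sample is a signal to revisit the labeling guidelines before the full dataset is labeled.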

Personal Bias

Another way that data annotators can introduce bias into the labeling process is through their own personal bias. Each of us holds a unique set of biases and a unique understanding of the world around us. That understanding helps annotators label data, but it can also introduce bias into the data.

For example, if annotators are labeling images of facial expressions with the displayed emotion, annotators from two different countries may well provide different labels. This type of bias is inherent to data preparation but can be limited with a thorough QA process. Companies can also provide data annotators with unconscious bias training to try to limit the effects on their data labeling.

You’re Using Human-Only or Machine-Only Annotation

In the past, the only way to label data was to have a human review each piece of data and provide it with a label. More recently, machine learning programs have been developed that can label data to create training datasets.

As always, the debate has raged: which is better? We like to take a both/and stance: use human annotators to label data, and use machine learning programs to run a quality assurance check on the labels. This creates the highest quality dataset possible.
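
As a rough illustration of this both/and approach, the sketch below compares each human label against a model's prediction and flags the item for a second human look when a confident model disagrees. It is not a description of any particular production pipeline: the field names, the sample items, and the 0.9 confidence threshold are assumptions made for the example.

```python
def flag_for_review(items, min_confidence=0.9):
    """Return ids of items where a confident model prediction disagrees with the human label."""
    return [
        item["id"]
        for item in items
        if item["model_label"] != item["human_label"]
        and item["model_confidence"] >= min_confidence
    ]

# Hypothetical labeled items with a model's prediction attached to each.
items = [
    {"id": 1, "human_label": "dog", "model_label": "dog", "model_confidence": 0.98},
    {"id": 2, "human_label": "cat", "model_label": "dog", "model_confidence": 0.95},
    {"id": 3, "human_label": "cat", "model_label": "dog", "model_confidence": 0.55},
]

to_review = flag_for_review(items)
print(to_review)  # [2]
```

Flagged items go back to a person instead of being overwritten automatically, so the model acts as a check on the human label and not as a replacement for it.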

You Have Too Much or Too Little Data

Another important consideration when assessing your data during the preparation phase is ensuring that you have the right amount of data. It’s possible to have both too little and too much training data.

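If you have too little training data, the algorithm won’t see enough examples to learn the real patterns in the data. A model that fails to capture those patterns is said to be underfitting, and a model trained on a small dataset can also overfit, memorizing the noise in its few examples instead of learning what generalizes. More data is not automatically better, either: when a large dataset is full of noisy, redundant, or irrelevant examples, the model has a harder time separating noise from real signal, and its output becomes less accurate.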

Creating a dataset that’s just the right size for your AI model will increase the quality of the model output. 

You Exclude “Irrelevant” Data

During the data preparation process, it’s important to review your data and to remove any data from the dataset that doesn’t apply to your future model. Be sure to double check carefully before removing the data because what might appear to be “irrelevant” initially or to one person may not actually be irrelevant. Removing “unimportant” data at this stage can result in exclusion bias. Just because one part of the dataset is small or uncommon doesn’t mean it’s not important.
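
One way to double-check a cleaning step is to compare how each group is represented before and after the filter runs. The sketch below flags any group whose share of the dataset drops sharply. The `dialect` and `n_words` fields, the "fewer than five words" filter, and the 50% relative-drop threshold are all hypothetical choices for illustration.

```python
from collections import Counter

def group_shares(records, key):
    """Share of records belonging to each group."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}

def share_drops(before, after, key, max_relative_drop=0.5):
    """Return {group: (share_before, share_after)} for groups hit hardest by a filter."""
    before_shares = group_shares(before, key)
    after_shares = group_shares(after, key)
    flagged = {}
    for group, share in before_shares.items():
        new_share = after_shares.get(group, 0.0)
        if (share - new_share) / share > max_relative_drop:
            flagged[group] = (share, new_share)
    return flagged

# Hypothetical text samples; the regional-dialect samples tend to be short.
before = (
    [{"dialect": "standard", "n_words": 20}] * 90
    + [{"dialect": "regional", "n_words": 3}] * 8
    + [{"dialect": "regional", "n_words": 12}] * 2
)
# A seemingly neutral cleaning rule: drop samples with fewer than five words.
after = [r for r in before if r["n_words"] >= 5]

flagged = share_drops(before, after, key="dialect")
print(sorted(flagged))  # ['regional']
```

In this made-up case a length filter that looks harmless removes most of the regional-dialect samples, which is the kind of exclusion worth catching before the data moves on to training.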

Solutions to Bias in Data Preparation

While there are a number of ways that bias can slip into your dataset during the preparation process, there are just as many solutions. Below, you’ll find some of the ways that bias can be avoided in the data preparation process. 

Hire Diverse and Representative Staff

One of the most important ways to remove bias from the data preparation process is to ensure that the decision-makers and those at the table are broadly representative. Hiring diverse staff can go a long way toward limiting bias in your AI training datasets. 

While hiring diverse staff is the first step, you can also take it a step further and provide all staff with unconscious bias training. Unconscious bias training can help people to better recognize their own personal biases and to consciously look for bias in the data they are labeling.

Add Bias Checks to Your QA

If there’s only one thing that you do to try to limit bias in your data preparation, it should be to add bias checks to your quality assurance process. Most bias is unintentional. This means that it’s slipping into the data because no one recognizes it or thinks to look for it.

By adding bias checks to your quality assurance process, you turn the search for bias into a conscious step. The checks remind your employees to explicitly look at the data for bias and to think critically about what should and shouldn’t be represented in the data. Providing staff members with unconscious bias training will make it easier for them to look for and remove bias during the data preparation process.
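
A bias check can be as simple as a report that reviewers look at during QA. The sketch below computes how often a given label is applied within each group and reports the largest gap between groups. The `region` and `label` fields and the sample data are assumptions for the example, and a large gap is a prompt for human review, not proof of bias on its own.

```python
from collections import defaultdict

def label_rate_by_group(records, group_key, label_key, label):
    """Fraction of records in each group that carry the given label."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r[group_key]] += 1
        if r[label_key] == label:
            hits[r[group_key]] += 1
    return {g: hits[g] / totals[g] for g in totals}

def rate_gap(rates):
    """Difference between the highest and lowest per-group rate."""
    return max(rates.values()) - min(rates.values())

# Hypothetical annotated comments, grouped by the author's region.
records = (
    [{"region": "north", "label": "toxic"}] * 10
    + [{"region": "north", "label": "ok"}] * 40
    + [{"region": "south", "label": "toxic"}] * 25
    + [{"region": "south", "label": "ok"}] * 25
)

rates = label_rate_by_group(records, "region", "label", "toxic")
print(rates)  # {'north': 0.2, 'south': 0.5}
print(round(rate_gap(rates), 2))  # 0.3
```

Reviewers can then pull a sample from the group with the unexpected rate and decide whether the difference reflects the data itself or the way it was labeled.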

Pay Annotators Well and Treat Them Fairly

Bias is pervasive in AI data. It takes a keen eye and thorough training to be able to identify data gaps. A simple way that companies can begin to fight bias in AI training datasets is by making sure that their data annotators are well paid and treated fairly. 

Employees who are well compensated for their work are more likely to have a vested interest in producing high-quality content. When you take care of your employees, you are more likely to get high-quality work in return. At its heart, ethical AI starts with the people who are annotating and cleaning data in order to train AI models. If these people aren’t well compensated for their work, bias is more likely to proliferate.

If we want to build a better, more ethical world for AI models, it starts at the very beginning: with data. The AI lifecycle includes four stops for data, each of which can introduce bias into your training dataset. During the data preparation phase, it’s critical to have well-trained, fairly compensated employees who are cognizant of unconscious bias and can help to eliminate as much of it as possible.

