Key Metrics to Monitor Data Quality

AI is only as good as the data that powers it

The success of an AI-powered program or device depends on the data used to train the model. Poor-quality training data produces a poorly trained model, which can require additional time and budget to retrain and test. The best way to prevent this is to build quality checks into the model training process. It’s important to note that not all quality metrics serve the same purpose, and some are better suited to certain types of data than others.

These metrics are:

  • Inter-rater Reliability – single and double review, audits
  • F1 Score – precision, recall
  • Accuracy – golden datasets, quizzes

Not all data is created equal, nor are metrics. Different types are suitable for different project needs.

Inter-rater Reliability

Single review is the process of having two separate contributors work on the same piece of data (one to annotate it and one to confirm it’s annotated correctly) and checking whether they agree. If they do, the data is determined to be annotated correctly. If the two disagree, double review is needed: a third contributor works on the piece of data. If the third annotation matches either of the first two, that’s deemed the correct answer. If there are no matches, the data is thrown out and goes through the process again. Matching doesn’t have to be all or nothing. If desired, partial matches can be allowed, which is where an accuracy threshold comes into play. If that threshold is not met, the data won’t be high-quality enough to train the model to function as intended.
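The review flow above can be sketched in a few lines of Python. This is a minimal illustration, not Appen’s implementation: the names resolve and overlap, the use of Jaccard overlap as the partial-match measure, and the default threshold are all assumptions made for the example.

```python
from typing import Callable, FrozenSet, Optional

Labels = FrozenSet[str]


def overlap(a: Labels, b: Labels) -> float:
    """Jaccard overlap between two label sets (1.0 means identical).

    Assumption: partial matches are measured this way; a real project
    would pick a measure suited to its annotation type.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def resolve(
    first: Labels,
    second: Labels,
    get_third: Callable[[], Labels],
    threshold: float = 1.0,
) -> Optional[Labels]:
    """Return the accepted annotation, or None if the item must be redone."""
    # Single review: the two contributors agree closely enough.
    if overlap(first, second) >= threshold:
        return first
    # Double review: a third contributor is asked only on disagreement.
    third = get_third()
    for earlier in (first, second):
        if overlap(third, earlier) >= threshold:
            # Design choice for this sketch: keep the earlier annotation
            # that the third contributor confirmed.
            return earlier
    # No match: discard the item and send it through the process again.
    return None
```

With the default threshold of 1.0 only exact matches count. Lowering it (for example to 0.5) allows partial matches, so {"cat", "dog"} and {"cat"} would be accepted without a third review.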

Auditors can work in combination with single and double review or operate separately. Auditors are experienced contributors with a consistent record of high-quality work who are tasked with evaluating completed data to check that it was annotated correctly. These auditors also provide feedback to those who worked on the data, letting them know if something was done incorrectly. It’s wise to have more than one auditor working on a project to allow more data to be audited and prevent bad data from making its way to the model.

F1 Score

F1, often used in classification datasets, is a score of the model’s predictive accuracy based on the provided training data. Two metrics are essential to calculating this score: recall and precision. Recall is the fraction of relevant items that are retrieved. Precision is the fraction of retrieved items that are relevant. F1 combines the two into a single number (their harmonic mean), so customers find it helpful for balancing precision and recall in their data labeling. For those who only need a score on either precision or recall, F1 is not as beneficial.
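As a rough illustration of these definitions, the sketch below computes precision, recall, and F1 for one class of interest. The function name, the "spam"/"ham" labels, and the choice to return 0.0 when a denominator is zero are assumptions made for the example.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    truth: Sequence[str],
    predicted: Sequence[str],
    positive: str,
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the `positive` class."""
    tp = fp = fn = 0
    for t, p in zip(truth, predicted):
        if p == positive and t == positive:
            tp += 1  # retrieved and relevant
        elif p == positive:
            fp += 1  # retrieved but not relevant
        elif t == positive:
            fn += 1  # relevant but not retrieved
    # Assumption: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "spam", "spam"]
precision, recall, f1 = precision_recall_f1(truth, predicted, positive="spam")
```

In this toy example two of the three predicted "spam" items are correct and two of the three true "spam" items are found, so precision, recall, and F1 all come out to about 0.67.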


Accuracy

Quiz-based accuracy is measured by tests administered before and during the project. Pre-screening is the process our crowd goes through to make sure they understand how to annotate data according to the specific project requirements. There’s a set number of questions they need to answer correctly to be allowed to work. Additional quizzes are given throughout the project.

Another method of conducting quizzes is through golden datasets. These are pre-labeled pieces of data that are integrated, as a quiz, into the dataset being annotated. After a contributor annotates the embedded quiz items, an accuracy score is provided. If a contributor achieves a certain score on either quiz method, they can continue to work on the project. These types of tests allow project owners to easily identify anyone not meeting project requirements and remove them, and the data they’ve worked on, from the model being trained.
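A golden-dataset check can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout (item ID mapped to label), and the 0.75 passing score are hypothetical and not taken from any particular platform.

```python
from typing import Dict, Set


def golden_accuracy(answers: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of golden items a contributor labeled correctly.

    Only items that appear in `gold` and that the contributor actually
    answered are scored; other annotations are ignored.
    """
    scored = [item for item in gold if item in answers]
    if not scored:
        return 0.0
    correct = sum(answers[item] == gold[item] for item in scored)
    return correct / len(scored)


def passing_contributors(
    all_answers: Dict[str, Dict[str, str]],
    gold: Dict[str, str],
    min_score: float = 0.75,  # hypothetical threshold
) -> Set[str]:
    """Names of contributors whose golden accuracy meets the threshold."""
    return {
        name
        for name, answers in all_answers.items()
        if golden_accuracy(answers, gold) >= min_score
    }


gold = {"img1": "fries", "img2": "no_fries", "img3": "fries", "img4": "fries"}
all_answers = {
    "contributor_a": {"img1": "fries", "img2": "no_fries", "img3": "fries",
                      "img4": "fries", "img9": "fries"},
    "contributor_b": {"img1": "fries", "img2": "fries", "img3": "no_fries",
                      "img4": "fries"},
}
allowed = passing_contributors(all_answers, gold)
```

Here contributor_a answers every golden item correctly and stays on the project, while contributor_b scores 0.5 and falls below the assumed threshold.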

Obtaining the Right Data

It’s not enough to make sure obtained data is annotated accurately according to project requirements; the data also needs to be complete and beneficial for the program or device. Complete data covers all the use cases required to successfully train the model.

There are four main ways to source data:

  • Collect manually
  • Use a hybrid model of technology and human-in-the-loop
  • Use a pre-labeled dataset (PLD)
  • Use synthetic data

Manually obtaining all the data needed is an excellent option if there are no budget or time restrictions. Businesses needing to expedite the process can use a PLD. We have more than 250 PLDs available on our site, ready to use right off the shelf. A hybrid model can also be used, where pre-labeled data serves as a starting point and humans then work on getting the remainder of the data ready for model training.

Alternatively, if the data is sensitive in nature (medical or financial, for example), it’s beneficial to use generated data where the values aren’t associated with a live human. Generated data, known as synthetic data, is free from personally identifiable information (PII) and is an ideal choice for hard-to-come-by edge cases. We partnered with Mindtech to bring these synthetic data solutions to our customers.

Quality Data Starts with Annotators

One essential way to guarantee data is high quality is to use dedicated annotators who are committed to labeling data accurately and can adhere to project requirements. At Appen, we have a dedicated crowd of over one million people living around the globe. Through our managed services, crowd members pass rigorous pre-screening labeling tests which ensure they’re capable of annotating the data accurately according to project requirements.

To confirm annotation is being performed correctly throughout the entire process, data will need to be checked for quality. This is commonly done through auditing. Auditors go through the same pre-screening process to prove they can meet project requirements, and their checks prevent poorly labeled data from being used to train a model.

Subjective vs. Objective Quality

It’s important to note quality metrics aren’t always definitive. They can be placed under two categories, subjective and objective.

Some examples of use cases:

  • Objective use cases: classification and segmentation
  • Subjective use cases: relevance ranking and sentiment analysis

Objective use cases typically have straightforward answers. Examples are asking if an image contains French fries or asking for a bounding box to be placed around bicycles. Variation in answers is commonly seen in subjective use cases. Examples include asking a person if the result they see is relevant to the topic they searched for, or asking if written content they’re reviewing contains a positive message. With these examples, each person will give slightly different answers because no two humans are alike. Subjective metrics therefore gather a consensus on how the user interacting with the item or program will likely perceive it. F1 and quiz-based quality metrics are great for dealing with more objective datasets, while inter-rater reliability excels with subjective datasets.
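One simple way to gather a consensus on a subjective item is a majority vote, with the share of raters who agreed reported alongside it. The sketch below assumes this approach; the function name and the sentiment labels are illustrative only.

```python
from collections import Counter
from typing import Sequence, Tuple


def consensus(ratings: Sequence[str]) -> Tuple[str, float]:
    """Return the most common rating and the share of raters who chose it."""
    if not ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(ratings)
    # On a tie, most_common returns the label that was seen first.
    label, votes = counts.most_common(1)[0]
    return label, votes / len(ratings)


label, agreement = consensus(
    ["positive", "positive", "neutral", "positive", "negative"]
)
```

In this example three of five raters chose "positive", so that becomes the consensus label with an agreement share of 0.6. A low agreement share can flag items that need more raters or clearer guidelines.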

How Appen Helps

We have an extensive crowd of over 1 million contributors across the globe who are accustomed to working on projects with all types of data, producing quality results, and working with auditors. All our project and program managers are experienced in dealing with all quality metric types and will work with you to make sure your project reaches the desired goals. Our Appen Data Annotation Platform (ADAP) is capable of gathering data and completing annotation for your metric of choice.

Not sure which quality metric is best to use for your next project? Reach out and we’ll be happy to help you decide what to use.  

