How to Create Training Data for Computer Vision Use Cases

For simple computer vision projects, such as recognizing a pattern in a group of images, publicly available image datasets will usually suffice to train your machine learning models. But for more complex CV applications, how can you get the vast amounts of training data you need to create an accurate solution? In this post, we explain training data requirements for computer vision use cases like video understanding, autonomous driving, security surveillance systems, and medical image diagnostics. For any real-world computer vision application, the key to success is the right quality and quantity of training data.

How do I create the right kind of dataset for my project?

You will need to collect as much real-world image data as possible for your use case scenarios, namely annotated images or videos. Depending on the complexity or safety requirements of the solution, this might mean collecting and annotating millions of images. While existing open-source datasets like ImageNet and COCO are always a good place to start, the more data samples you’re able to collect for your specific use case, the better.

If the use case doesn’t require very specific or proprietary data, some companies opt to purchase existing datasets from vendors. When no existing body of data is available, most companies opt to work with training data providers like Appen. For example, we can deploy our global crowd of workers to collect image and video data using our mobile recording tools, per our customers’ specific scenario requirements, and annotate large volumes of existing image and video data.

With a large, diverse dataset to learn from, your ML model will be more robust and better at identifying subtleties and avoiding false positives. This is especially important for solutions like autonomous driving, which must accurately distinguish a small child playing in the street from a grocery bag blowing in the wind. In this scenario, there may be color, size, and shape similarities that could confuse your CV system if it isn’t adequately trained.

How much training data do I need for a computer vision solution?

So how many images do you need annotated to train your system? The short answer is that it can range from several thousand to millions of images, depending on the complexity of the computer vision or pattern recognition scenario. For example, if your CV solution needs to categorize eCommerce products into a relatively small number of coarse-grained categories (e.g., shirts, pants, shoes, socks, dresses), you may need only several thousand images to train it. For a more complex taxonomy, such as classifying images into thousands of fine-grained categories (men’s running shoes, girls’ fashion heels, baby shoes, etc.), the system might require several million correctly labeled images.

How can I improve the quality of my training data?

Image annotation is vital for a wide range of computer vision applications, including robotic vision, facial recognition, and other solutions that rely on machine learning to interpret images. To train these solutions, metadata must be assigned to the images in the form of identifiers, captions, or keywords. In most cases, a human touch is necessary to correctly resolve the nuances and ambiguity that often occur in complex images like traffic camera footage and photos of crowded city streets.

Appen’s image annotation tool incorporates AI to significantly improve our image annotation workers’ efficiency. The AI-assisted tool takes a first pass at outlining the objects in the task. For example, if the annotation task is to mark out the shape of all cars in an image, the tool will automatically draw outlines or bounding boxes around each car, and the worker only needs to adjust a few points of a car shape if it’s not perfectly aligned. This process is much faster and more efficient than asking a worker to draw each car shape from scratch.

You must also make sure your training data covers as many of the real-world scenarios your CV system might encounter as possible, so that it can succeed in a real-world setting. To this end, there are some quite straightforward methods for enriching your image data. Common ways to help your ML model cope with the nuances it will encounter in the real world include rotating or cropping images, as well as changing color and light values. Manipulating your data in this way is a simple but effective means of improving your CV system’s performance.
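As a minimal sketch of the rotating, cropping, and light-value changes described above, the example below works on a tiny grayscale image stored as nested lists so that it runs without any dependencies. The function names, the `augment` policy, and the 0.7 to 1.3 brightness range are illustrative assumptions, not part of any particular tool; a real pipeline would more likely use an imaging library.

```python
import random

Image = list[list[int]]  # grayscale image: rows of pixel values in 0..255


def rotate_90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def adjust_brightness(img: Image, factor: float) -> Image:
    """Scale every pixel by factor, clamping to the valid 0..255 range."""
    return [[max(0, min(255, round(p * factor))) for p in row] for row in img]


def augment(img: Image, rng: random.Random) -> Image:
    """Apply a random rotation and brightness change (hypothetical policy)."""
    out = img
    for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
        out = rotate_90(out)
    return adjust_brightness(out, rng.uniform(0.7, 1.3))


image = [[10, 20, 30],
         [40, 50, 60]]
rotated = rotate_90(image)                # 3 rows x 2 columns
cropped = crop(image, 0, 1, 2, 2)         # right-hand 2 x 2 window
brighter = adjust_brightness(image, 2.0)  # every pixel doubled
```

Note that when images carry annotations such as bounding boxes, geometric changes like rotation and cropping have to be applied to the annotations as well, or the labels will no longer line up with the pixels.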

What are the different methods for AI/ML modeling?

Different types of AI/ML modeling methods can consume different types of training data. For the purpose of this discussion, the primary differentiator between data types is the degree to which the data has been labeled. Labeling (annotating) image data provides the context the algorithm needs to learn. There are four kinds of ML modeling methods:
  1. Supervised Learning means the model is trained on a labeled dataset.
  2. Semi-Supervised Learning trains on a mix of labeled and unlabeled data, usually a small amount of labeled data with a large amount of unlabeled data.
  3. Unsupervised Learning uses cluster analysis to group unlabeled data. Instead of responding to feedback, cluster analysis identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data.
  4. Reinforcement Learning is a machine learning technique that enables a model to learn in an interactive environment by trial and error using feedback from its own actions and experiences.
The most successful CV systems are usually trained on large amounts of high-quality labeled data with a supervised approach, for example deep learning. The type of learning model you choose for your project will depend largely on your use case and on available resources like budget and personnel.
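To make the difference between the first and third methods concrete, here is a small sketch that contrasts a supervised nearest-centroid classifier with an unsupervised two-cluster grouping. The single-number "features", the `night`/`day` labels, and all function names are made up for illustration; real CV models learn from far richer image features.

```python
from statistics import mean

# Toy 1-D "image features" (e.g., average brightness); values are made up.
features = [0.1, 0.2, 0.15, 0.9, 0.8, 0.85]
labels = ["night", "night", "night", "day", "day", "day"]


def fit_centroids(xs, ys):
    """Supervised: use the labels to compute one centroid per class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}


def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda c: abs(centroids[c] - x))


def two_means(xs, iters=10):
    """Unsupervised: split xs into two clusters without looking at labels.

    Assumes xs contains at least two distinct values.
    """
    lo, hi = min(xs), max(xs)  # initial cluster centers
    for _ in range(iters):
        near_lo = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        near_hi = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(near_lo), mean(near_hi)
    return near_lo, near_hi


centroids = fit_centroids(features, labels)
guess = predict(centroids, 0.3)          # uses what the labels taught it
dark_group, bright_group = two_means(features)  # groups, but cannot name them
```

The supervised model can say that a new image looks like "night" because the labels supplied that context. The clustering step finds the same two groups but has no names for them, which is why labeled data matters so much for the supervised approach.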

How can you avoid label bias when training image data?

One obstacle that can negatively impact the accuracy of a machine learning model is bias in the training data. There are several causes of bias that your team should look out for when training your ML model. Label bias is a common issue for supervised learning projects. It occurs when the dataset used to train your model doesn’t accurately reflect the situations in which the model will operate. Because it would be nearly impossible to account for every single situation an ML model might encounter, it’s important that your sample training data is not only relevant to your project, but also represents the diversity of the real world as much as possible. In other words, the distribution of your training data needs to match the distribution of real-world data. To this end, it’s important to consider data distribution factors in your CV training data such as seasonal and trending signals, as well as the geographical distribution of data sources. Not accounting for these variables can produce biased data.
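One simple way to check whether a training set matches real-world data on a factor such as season is to compare the two category distributions directly. The sketch below uses total variation distance for that comparison; the season metadata and the sample counts are hypothetical, and the choice of metric is our assumption rather than a prescribed method.

```python
from collections import Counter


def distribution(values):
    """Map each category to its share of the samples."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def total_variation(p, q):
    """Half the sum of absolute differences; 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical metadata: the season in which each image was captured.
train_seasons = ["summer"] * 80 + ["winter"] * 20
real_seasons = ["summer"] * 50 + ["winter"] * 50

gap = total_variation(distribution(train_seasons), distribution(real_seasons))
```

Here the gap comes out at roughly 0.3, which signals that winter images are underrepresented in training. The same check can be repeated for geography or any other factor you record alongside your images.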

What are some available data labeling strategies?

Computer vision technologies have seen enough adoption at this point that a variety of labeling strategies have emerged, each with its own range of requirements and results.

A generative adversarial network (GAN) is an ML technique made up of two neural networks that compete with one another in a zero-sum game framework. GANs typically run unsupervised, teaching themselves how to mimic any given distribution of data. This strategy is low-cost and produces large amounts of data, but it can result in noisy data and requires in-house AI experts to set the system up.

Another method is weakly labeled data from user behavior signals. This labeling strategy can produce large datasets at reduced cost compared to other labeling approaches, but it can also produce noisy data, and it requires a large number of users actively interacting with an existing AI solution, with all their activities carefully logged.

Traditional crowdsourcing approaches to labeling data quickly produce large amounts of data for little cost, but can sometimes produce low-quality results, which can negatively impact the accuracy of your machine learning system.

Active learning is another strategy, in which a learning algorithm interactively queries the user to obtain the desired outputs at new data points. This approach saves on costs while providing the most informative data for labeling, and typically results in high-quality output.
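A common way to implement the active learning idea is uncertainty sampling: send human annotators the images the current model is least sure about. The sketch below ranks images by the entropy of the model’s predicted class probabilities. The image ids, probabilities, and function names are hypothetical, and entropy is only one of several possible uncertainty measures.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less sure)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` images the model is least certain about."""
    ranked = sorted(predictions, key=lambda i: entropy(predictions[i]), reverse=True)
    return ranked[:budget]


# Hypothetical model outputs: image id -> class probabilities.
predictions = {
    "img_001": [0.98, 0.01, 0.01],
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.55, 0.44, 0.01],
    "img_004": [0.90, 0.05, 0.05],
}
to_label = select_for_labeling(predictions, budget=2)
```

With these numbers, `img_002` and `img_003` are selected, since the model spreads its probability across several classes for them. Confident predictions such as `img_001` are left out, so the labeling budget goes to the examples most likely to teach the model something new.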

How does Appen do data labeling?

Appen offers a fully managed, turn-key data labeling solution for its customers. Combining active learning with crowdsourcing, Appen employs tens of thousands of workers at any given time, resulting in rapid delivery times for labeling large amounts of data. To enable fast time-to-market for your CV project, we utilize an AI/ML-assisted, highly efficient data collection and labeling method, as well as an AI/ML-assisted project management process. Additionally, we provide a training data insight report and data augmentation services to make sure you have the best training data for your computer vision projects such as image or video annotation. The Appen solution features several key process components to help ensure the highest level of data quality:
  • Data clustering/distribution analysis and visualization
  • Data abnormality detection
  • Data bias removal strategy
  • Data automatic augmentation strategy
  • Data labeling instruction recommendation
With comprehensive, easy-to-implement data labeling and project management services, we offer an end-to-end solution that can quickly provide the foundation you need to make your CV solutions as accurate as possible. Are you currently using artificial intelligence to make smarter decisions, build innovative solutions, and deliver better customer experiences? Contact us to learn how Appen can help, or learn more about how we can help you get reliable training data for machine learning.