Bias is everywhere. You can’t avoid it, and it isn’t always a bad thing. You can liken data bias to personal preference. Say you and your friend are getting ice cream and you love chocolate. Your preference, or in this case your bias, is that chocolate ice cream is superior, so you order that. Your friend doesn’t like chocolate and gets mint. You like your choice and think it’s better, but not everyone in the world shares your ice cream preference.
Extending that concept to the world of tech, it’s important to acknowledge bias when sourcing and collecting data to train AI models. We all have different preferences, so the AI models we develop need to account for the full range of people who will use them. Diversity in data helps technology built on the AI model work for every person, with all users and preferences represented in the possible outcomes.
The phrase “less is more” doesn’t apply to training data. When a model is trained on a limited dataset, bias occurs. If a significant portion of the data comes from just one source, the chance that the data isn’t fully representative is high. If you have an AI model that needs data covering several topics, and the data provided focuses heavily on one topic, then the AI model will learn that this one topic is likely the solution, when in reality it probably isn’t. Providing excess data on that one topic creates a bias that signals the topic is almost always the solution.
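One way to spot this kind of imbalance early is to measure how much of the dataset each source or topic contributes. The sketch below is illustrative only: the `source` and `topic` fields, the helper names, and the 50% threshold are assumptions made for the example, not a standard.

```python
from collections import Counter


def share_by_key(records, key):
    """Return each value's share of the dataset for the given key."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def dominant_values(records, key, max_share=0.5):
    """List values whose share exceeds max_share (an arbitrary example threshold)."""
    shares = share_by_key(records, key)
    return [value for value, share in shares.items() if share > max_share]


# Hypothetical records: three of the four come from the same source.
records = [
    {"source": "forum_a", "topic": "billing"},
    {"source": "forum_a", "topic": "billing"},
    {"source": "forum_a", "topic": "shipping"},
    {"source": "survey_b", "topic": "returns"},
]

print(share_by_key(records, "source"))
print(dominant_values(records, "source"))
```

A source or topic that shows up in the second list is a signal to go find more data elsewhere before training.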
Every aspect of the AI lifecycle presents opportunities for us to fine-tune our model and prevent bias from occurring. That’s why one of the biggest conversations happening around AI is ethics. The goal of ethical AI is to make artificial intelligence algorithms and automations that are globally representative, equitable, and used responsibly. There are many facets to the ethical AI conversation, and it all starts with ethical data sourcing.
We’re continuing our series of articles about AI data at every stage of the AI lifecycle. The four stages of the AI data lifecycle are:
- Data sourcing
- Data preparation
- Model training and deployment
- Model evaluation by humans
As we continue to explore ethics at each stage of the AI data lifecycle, we will include ways in which we ensure we’re meeting our ethical goals. Not all data is created equal, and as more and more companies scale AI projects, using the right, ethically sourced data becomes even more important to success.
The focus of this article is data sourcing and how making every effort to eliminate bias in the sourcing stage can set your model up for success.
Ethics: What It Is and Why It Matters
Ethical questions about data begin with sourcing and continue through how the data is used. When we say “data ethics,” we’re talking about how data is gathered, where and from whom it’s gathered, how it will be used, and who will have access to it. A big concern in data ethics is the use of private or personally identifiable information (PII), especially protected information such as health or financial information. You can’t just gather PII from any source. You need special permission to collect it, and you have to abide by strict rules so you don’t violate any privacy laws. By ensuring data is ethically sourced and stored properly, you’re protecting people’s personal data from being shared publicly.
How to Ensure Your Data Is Ethically Sourced
The first step in data sourcing is to identify the specific topic of information you need your data to contain. Once that’s figured out, ask yourself the following questions:
- Where is the data coming from?
- Who will see and work with the data? Is it free from PII?
- How will the data be used in the future?
Basically, you need to know where the data you’re working with comes from. Make sure it’s coming from a variety of sources, so your data isn’t heavily focused on just one aspect. Chances are the data you sourced will be looked at by a variety of people working to create the AI model, so you’ll want to ensure PII is removed and people’s protected info is kept safe. Ethical AI isn’t just about keeping data free of bias. It’s also about holding only the data you need. If you source data that isn’t needed for the project, it can jeopardize your AI model or even your company, as people may think you have ulterior motives for collecting it.
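As a rough illustration of those two ideas, keeping only the fields a project needs and masking PII in what remains, here is a minimal sketch. The regular expressions, the placeholder tokens, and the `minimize` helper are assumptions for the example. They only catch simple email addresses and US-style phone numbers, so a real pipeline would need much broader detection and human review.

```python
import re

# Simplified example patterns; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text):
    """Replace simple email and phone patterns with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def minimize(record, needed_fields):
    """Keep only the fields the project needs and redact any string values."""
    return {
        field: redact_pii(value) if isinstance(value, str) else value
        for field, value in record.items()
        if field in needed_fields
    }


# Hypothetical raw record with more information than the project requires.
raw = {
    "review": "Great service! Reach me at jane@example.com or 555-123-4567.",
    "rating": 5,
    "home_address": "12 Example Street",
}

clean = minimize(raw, needed_fields={"review", "rating"})
print(clean)
```

Dropping `home_address` entirely, rather than masking it, reflects the point above: data the project doesn’t need is better not kept at all.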
What Happens After Sourcing?
After all the necessary data is gathered, it’s vital to go over it and check for any gaps. Going back to the ice cream example, let’s say your friend who likes mint made an AI-powered app to suggest ice cream flavors based on a user’s mood. If they primarily used data on mint ice cream or variations of it, and used less data on chocolate because they didn’t like it, then the app would suggest mint ice cream for most moods. While mint can be the answer to certain ice cream moods, the app clearly doesn’t account for people who enjoy other flavors. The result is an app that won’t work for a significant portion of people who otherwise would have enjoyed using it.
By reviewing the data to check for gaps, you can gather more data in areas that are lacking. This will ensure your completed model works for anyone who would like to use it.
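A gap check like this can be as simple as counting examples per category and comparing the counts against the categories you expect to cover. The sketch below uses the ice cream example. The flavor list, the counts, and the minimum of five examples are made-up values for illustration.

```python
from collections import Counter


def find_gaps(labels, expected, min_count):
    """Return expected categories that have fewer than min_count examples."""
    counts = Counter(labels)
    return {label: counts[label] for label in expected if counts[label] < min_count}


# Hypothetical dataset that leans heavily toward mint flavors.
flavors = ["mint"] * 8 + ["mint chip"] * 5 + ["chocolate"] * 2
expected = ["mint", "mint chip", "chocolate", "vanilla"]

gaps = find_gaps(flavors, expected, min_count=5)
print(gaps)
```

Here chocolate is underrepresented and vanilla is missing altogether, so those are the areas where more data should be gathered.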
Use Synthetic Data to Supplement Your Data Needs
Whether you’re having trouble finding certain datasets to train your model or want a sure-fire way to avoid using data that contains PII, synthetic data is an ideal alternative. Synthetic data is artificially created rather than captured from real-world sources.
Synthetic data is quickly becoming an ideal solution, not only for generating PII-free data, but also for creating data for scenarios that may not have happened in the real world yet. It can also be generated faster, which can reduce the total time needed in the data sourcing stage and help you complete your AI model sooner.
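To make the idea concrete, here is a toy sketch of generating synthetic records for the ice cream app, balanced across flavors and containing no real user information. Everything in it is an assumption for illustration: the field names, the flavor and mood lists, and the purely random pairing of moods with flavors. Production synthetic data usually comes from simulations or generative models tuned to realistic distributions, not from random choice alone.

```python
import random

# Hypothetical categories for the ice cream example.
FLAVORS = ["chocolate", "mint", "vanilla", "strawberry"]
MOODS = ["happy", "tired", "stressed"]


def make_synthetic_records(n_per_flavor, seed=0):
    """Create artificial records with an equal number of examples per flavor."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    records = []
    for flavor in FLAVORS:
        for i in range(n_per_flavor):
            records.append({
                "user_id": f"synthetic-{flavor}-{i}",  # made-up ID, not a real person
                "mood": rng.choice(MOODS),             # random placeholder value
                "flavor": flavor,
            })
    rng.shuffle(records)
    return records


records = make_synthetic_records(n_per_flavor=3)
print(len(records))
```

Because the records are created rather than collected, they contain no PII by construction, and the balance across flavors is something you choose rather than something you have to hope the source data provides.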
At Appen, one of our founding principles is supporting the creation of artificial intelligence with ethical data. With our work and expertise in AI data, we find it important to share what we’ve learned so we can help to make the AI landscape a better, more ethical space. Read the first article in our ethical AI series focused on data preparation to learn more.