The Role of Data in Responsible AI: Data Decisions that Shape the Future of Ethical AI

There’s no question that artificial intelligence (AI) will continue its rapid evolution in the coming years and become increasingly interconnected with our daily lives. The onus is now on companies to approach AI through a responsible lens in order to maximize transparency, reduce bias, and guide ethical applications of the technology. AI that works well, after all, works equitably for everyone. Decisions made now on responsible policies and protocols will determine the future of AI and, in turn, how AI will shape our future. Data plays a foundational role in these efforts; it’s at the core of every AI technology and directly influences model performance. A model is only as good as the data used to train it, which is why data is a key area where AI practitioners can make a real difference when determining governance practices.

It’s All in the Data

When working on an AI project, data scientists will spend the majority of their time on data collection and annotation. In working through these tasks, there are three areas of utmost importance: protecting data privacy, mitigating bias in the data, and ethically sourcing data.

Data Privacy

As an AI practitioner, you should treat data privacy and security as a top concern. There’s already legislation in this area, and your organization’s data handling protocols should remain consistent with it. For example, internationally recognized ISO standards exist around protecting personal information, the GDPR (General Data Protection Regulation) covers data management in the EU, and other requirements are present worldwide. Your business must follow the standards that apply in every location where it has customers. In some areas of the globe, data protection regulations may be inconsistent or absent; regardless, committing to responsible AI means adopting data security management practices that protect the people who supply your data. You should seek consent from individuals before using their data and implement security measures to protect any personally identifiable information from being used inappropriately. If you’re unclear on what types of security protocols you should incorporate into your data management practices, you may consider working with a third-party data provider that already has these in place and has the expertise to guide you through safe data processing.
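As a rough illustration of the two practices above (using only data with consent and protecting personally identifiable information), here is a minimal Python sketch. The field names (`name`, `email`, `consent`, `clip`), the `PII_FIELDS` list, and the placeholder key are all hypothetical, chosen only for this example. Keyed hashing of this kind is pseudonymization rather than anonymization, so it is one possible safeguard among several and not a complete compliance measure.

```python
import hashlib
import hmac

# Hypothetical field names; adjust to your own schema.
PII_FIELDS = ("name", "email")


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a PII value with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_records(records, secret_key):
    """Keep only records with recorded consent and pseudonymize their PII fields."""
    prepared = []
    for record in records:
        if not record.get("consent", False):
            continue  # no recorded consent: leave this record out
        cleaned = dict(record)  # copy so the original record is not modified
        for field in PII_FIELDS:
            if field in cleaned:
                cleaned[field] = pseudonymize(cleaned[field], secret_key)
        prepared.append(cleaned)
    return prepared


# Made-up example records, for illustration only.
records = [
    {"name": "Ada", "email": "ada@example.com", "consent": True, "clip": "a.wav"},
    {"name": "Bob", "email": "bob@example.com", "consent": False, "clip": "b.wav"},
]
# Assumption: in practice the key would come from a secrets manager, not source code.
safe = prepare_records(records, secret_key=b"replace-with-a-managed-secret")
print(len(safe))
```

Because the same key always maps the same value to the same digest, records from one contributor can still be grouped without storing the raw name or email.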

Data Bias

Biased data = biased outcome. It’s a simple fact of AI development, but it becomes much more complex when you imagine all of the ways in which bias can be unintentionally introduced into AI models. Let’s take an example: you’re building a speech recognition model, perhaps for use in a car. Speech itself can have various tones, accents, filler words, and grammar (not to mention different languages and dialects). Assuming you want your speech recognition model to work for drivers of different demographics and backgrounds, you’ll need data that represents each of these use cases. If you collect data with mostly male voices, your speech recognition model will generally have trouble recognizing female voices, because the model will not have been exposed to enough of that type of data during training. In fact, this is exactly what has happened with a few popular speech-based products. The challenge, then, is curating a dataset that’s complete and fair: one that covers all use cases and edge cases. Creating an AI product that works the same for every user starts with ensuring all of those users are represented in the training data.
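One simple way to catch the imbalance described above before training is to measure how each group is represented in the dataset. The Python sketch below is illustrative only: the `speaker_gender` attribute, the expected groups, and the 30% threshold are assumptions made for this example, not recommended values, and a real audit would look at many attributes (accent, age, language, dialect) together.

```python
from collections import Counter


def representation_report(samples, attribute, expected_groups, min_share):
    """Return each expected group's share of the data and the groups below min_share."""
    if not samples:
        raise ValueError("samples must not be empty")
    counts = Counter(sample[attribute] for sample in samples)
    total = len(samples)
    # Groups that never appear get a share of 0.0 so they are flagged as well.
    shares = {group: counts.get(group, 0) / total for group in expected_groups}
    underrepresented = sorted(g for g, share in shares.items() if share < min_share)
    return shares, underrepresented


# Made-up dataset: 8 clips from male speakers, 2 from female speakers.
samples = [{"speaker_gender": "male"}] * 8 + [{"speaker_gender": "female"}] * 2

shares, flagged = representation_report(
    samples,
    attribute="speaker_gender",
    expected_groups=["female", "male"],
    min_share=0.3,  # hypothetical threshold, chosen only for illustration
)
print(shares)
print(flagged)
```

A flagged group is a signal to collect more data for that group before training; equal shares alone do not guarantee that the model performs equally well for everyone.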

Data Sourcing

In this context, we’re talking about ethical sourcing of data with respect to the treatment of the people who provide and prepare the data. Ideally, if you provide data, you should be compensated for it (and be aware you’re providing it). Compensation can be in the form of money or services exchanged. The reality is that a lot of data is harvested without our knowledge, and the line is often blurry on who even owns the data. If you’re on a video call for your company, for example, who would own the voice data produced from that call? Your company? The video call provider? The individual speakers? Boundaries can get confusing quickly. In any case, companies committed to responsible AI should be transparent about who they’re collecting data from, what kind of data, and when, and make an effort to compensate the individuals for their data appropriately.

Procuring data isn’t always the problem, though. Getting the data into a usable state is frequently a challenge. You need many people cleaning and filtering data to ensure it’s valuable to your project, and then you’ll need more people to annotate that data with accurate labels. These people must be treated fairly: that includes fair pay, open lines of communication, confidentiality, and comfortable working conditions. The legislation in this space mostly consists of laws prohibiting modern slavery, but companies can do a lot more to ensure their data annotators are treated ethically. At Appen, for instance, we rely on our global crowd of workers for high-quality annotation, and have created a Crowd Code of Ethics to document our commitment to their well-being.

Shaping the Future of AI with Data

Companies have a responsibility to make AI decisions today that will drive positive outcomes for both businesses and society in the future. Data governance in particular has a significant impact on the overall ethicality of any AI endeavor, as data bias and data management are key factors in the responsible application of the technology. As an AI practitioner, your goal should be to set up a data governance framework that reflects the key tenets of responsible AI. In doing so, you’ll be contributing to a fairer technology, one that better reflects the diversity of our society.