Data is the lifeblood of artificial intelligence. It plays a central role in the development and efficacy of AI systems, fueling their ability to learn, adapt, and make informed decisions. However, the availability of natural data, the raw material AI systems need in order to improve, is becoming increasingly limited. Natural data is information derived from the real-world environment, unprocessed or only minimally processed. Unlike synthetic data, which is artificially generated or manipulated, natural data captures the complexity and variability of real life, allowing AI systems to learn from authentic scenarios and situations and making it critical to their accuracy and versatility. Its growing scarcity poses significant challenges for AI companies, potentially leading to a reckoning if the data crisis is not effectively addressed.
According to Alice Desthuilliers, a Sr. Product Manager at Appen, the value of natural data lies in its imperfections, approximations, and varied tones. “It’s interesting because we usually seek perfect data, but when algorithms tackle real-life problems, they must understand that real life is far from perfect.”
She further explains, “During a recent collaboration on developing prompts and answers, the client emphasized the importance of grammatical perfection for their brand image. The prompts themselves, however, were considered less significant, and it could even be beneficial for them to reflect how people truly write and think. This includes typos, abbreviations, and unclear syntax, among other things.”
Researchers have been sounding the alarm about the finite nature of natural data for nearly a year, and the implications are sobering. The warning signs suggest that this dwindling supply of natural data could have adverse consequences and even impede the growth of the industry as a whole. In a field of seemingly infinite possibilities, natural data is starting to look like a fossil fuel: a finite resource being consumed faster than it can be replenished. Finding innovative ways to replenish the natural data supply has become a pressing concern for the future of AI.
The Looming Data Shortage
The AI industry has grown exponentially in recent years, driven in part by the accumulation of vast amounts of data, the advancement of Large Language Models (LLMs), and hardware enhancements enabling advanced compute capabilities. However, as Rita Matulionyte, an information technology law professor at Australia’s Macquarie University, points out in an essay for The Conversation, AI researchers have been warning about the shortage of high-quality, natural data sources for nearly a year. According to one study by Epoch, AI firms could run out of such data as early as 2026, with low-quality text and image data wells potentially running dry anytime between 2030 and 2060. This is a serious issue for AI companies, given the huge amounts of data needed to train and improve their models.
The shortage of natural data is a precarious situation for the AI industry. Models have advanced tremendously as developers have poured in more and more data. If the supply of natural data stagnates, so too might the models and, by extension, the industry itself, whether for-profit or non-profit. The scarcity of natural data sources could hinder AI models’ development, which could significantly impact the business world, where companies are racing to develop and integrate AI solutions into their organizations.
The Growing Demand for Data in the AI Economy
With the rise of AI technologies and generative AI applications, there has been an exponential increase in the demand for large quantities of data. This is because AI systems need vast amounts of diverse data to learn, make predictions, and improve their performance over time. Unfortunately, natural data is not only limited but also subject to strict regulations and privacy concerns. As a result, companies are turning to alternative data sources, such as synthetic or simulated data, which may not accurately reflect real-world situations. This poses a significant risk to the reliability and effectiveness of AI systems.
This insatiable demand for natural data has also led to concerns over privacy and ethics. The Cambridge Analytica scandal exposed how data mining and manipulation can be used for political purposes, raising questions about the ethics of using personal data for AI development. As consumers become more aware of their privacy rights and governments begin to regulate data collection, there is a growing pushback against companies that collect and use consumer data without consent.
The Threat of Bias
A consequence of the scarcity of natural data is the potential for AI systems to develop biased algorithms, which can lead to discriminatory decision-making and perpetuate existing social inequalities. In the absence of diverse and inclusive datasets, AI systems may learn from biased data and replicate those biases in their decision-making processes. This could have detrimental effects on society, ranging from discrimination in hiring practices to unfair treatment in legal proceedings. The lack of natural data further exacerbates this issue, making it difficult to identify and address biased algorithms.
It is crucial to prioritize data diversity and inclusivity to ensure the responsible development and deployment of LLMs that uphold fairness and equity in their decision-making processes.
Techniques to Mitigate the Data Crisis
A common belief holds that an abundance of data results in more powerful models, but this may not be the case. Research indicates that well-curated data is more effective for training models than vast quantities of low-quality data.
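As a deliberately simplified sketch of what “well-curated” can mean in practice, the snippet below deduplicates a batch of text samples and drops low-signal fragments. Real curation pipelines add far more (language identification, perplexity scoring, toxicity filtering); the function name and threshold here are illustrative, not any production system's.

```python
def curate(samples, min_words=4):
    """Hypothetical curation pass: drop near-empty fragments and
    exact duplicates, keeping the original text of each survivor."""
    seen = set()
    kept = []
    for text in samples:
        # Normalize whitespace and case so trivially-duplicated
        # samples collapse to the same key.
        norm = " ".join(text.lower().split())
        if len(norm.split()) < min_words:
            continue  # too short to carry training signal
        if norm in seen:
            continue  # exact duplicate of something already kept
        seen.add(norm)
        kept.append(text)
    return kept

raw = [
    "The delivery arrived two days late and the box was damaged.",
    "the delivery arrived two days late and the box was damaged.",  # duplicate
    "ok",                                                           # fragment
    "Support resolved my billing issue quickly and politely.",
]
print(curate(raw))
```

The point is only the shape of the idea: quality filtering shrinks the dataset but removes samples that would add noise rather than signal.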
For example, some data-intensive AI companies leverage synthetic data to train new models to address the issue of data scarcity. Theoretically, this technique resolves the shortage by replacing large volumes of natural data with machine-generated information. Nonetheless, it poses potential risks:
- The generated data is often suboptimal, leading to biased models that underperform in real-world scenarios.
- Synthetic content can corrupt a given model, compromising its performance instead of enhancing it.
- The integrity of AI-generated data could easily be called into question, given that even AIs trained on human-generated material are known to make major factual errors and mistakes.
- Repeated use of fabricated data can erode data quality and lead to model failures.
- Employing a machine to train another machine to react to real-life data prompts raises ethical concerns. It may lack representation for emerging trends, mercurial consumer habits, current events and niche applications.
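The erosion described in the list above can be illustrated with a toy simulation: fit a Gaussian to a sample, then train each new “generation” only on data drawn from the previous generation's fit. Because the maximum-likelihood variance estimate (`statistics.pstdev`) is biased low, the distribution's spread decays over generations, a deliberately simplified analogue of recursive-training degradation; all parameters here are illustrative.

```python
import random
import statistics

def collapse_sigma(generations=300, n=50, seed=0):
    """Toy model of training on your own synthetic output: each
    generation refits (mean, stdev) from n samples of the previous
    fit. The MLE variance estimate divides by n, not n - 1, so in
    expectation the spread shrinks every generation."""
    random.seed(seed)
    mu, sigma = 0.0, 1.0  # the "real-world" distribution we start from
    history = [sigma]
    for _ in range(generations):
        synthetic = [random.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.mean(synthetic)
        sigma = statistics.pstdev(synthetic)  # biased-low estimate
        history.append(sigma)
    return history

history = collapse_sigma()
print(f"initial spread: {history[0]:.3f}, "
      f"after {len(history) - 1} generations: {history[-1]:.3f}")
```

Real model collapse involves far more than a variance estimator, but the mechanism rhymes: each synthetic generation under-represents the tails of the one before it.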
While a useful stopgap, synthetic data does not comprehensively resolve the data supply issue. It may prove effective for specific applications, such as facial recognition, but may not yield the desired results in areas like natural language processing. As a result, businesses must adopt a more targeted approach to data generation, prioritizing quality over quantity. Some experts argue for using less data per model, while others propose that a combination of naturally occurring and synthetic data is more appropriate.
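One way to read the blended approach is as a cap on the synthetic share of a training pool, rather than training on synthetic data alone. The sketch below caps synthetic examples at 30% of the pool; that figure and the function are illustrative assumptions, not a recommendation from any cited study.

```python
import random

def blend(natural, synthetic, synthetic_share=0.3, seed=0):
    """Sketch of mixing natural and synthetic training examples:
    take all natural examples and only enough synthetic ones to
    hit the target share of the combined pool."""
    rng = random.Random(seed)
    # Solve syn / (nat + syn) = share for the synthetic count.
    n_syn = min(len(synthetic),
                round(len(natural) * synthetic_share / (1 - synthetic_share)))
    pool = natural + rng.sample(synthetic, n_syn)
    rng.shuffle(pool)  # avoid ordering effects during training
    return pool

natural = [("natural", i) for i in range(7)]
synthetic = [("synthetic", i) for i in range(10)]
pool = blend(natural, synthetic)
print(sum(1 for kind, _ in pool if kind == "synthetic"),
      "synthetic of", len(pool))
```

The design choice worth noting is that natural data is never discarded; synthetic data only tops up the pool to the chosen ratio.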
Identifying alternative sources of natural data and extracting valuable insights from them represents another potential solution. For example, customer feedback data can aid in training LLMs in natural language processing (NLP). By analyzing customer interactions, businesses can develop LLMs that can understand and respond to user queries, facilitating more efficient customer service and support.
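To make the customer-feedback idea concrete, the snippet below converts raw support-ticket exchanges into prompt/response records of the kind used for supervised fine-tuning. The field names (`customer_message`, `agent_reply`, `prompt`, `response`) are hypothetical and do not follow any particular vendor's schema.

```python
import json

def to_finetune_records(tickets):
    """Hypothetical conversion of support tickets into training
    records: keep only complete question/answer exchanges and
    emit them as prompt/response pairs."""
    records = []
    for ticket in tickets:
        prompt = ticket.get("customer_message", "").strip()
        response = ticket.get("agent_reply", "").strip()
        if not prompt or not response:
            continue  # skip incomplete exchanges
        records.append({"prompt": prompt, "response": response})
    return records

tickets = [
    {"customer_message": "How do I reset my password?",
     "agent_reply": "Use the 'Forgot password' link on the sign-in page."},
    {"customer_message": "My order never arrived.",
     "agent_reply": ""},  # unanswered ticket, dropped
]
for record in to_finetune_records(tickets):
    print(json.dumps(record))
```

In a real pipeline this step would also need consent checks and removal of personal information before any feedback data reached a training set.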
Some companies are even exploring collaborations with other firms to share data resources. Uber recently implemented this approach successfully, open-sourcing its dataset for autonomous driving research and enabling developers worldwide to contribute improvements. Such collaborative approaches have the potential to democratize data access and provide greater volumes of the representative training data LLM developers need.
A Tried-and-True Solution
As we navigate the data crisis, Appen emerges as a reliable partner, offering a multitude of solutions tailored to meet the specific needs of each industry partner. Our structural strength, extensive manpower, and domain expertise equip us to build customized datasets for niche domains or even rare languages. We are committed to mitigating the data challenges of today, driving AI initiatives forward, and paving the way for a robust, data-driven future.
AI continues to revolutionize industries, and the demand for natural data will only increase. The data shortage isn’t just a technical issue—it’s a business issue. Firms that take the lead in addressing it will be the ones that thrive in a data-driven economy. As responsible and ethical creators of artificial intelligence, we must address the critical issue of data scarcity. This includes being proactive in identifying potential sources of natural data, investing in targeted data generation techniques, and collaborating with other firms to democratize access to data resources.
Businesses must take steps to mitigate bias in their AI systems by prioritizing diverse and inclusive datasets. By doing so, we can ensure that AI remains a tool for progress rather than a means of perpetuating social inequalities. As an industry leader in data annotation and collection, Appen is committed to providing cutting-edge solutions that address the data challenges of today and pave the way for a more equitable and inclusive future. The responsibility to create unbiased AI lies with all of us.