Deciphering AI from Human Generated Text: The Behavioral Approach

One of the most important elements of building a well-functioning AI model is consistent human feedback. When generative AI models are trained by human annotators, they serve as more effective tools for the end user, which in turn helps drive progress towards a brighter future. The more behavioral signals we can measure, the higher the chance we have of creating quality data.

The problem is that as AI tools continue to proliferate, human reviewers could be tempted to use them more and more to accelerate their model-training and data-labeling tasks. AI practitioners are currently debating the potential influence of incorporating AI tools into the feedback loop. That’s why it’s imperative that we find reliable ways to distinguish between AI-generated and human-generated data.

There have been a number of proposals from a range of voices about how to address this issue. Most of them focus on evaluating the final product, using watermarking or analyzing the style of the output. In a recent study, “Clicks Don’t Lie: Inferring Generative AI Usage in Crowdsourcing through User Input Events,” Appen data scientists Arjun Patel and Phoebe Liu share a different perspective. By focusing on user behavior rather than the output itself, our researchers point to a more effective way to identify instances where crowd contributors have used AI tools. Here, we summarize their findings and share a number of insights into how we can face these origin-detection challenges.

Challenges in detecting AI-generated text

Detecting AI-generated text raises several challenges. With the rise of AI and its increasing use in generating text, it has become more difficult to distinguish between human-written and machine-generated content. This poses a significant challenge for companies that rely on accurate data annotation and labeling for their machine learning training and natural language processing tasks.

Rapid Evolution of Generative AI Models

Generative AI models, particularly large language models (LLMs), are advancing at an accelerated pace, generating text, audio, and images that are becoming almost indistinguishable from human-created content. As these models become more sophisticated and more widely available, an increasing number of individuals and entities, including crowdsourced AI trainers, are using them. This swift evolution and adoption make it much harder to distinguish AI-generated outputs from human ones.

Challenges in Curation Amidst AI Expansion

In the study, Liu highlights a pressing concern: “The rapid expansion of generative AI presents a significant challenge in the curation of human-exclusive artisanal data.” The accelerated growth and integration of these models complicate the task of ensuring that data remains purely human-generated, especially when customers request this.

Inherent Limitations in Current AI Detection Methods

While various strategies, like watermarking, aim to simplify the identification of AI-generated content, they come with their own set of challenges. Specifically, the effectiveness of watermarking hinges on access to the original AI model, a requirement that is frequently unattainable.

Research Challenges and the Evolving Nature of AI

Current research predominantly emphasizes detecting AI-produced text by pinpointing linguistic and structural nuances, such as unusual phrasing or specific patterns in sentence structures. Yet, these once-reliable markers can be easily bypassed by simple rewording, especially as AI models refine their outputs, embracing nuanced expressions, idioms, and varied styles. Even OpenAI shut down its own AI detector tool after it became clear that it wasn’t able to reliably deliver on its promise.

The quality of AI-produced content is converging with that of human output. As the distinction between AI and human creations blurs, there is a pressing need for dependable systems that can tell them apart.

Appen study: How user actions reveal AI-created text

Rather than focusing solely on the text output itself, Appen data scientists Patel and Liu proposed a different approach: assessing user behavior throughout the process of text creation. The way humans interact with tools during text generation might hold the key to distinguishing between AI and human-authored copy.

For their study, Appen researchers designed experiments involving crowd workers, asking them to complete tasks in US English, under three named conditions: Human (no outside assistance of any kind allowed), Search (use of search engines like Google was permitted), and AI (use of generative AI tools was permitted).

Every crowd worker in the cohort completed tasks under all three conditions. Across these tasks, the researchers collected behavioral data including keystroke patterns, mouse movement, and time-related events. To support the study’s validity, crowd workers were asked to screen-record their sessions.

The idea behind the study was simple: a human writing on their own might have a different pattern of keystrokes (like the frequency of hitting the backspace) compared to someone manipulating AI-generated text. Similarly, using search engines might involve more copy-pasting, while AI use would involve less typing but more mouse movements outside the application window.
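To make the idea concrete, here is a minimal Python sketch of how a session's input events could be turned into behavioral proportions. The event names, the dictionary layout, and the choice to divide by the total number of logged events are all our assumptions for illustration; the study may define its proportions differently.

```python
from collections import Counter

# Hypothetical event names; a real input logger would define its own.
FEATURES = ("keypress", "delete", "paste", "mouse_offscreen")


def behavior_proportions(events):
    """Return each tracked event type's share of all events in one session."""
    counts = Counter(event["type"] for event in events)
    total = sum(counts.values())
    if total == 0:
        # An empty session has no behavior to measure.
        return {name: 0.0 for name in FEATURES}
    return {name: counts[name] / total for name in FEATURES}


# A toy session: mostly typing, one paste, two trips outside the window.
session = (
    [{"type": "keypress"}] * 6
    + [{"type": "delete"}]
    + [{"type": "paste"}]
    + [{"type": "mouse_offscreen"}] * 2
)
proportions = behavior_proportions(session)
```

In this sketch, a session dominated by typing yields a high keypress share, while a session built around pasted text yields a low one.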

Their findings showed marked differences in user behavior across these three conditions. Under the AI condition, crowd workers exhibited fewer keypresses and deletions, which suggests they were copying and pasting. They also often moved their mouse outside the application window, consistent with the need to access an AI tool in another part of their screen. To make this concrete: responses written without any tools had a higher median keypress proportion (0.918) than responses written with AI (0.374). Responses written with AI also had a lower median deletion proportion (0.028) and a higher median off-screen mouse movement proportion (0.238) than responses written without tools.
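A comparison of medians like this one can be sketched in a few lines of Python. The function below groups sessions by condition and takes the median of one behavioral feature. The field names are hypothetical, and the values are made-up placeholders for illustration, not data from the study.

```python
from collections import defaultdict
from statistics import median


def median_by_condition(sessions, feature):
    """Group sessions by condition and return the median of one feature."""
    grouped = defaultdict(list)
    for session in sessions:
        grouped[session["condition"]].append(session[feature])
    return {condition: median(values) for condition, values in grouped.items()}


# Made-up placeholder values, NOT data from the study.
toy_sessions = [
    {"condition": "Human", "keypress_prop": 0.9},
    {"condition": "Human", "keypress_prop": 0.8},
    {"condition": "Human", "keypress_prop": 1.0},
    {"condition": "AI", "keypress_prop": 0.3},
    {"condition": "AI", "keypress_prop": 0.5},
    {"condition": "AI", "keypress_prop": 0.4},
]
summary = median_by_condition(toy_sessions, "keypress_prop")
```

We use the median rather than the mean here because a handful of unusual sessions would otherwise pull the summary away from typical behavior.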

Under the Search condition, by contrast, crowd workers produced slightly more keypresses and deletions than under the AI condition, but fewer than under the Human condition. This pattern suggests a blend of manual typing and copy-pasting.

Notably, the data made it evident that the AI and Search conditions often produced structured responses, like lists or bullet points. In contrast, the human-generated text flowed more organically and was more likely to include typos.
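One simple way to quantify "structured" responses is to measure how many lines of a response look like list items. The Python sketch below does this with a regular expression. The pattern and the idea of using a line share are our own illustrative assumptions, not the method used in the study.

```python
import re

# Matches lines that start with a bullet character ("-", "*", or a bullet dot)
# or with a number followed by "." or ")", then whitespace and some text.
LIST_LINE = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+\S")


def list_line_share(text):
    """Return the fraction of non-empty lines that look like list items."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    list_lines = sum(1 for line in lines if LIST_LINE.match(line))
    return list_lines / len(lines)


structured = "Tips:\n- Plan ahead\n- Pack light\n3) Book early"
freeform = "I usually just pack whatever fits and hope for the best."
```

A signal like this would be weak on its own, since people write lists too, so it makes more sense as a complement to behavioral data than as a detector.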

The future of determining text origin

The preliminary findings from the study offer a hopeful glimpse into a more reliable system for AI text detection, indicating discernible variations in behavior when crowd workers deploy external tools like these.

As we pave the way for continued exploration in this area of AI development, our team at Appen aims to expand our data collection efforts to encompass a more extensive group of crowd workers. This broadened scope is crucial to cementing the validity of our initial observations. Beyond that, we’ll analyze specific attributes of text created by our crowd contributors, cross-referencing the qualities of the copy itself with the process they used to create it.

“Understanding how AI detectors perform on real-world data is of incredible importance to downstream consumers,” explains Patel. “Verifying that the data created is indeed from humans will prevent leakage of undesirable behaviors from unseen language models to appear after fine-tuning models.”

The Rising Need for Reliable Crowd Contributors in AI Training

There will be an ever-increasing number of AI models that require training, and it’s paramount that we have dependable crowd contributors with their unique insights. As generative AI models produce content that becomes increasingly like human-made data, our focus should shift earlier in the process, targeting user behavior.

Recruitment and Behavioral Monitoring

Recruiting top-tier talent to our crowd from the outset is vital. We want to work with people who are passionate about developing AI to help solve real problems, and who take the task of training these models seriously, without looking for shortcuts. We also understand that it’s key to measure the behavior of our contributors to make sure they stay on track and avoid using AI tools where human feedback is important. Armed with the insights from our study, we can ensure that we understand where our clients’ data originated.

These practices go beyond crowd contributors who are training AI models for business. As a society, it’s imperative that we have a way of understanding where content originates, no matter who, or what, is creating it. We shouldn’t be at odds with our machines: We should work with them to build a progressive future.