How to Approach Data Collection for Conversational AI Agents

Training Conversational AI Agents on Noisy Data

Chatbots, virtual assistants, robots, and more: conversational artificial intelligence (AI) is already highly visible in our daily lives. Companies looking to increase engagement with customers while reducing costs are investing heavily in the space. The numbers are clear: the conversational AI agents industry is expected to grow 20% year over year through at least 2025. By that time, Gartner predicts that organizations that leverage AI in their customer engagement platform will increase operational efficiency by 25%.

The global pandemic has only accelerated these expectations, as conversational AI agents have been critical to businesses navigating a virtual world while still wanting to remain connected with customers. Conversational AI helps companies overcome digital communication’s impersonal nature by providing a tailored, humanized experience for each customer. These changes redefine the way brands engage and will undoubtedly become the new normal, even post-pandemic, given the successful proof-of-concept.

Building conversational AI for real-world applications is still challenging, however. Mimicking the flow of human speech is extremely difficult. AI must account for different languages, accents, colloquialisms, pronunciations, turns of phrase, filler words, and other variability. This effort requires a vast collection of high-quality data. The problem is that this data is often noisy, filled with irrelevant entities that can obscure intent. Understanding the role data plays, and the mitigation steps for managing noisy data, is essential to reducing errors and failure rates.

Data Collection and Annotation for Conversational AI Agents

To understand the complexities of creating a conversational agent, let’s walk through a typical process for building one with voice capabilities (such as Siri or Google Home).
  1. Data Input. The human agent speaks a command, comment, or question captured as an audio file by the model. Using speech recognition machine learning (ML), the computer converts this audio to text.
  2. Natural Language Understanding (NLU). The model uses entity extraction, intent recognition, and domain identification (all techniques for understanding human language) to interpret the text file.
  3. Dialogue Management. Because speech recognition can be noisy, statistical modeling is used to map out distributions over the human agent’s likely goal. This is known as dialogue state tracking.
  4. Natural Language Generation (NLG). Structured data is converted into natural language.
  5. Data Output. Text-to-speech synthesis converts the natural language text from the NLG stage into audio output. If accurate, the output addresses the human agent’s original request or comment.
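The dialogue state tracking described in step 3 can be sketched as a Bayesian update over candidate goals. This is a minimal illustration rather than the method of any particular product: the goal names, the uniform prior, and the likelihood values below are all hypothetical.

```python
def update_belief(prior, likelihood):
    """Bayes update: posterior(goal) is proportional to prior(goal) * P(observation | goal)."""
    unnormalized = {goal: prior[goal] * likelihood.get(goal, 0.0) for goal in prior}
    total = sum(unnormalized.values())
    if total == 0:
        # The observation is incompatible with every goal; keep the prior belief.
        return dict(prior)
    return {goal: p / total for goal, p in unnormalized.items()}


# Hypothetical goals with a uniform prior before the user has said anything.
belief = {"track_order": 1 / 3, "find_store": 1 / 3, "view_lists": 1 / 3}

# Assumed likelihoods of a noisy ASR transcript under each goal (illustrative numbers).
turn_likelihood = {"track_order": 0.6, "find_store": 0.3, "view_lists": 0.1}

belief = update_belief(belief, turn_likelihood)
best_goal = max(belief, key=belief.get)
```

Keeping a full distribution, instead of committing to a single interpretation, lets later turns correct an early speech recognition mistake.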
Let’s explore NLU a bit further, as this is a critical step in managing noisy data. NLU typically requires the following steps:
  1. Define Intents. What is the human agent’s goal? For example, “Where is my order?”, “View lists”, and “Find store” are all examples of intents, or purposes.
  2. Utterance Collection. Different utterances working toward the same goal must be collected, mapped, and validated by data annotators. For example, “Where’s the closest store?” and “Find a store near me” have the same intent but are different utterances.
  3. Entity Extraction. This technique is applied to parse out critical entities in the utterance. If you have a sentence like, “Are there any vegetarian restaurants within 3 miles of my house?”, then “vegetarian” would be a type entity, “3 miles” would be a distance entity, and “my house” would be a reference entity.
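The three NLU steps above can be sketched in a few lines. This is a toy illustration under stated assumptions: the intent names, the example utterances, and the regular-expression entity patterns are hypothetical, and token overlap stands in for the trained classifiers a production system would use.

```python
import re

# Steps 1 and 2: hypothetical intents, each with annotator-validated utterances.
INTENT_UTTERANCES = {
    "find_store": ["where's the closest store", "find a store near me"],
    "track_order": ["where is my order", "track my package"],
}

# Step 3: hypothetical entity patterns (real systems usually learn these).
ENTITY_PATTERNS = {
    "type": re.compile(r"\b(?:vegetarian|vegan|italian)\b"),
    "distance": re.compile(r"\b\d+(?:\.\d+)?\s*(?:miles?|km)\b"),
    "reference": re.compile(r"\bmy (?:house|office|location)\b"),
}


def tokenize(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def classify_intent(utterance):
    """Return the intent whose example utterance has the highest Jaccard overlap."""
    tokens = tokenize(utterance)
    best_intent, best_score = None, 0.0
    for intent, examples in INTENT_UTTERANCES.items():
        for example in examples:
            example_tokens = tokenize(example)
            union = tokens | example_tokens
            score = len(tokens & example_tokens) / len(union) if union else 0.0
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent


def extract_entities(utterance):
    """Return the first match for each entity pattern found in the utterance."""
    text = utterance.lower()
    entities = {}
    for label, pattern in ENTITY_PATTERNS.items():
        match = pattern.search(text)
        if match:
            entities[label] = match.group(0)
    return entities


intent = classify_intent("Is there a store near me?")
entities = extract_entities(
    "Are there any vegetarian restaurants within 3 miles of my house?"
)
```

Here `intent` resolves to the hypothetical `find_store` label, and `entities` holds the type, distance, and reference entities from the restaurant example.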
Given these steps, what are the challenges in designing dialogue? First, there’s no straightforward way to collect human intents that is universal for everyone. Second, it’s difficult to model real-world conversation flow, which varies by geography, age, and other individual factors. Finally, data collection can be noisy and costly.

Much automatic speech recognition (ASR) data contains noise, where the machine misunderstands specific words or phrases in the audio file. For example, “I would like one” becomes “I would like I’m on,” which is meaningless. Human speech is also natural and unscripted; we often use filler words that are irrelevant to our intent. “Oh yeah, I think, yeah, this is better” has many unneeded filler phrases that can cloud the interpretation of meaning. Humans also vary widely in their phrasing, depending on their location, upbringing, and experience.

When we look at the stats on noisy data, we find that AI is correct in an average of 53% of cases. In 30% of cases, AI makes minor errors. In 17% of cases, AI makes significant errors, demonstrating that noisy data is still a problem for businesses launching conversational AI agents.
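One simple mitigation for filler words is to strip them before intent recognition. The sketch below assumes a small hand-written filler list, which is hypothetical; a real system would derive such a list from annotated data.

```python
import re

# Assumed filler list for illustration only.
FILLERS = ["oh yeah", "i think", "you know", "yeah", "um", "uh", "well"]

# Try longer phrases first so "oh yeah" is removed before the bare "yeah".
_FILLER_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(f) for f in sorted(FILLERS, key=len, reverse=True))
    + r")\b"
)


def strip_fillers(utterance):
    """Lower-case the utterance, drop punctuation, and remove filler phrases."""
    text = re.sub(r"[^\w\s']", " ", utterance.lower())
    text = _FILLER_RE.sub(" ", text)
    return " ".join(text.split())


cleaned = strip_fillers("Oh yeah, I think, yeah, this is better.")
```

A rule like this only helps with fillers. It cannot repair recognition errors such as “I would like I’m on,” and it can remove meaningful words (“I think” is sometimes part of the intent), so it is best treated as one pre-processing step alongside human validation.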

Designing Dialogues for Social Robots

In many cases, the goal is to enable a conversational agent to interact with humans as a peer, not as a device. This means communicating using speech and gesture, providing useful services, and leveraging natural language to maintain a natural conversation flow. How do we then develop social robots that can interact with people?

One way to approach creating a social robot with personality is through flowchart-based visual programming. Flowchart blocks represent back-end functions, such as talking, shaking hands, and moving to a point. They catalog the flow of interaction. Content authors can use the flowchart to easily combine speech, gesture, and emotion to build engaging interactions. Erica (the ERATO Intelligent Conversational Android) was built using this method. Her content authors iteratively added content over months to develop her as a character and not just a question-answering device. She can now complete over 2,000 behaviors and over 50 topic sequences.

Another approach to designing a social robot is teleoperation. The Nara Experiment employed a robot at the Nara, Japan, tourist center to act as a tour guide for visitors. Human tour guides created offline content for the robot (for example, background information on the local Todaiji Temple), and engineers programmed the robot with the information ahead of time. The team contrasted this method with teleoperation. When a human-in-the-loop teleoperator controlled the robot remotely, results were more accurate than when the robot relied on offline data. The problem was that the method wasn’t very scalable, content entry was slow and error-prone, and it was challenging to control multimodal behaviors.

While these are interesting case studies, they prompt questions about more scalable alternatives to dialogue design. Would it not be more efficient to collect in-situ data from real human-to-human interactions?
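To make the flowchart idea concrete, here is a minimal sketch of how blocks and their connections might be represented and executed. It is not Erica's actual authoring system; the `Block` structure, the action names, and the example flow are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One flowchart block: a back-end action plus outgoing edges keyed by outcome."""
    action: str                      # e.g. "say", "gesture", "move_to"
    argument: str
    next_block: dict = field(default_factory=dict)  # outcome label -> next block name


def run_flow(blocks, start, outcomes):
    """Walk the flowchart from `start` and return the (action, argument) steps taken.

    `outcomes` maps a block name to the outcome observed there, standing in
    for live sensor input. Blocks without an entry follow the "default" edge.
    """
    steps, current, visited = [], start, set()
    while current is not None and current not in visited:
        visited.add(current)  # guard against accidental loops in the flowchart
        block = blocks[current]
        steps.append((block.action, block.argument))
        current = block.next_block.get(outcomes.get(current, "default"))
    return steps


# Hypothetical greeting flow combining speech and gesture.
FLOW = {
    "greet": Block("say", "Hello, nice to meet you.", {"default": "offer_hand"}),
    "offer_hand": Block("gesture", "extend_hand",
                        {"accepted": "shake", "default": "farewell"}),
    "shake": Block("gesture", "shake_hands", {"default": "farewell"}),
    "farewell": Block("say", "Enjoy your visit."),
}

steps = run_flow(FLOW, "greet", {"offer_hand": "accepted"})
```

Because each block only names its successors, a content author can add a new branch without touching the rest of the flow, which is part of what makes this style approachable for non-programmers.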

Learning by Imitation for Social Robots

If we could crowdsource human behaviors, we could collect higher-quality data more passively and cost-efficiently. We could observe human interactions, abstract typical behavior elements, and generate robot interactions based on them. One team explored the validity of this idea by setting up a camera shop scenario. Let’s walk through their methodology:
  1. Data Collection. The team collected data on the multimodal behaviors of customers and shopkeepers, covering three critical categories: speech, locomotion, and proxemics formation.
     a. Speech: Using automatic speech recognition, the model captured typical utterances (for example, “How many megapixels does this camera have?” or “What is the resolution?”) and used hierarchical clustering to map these utterances to intents.
     b. Locomotion: Sensors captured tracking data on typical locations where humans congregate, such as the service counter, and distinct trajectories, such as from the door to the camera display. Clustering was used to determine the frequencies of each position and trajectory.
     c. Proxemics Formation: Sensors captured typical formations of customer and shopkeeper; for example, face-to-face, or the shopkeeper presenting a product.
     In addition, when a customer spoke or moved, that interaction was discretized into customer-shopkeeper action pairs.
  2. Model Training. The team then trained the model using the customer action (including the utterance, motion, and proxemics) and labeled data of the shopkeeper’s expected response. For example, the customer action might include asking, “How much does this cost?” while facing the shopkeeper; the shopkeeper would then reply, “It’s $300.”
After the team trained the model, they tested the robot on the camera shop floor, where it accurately handled 216 different interactions. While a long way off from being a human replica, the robot in this case study demonstrates the complexities involved in attempting to mimic human speech and behavior.
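The core idea of learning from customer-shopkeeper action pairs can be sketched with a nearest-neighbor lookup. This is a deliberately simple stand-in, not the model the team trained: the observed pairs, the location and formation labels, the similarity weights, and the product details are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical observed pairs: (utterance, location, formation) -> shopkeeper reply.
OBSERVED_PAIRS = [
    (("how much does this cost", "camera_display", "face_to_face"), "It's $300."),
    (("what is the price", "camera_display", "face_to_face"), "It's $300."),
    (("how many megapixels does this camera have", "camera_display",
      "presenting_product"), "It has 24 megapixels."),
    (("what is the resolution", "camera_display", "presenting_product"),
     "It has 24 megapixels."),
]


def tokens(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def similarity(action_a, action_b):
    """Combine utterance token overlap with matches on location and formation."""
    utt_a, loc_a, form_a = action_a
    utt_b, loc_b, form_b = action_b
    ta, tb = tokens(utt_a), tokens(utt_b)
    overlap = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    # The 0.5 weights are arbitrary illustrative choices.
    return overlap + 0.5 * (loc_a == loc_b) + 0.5 * (form_a == form_b)


def predict_response(customer_action, k=3):
    """Vote among the k most similar observed customer actions."""
    ranked = sorted(
        OBSERVED_PAIRS,
        key=lambda pair: similarity(customer_action, pair[0]),
        reverse=True,
    )
    votes = Counter(response for _, response in ranked[:k])
    return votes.most_common(1)[0][0]


reply = predict_response(
    ("How much is this camera?", "camera_display", "face_to_face")
)
```

Even in this toy form, the non-verbal context matters: the location and formation features help separate a price question asked face-to-face from a specification question asked while the shopkeeper is presenting a product.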

Moving Forward with Conversational AI

What do we take away from these examples? Building conversational agents is difficult. Data is noisy and difficult to capture, and imitating human language is a formidable challenge. That’s why it’s essential to design data collection workflows that capture high-quality data. Using an in-situ approach for data collection is best for capturing natural conversation, although more progress is still needed to reduce the error rate further.

Noisy data remains a constant problem. Using ML-assisted validation to reject noisy utterances from the outset, and leveraging abstraction and data-driven techniques, can reduce noise. Unlocking the business value of conversational AI agents will mean investing heavily in data and developing more accurate ML approaches to solving the natural language problem.

At Appen, we have been helping companies successfully create their conversational AI agents, getting them from experiment to full deployment by helping them navigate the complexities of data collection and annotation.