Appen Machine Learning FAQ

Explore this machine learning FAQ for an overview of machine learning and artificial intelligence, including details about different methods and how you can invest.

What is machine learning?

Machine learning is the process of teaching a machine how to learn by giving it guidance that helps it develop logic on its own and access to the datasets you want it to explore. The result is some form of artificial intelligence (AI).

“Despite its name, there is nothing ‘artificial’ about this technology – it is made by humans, intended to behave like humans and affects humans. So if we want it to play a positive role in tomorrow’s world, it must be guided by human concerns.”

  • Fei-Fei Li on “human-centered AI”, New York Times

How does machine learning work?

Computers follow rules, also known as algorithms. When they first begin learning, they are given an initial set of data to explore. That data is called training data.

Computers start to recognize patterns and make decisions based on algorithms and training data. Depending on the type of machine learning being used, they are also given targets to hit or receive rewards when they make the right decision or take a positive step towards their end goal. As they build this understanding or “learn,” they work through a series of steps to transform new inputs into outputs which may consist of brand-new datasets, labeled data, decisions, or even actions.

The idea is that machines learn enough to operate without any human intervention. In this way, they start to develop and demonstrate what we call artificial intelligence. Machine learning is one of the main ways artificial intelligence is created.

Other examples of artificial intelligence include robotics, speech recognition, and natural language generation, all of which also require some element of machine learning. There are many different reasons to implement machine learning and ways to go about it. There are also a variety of machine learning algorithms and types and sources of training data.


Why is machine learning growing so quickly?

In recent years, three things have contributed to the widespread interest in machine learning.

  1. Growth in all types of data
  2. Declining cost of storage
  3. Massive improvements in computing power

As with anything, there is evidence of other contributing factors and business drivers, but these three advances have clearly been dominant in terms of paving the way for accelerated use of machine learning and new and innovative applications of artificial intelligence.


Why invest in machine learning?

Organizations in both the public and private sector are investing in machine learning because it allows them to improve in the following ways:

  • Speed. Get answers and perform sophisticated calculations faster.
  • Power. Process more data and conduct more complex analytics than ever before.
  • Intelligence. Uncover new insights by tapping into real-world data previously indecipherable.
  • Efficiency. Conduct more analysis with fewer human resources.

No matter what industry you are in, you will probably find a solid use case for machine learning and be able to justify the investment through the anticipated return to your top line, your bottom line, or both.

Machine learning has been proven to reduce and even eliminate manual data entry, detect spam, fight fraud, and recommend products. It can predict when maintenance will be needed on equipment and infrastructure, tell you more about your customers than you have ever known before, and improve customer satisfaction.

If you have not already invested in machine learning, you need to ask yourself: why not? 


What is machine learning used for?

The use cases for machine learning are vast, diverse, and still being explored, so we’ll highlight the application of machine learning in several common fields.

Retail and eCommerce

Artificial intelligence and machine learning are being used to boost conversion rates, improve customer experience, deliver personalization, and more.

  • Search relevance. Online shoppers do not have the luxury of asking a salesperson where they can find a product. Your onsite search engine fulfills that role. By interpreting search queries, assessing user intent, and using that information to train your search algorithm, you make search results more relevant, which leads to higher purchase conversion.
  • Personalization. Providing recommendations to shoppers or search results based on their past behavior can help create stronger user engagement and retention.
  • Enhanced customer service. Chatbots act as a virtual shopping assistant. Like an employee, they need to be trained to know not only what you sell, but also the terminology people use for the many products on your site.


Search engines and other leading technology companies use machine learning to train their AI, deliver innovative products, and improve user experience.

  • Search relevance. Search engine algorithms use machine learning to drive stronger user engagement. By interpreting queries and assessing user intent, search results become more relevant, which creates higher user satisfaction.
  • Personalization. Analyzing data activity and preferences can help search engine and social media providers personalize content feeds and recommendations, enhancing online customer experiences.
  • Natural language processing (NLP). NLP can analyze language patterns to understand text that might use colloquialisms or other natural patterns on social media, for example. This technology can be used to track customer sentiment and develop engagement strategies.


Financial Services

Leaders in financial services use machine learning and artificial intelligence to improve customer acquisition, retention, and overall experience.

  • Risk management. Anti-money laundering (AML), Know Your Customer (KYC), and fraud detection programs require sophisticated tools to spot potential threats. Relying solely on human employees to spot patterns in financial records can be both time-consuming and costly. Machine learning and artificial intelligence allow financial institutions to sift through data and find anomalies quickly, preventing illegal activity and saving potential company losses.
  • Revenue generation. Machine learning algorithms are now being leveraged by financial institutions to create investment strategies, freeing up financial advisors to engage more with their clients.
  • Enhanced customer experiences. With today’s expectation for on-demand customer service, chatbots have a crucial role to fill. Chatbots help to delight customers with real-time feedback and a streamlined experience.


Accelerate machine learning with training data for self-driving cars and improve speech-recognition systems, in-car navigation, and user experience with more accurate field testing.

  • Autonomous vehicles. Self-driving cars are extremely complex machines, and the neural networks that guide them are trained with machine learning. As the car moves forward, it processes a lot of visual data, just like a driver does when looking out the windshield. Vehicles need to assign meaning to large volumes of image data, such as identifying a tree or a pedestrian, and that labeled data is then fed back into the car’s AI to teach it.
  • Voice recognition. Traditional dashboards and mobile devices take a driver’s hands and eyes off the road. Speech interfaces do not. Connected cars need access to large-scale speech data collections to train the speech interface, providing consumers around the world with the best user experience.
  • Predicting behavior. Advances in voice recognition and cameras that can help track driver emotions are an important next step in Human Machine Interface, giving cars the ability to identify speakers’ emotions as well as their words—so they can tell when users get frustrated and respond accordingly.


Improve emergency response, defense initiatives, and law enforcement with secure data services.

  • Defense. Using social media monitoring, computer vision, and data annotation, government agencies are now able to extract information to aid with terrorist surveillance, monitor national security threats, and more.
  • National emergencies. Emergencies like natural disasters or coordinated attacks can happen without a moment’s notice. When lives are at stake, responding immediately and with coordination is key. Using translation, voice recognition, and text data collection, emergency responders around the world can use machines to communicate efficiently with those in harm’s way.
  • Law enforcement. Secure transcription allows law enforcement to accomplish many objectives, including capturing files from Body Worn Video, official record-keeping, and archival record solutions.


Exciting uses of artificial intelligence (AI) and machine learning in healthcare are transforming patient care.

  • Predictive analysis. Evaluate trends, anticipate outbreaks, and forecast patient needs.
  • Chatbots and virtual healthcare. Provide faster and better customer service.
  • Advances in underwriting. Use machine learning to build stronger underwriting models based on a wide variety of data points.

“Most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake. We need to solve the unsupervised learning problem before we can even think of getting to true AI.”

  • Yann LeCun, Director of AI Research, Facebook


What are the top machine learning methods?

Supervised Learning

Supervised learning algorithms are designed to determine predictive models based on examples—or training data. These datasets contain input variables paired with correct output variables. The algorithm is then tasked with analyzing the data and producing a function that accurately maps the inputs to their corresponding outputs. Once trained, the algorithm can go on to predict the results for any new data it is given.

  • Classification – Classification is the easiest to understand. The data is evaluated to determine which class it falls into. An example might be a machine learning model that asks a machine to determine if a picture is of a horse or not. That is a simple yes/no response and an example of binary classification. After providing training data with enough pictures of horses and non-horses that the machine can learn the distinguishing characteristics of a horse, the machine will be able to look at a picture on its own and tell you if it is a horse or not.
  • Regression – Instead of separating the data and assigning it to a class, the machine is asked to predict a response or output based on the responses it got from the initial training data. An easy-to-follow example: if the initial inputs of 3 and 5 had a target of 8, the learned logic would be to add the two inputs. The model would then use regression analysis to predict a target of 10 for inputs of 4 and 6.

Supervised learning is task-oriented; i.e. “Find me XYZ target.”
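The regression example above can be sketched in a few lines of Python. This is a minimal illustration rather than a production method: it assumes the target is a weighted sum of the two inputs with no intercept, and it adds three made-up training pairs alongside the (3, 5) → 8 pair from the text, because a single example is not enough to pin down the weights. The function name `fit_two_input_linear` is hypothetical.

```python
def fit_two_input_linear(examples):
    """Least-squares fit of target ~ w1*a + w2*b (no intercept).

    `examples` is a list of ((a, b), target) pairs. The 2x2 normal
    equations are solved directly with Cramer's rule.
    """
    s_aa = sum(a * a for (a, b), _ in examples)
    s_bb = sum(b * b for (a, b), _ in examples)
    s_ab = sum(a * b for (a, b), _ in examples)
    s_at = sum(a * t for (a, b), t in examples)
    s_bt = sum(b * t for (a, b), t in examples)
    det = s_aa * s_bb - s_ab * s_ab
    if det == 0:
        raise ValueError("inputs are collinear; more varied examples are needed")
    w1 = (s_at * s_bb - s_ab * s_bt) / det
    w2 = (s_aa * s_bt - s_ab * s_at) / det
    return w1, w2


# Training data: the first pair comes from the text, the rest are
# assumed examples that follow the same "add the inputs" rule.
training_examples = [((3, 5), 8), ((1, 2), 3), ((4, 1), 5), ((2, 7), 9)]

w1, w2 = fit_two_input_linear(training_examples)
prediction = w1 * 4 + w2 * 6  # new inputs the model has not seen
print(prediction)  # prints 10.0
```

Both learned weights come out as 1.0, which is the "add the two inputs" logic described above, recovered from examples instead of being written by hand.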

Semi-Supervised Learning

Semi-supervised learning is a hybrid model. Algorithms using semi-supervised learning are trained on a combination of labeled and unlabeled data. This approach can be more practical because having a data scientist or data engineer label data is expensive. Other times this approach is taken because the data is so massive that labeling it all would be a herculean task. Another reason teams take a hybrid approach is to reduce the human bias that can creep in during data labeling.

“It is a capital mistake to theorize before one has data. Insensibly, one begins to twist the facts to suit theories, instead of theories to suit facts.”

  • Sherlock Holmes

With semi-supervised learning, your model may benefit and be able to work faster by having some targets or labeled data, and the work it does to make sense of the unlabeled data may reveal insights and provide you with outputs you hadn’t discovered yet. It is a win-win in many scenarios and an often used approach.
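One common way to combine labeled and unlabeled data is self-training, sometimes called pseudo-labeling. The text above does not name a specific technique, so treat the following as one possible sketch under simple assumptions: one-dimensional data, a nearest-centroid classifier, and a hypothetical `margin` that decides when a guess is confident enough to keep. All names and data values are made up for illustration.

```python
def fit_centroids(points):
    """Return the mean value of each class from (value, label) pairs."""
    means = {}
    for label in {lab for _, lab in points}:
        vals = [v for v, lab in points if lab == label]
        means[label] = sum(vals) / len(vals)
    return means


def predict(means, value):
    """Assign the label whose centroid is closest to the value."""
    return min(means, key=lambda label: abs(value - means[label]))


def self_train(labeled, unlabeled, margin=1.0, rounds=5):
    """Grow the labeled set with confident guesses on unlabeled values."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(rounds):
        means = fit_centroids(labeled)
        confident, unsure = [], []
        for v in remaining:
            dists = sorted(abs(v - m) for m in means.values())
            # Keep a guess only if the nearest centroid is clearly closer
            # than the runner-up; otherwise leave the value unlabeled.
            if len(dists) < 2 or dists[1] - dists[0] >= margin:
                confident.append((v, predict(means, v)))
            else:
                unsure.append(v)
        if not confident:
            break
        labeled.extend(confident)
        remaining = unsure
    return fit_centroids(labeled), remaining


# Two hand-labeled values plus a larger pool of unlabeled ones.
labeled = [(1.0, "small"), (9.0, "large")]
unlabeled = [1.5, 2.0, 2.5, 8.0, 8.5, 7.5, 5.2]

means, still_unlabeled = self_train(labeled, unlabeled)
print(means, still_unlabeled)
```

The ambiguous value 5.2 sits almost midway between the two groups, so it stays unlabeled instead of being forced into a class. That is the kind of case a human reviewer would still need to resolve.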

Reinforcement Learning

Reinforcement learning is the most abstract approach and is based entirely on the machine, often referred to as the “learning agent,” learning through trial and error. The machine determines which actions to take in order to maximize its performance in a given environment, based on the definition of a reward it has been given. Trying out actions to see what they earn is called exploration. Acting on the knowledge it has gained about which actions earn rewards is called exploitation.

Through exploration and exploitation of its environment, the learning agent, fueled by advanced machine learning algorithms, ultimately gains enough knowledge to begin to demonstrate almost human-like levels of artificial intelligence.

Robots provide a familiar example of reinforcement learning. Their use in factories relies heavily on their ability to use reinforcement learning to adapt as needed to their environment and complete human-like tasks and behaviors with steadily falling error rates.
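The balance between exploration and exploitation can be illustrated with an epsilon-greedy agent on a multi-armed bandit, a deliberately simplified stand-in for the richer environments a factory robot faces. In this sketch the reward probabilities, the `epsilon` value, and the function name `run_bandit` are all assumptions chosen for illustration.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: explore at random with probability epsilon,
    otherwise exploit the action with the best estimated reward."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # times each action was tried
    values = [0.0] * len(reward_probs)  # running average reward per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # exploration
        else:
            action = max(range(len(values)), key=values.__getitem__)  # exploitation
        # The environment pays a reward of 1 with the action's hidden probability.
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        # Incremental update of the average reward seen for this action.
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward
    return values, counts, total_reward


# Hypothetical environment: three actions with hidden success rates.
values, counts, total_reward = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
print(best_action, counts)
```

The agent is never told the success rates. It estimates them from the rewards it receives, and over time it spends most of its steps on the action that pays best while still occasionally trying the others.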


What kind of data do you need for machine learning?

“Machine learning can only be as good as the data you use to train it.”

  • Daniel Tunkelang, Led machine learning projects at Endeca, Google, LinkedIn

There is no end to the number of articles that speak to the importance of making sure you have enough of the right data to support your machine learning projects.

As Tunkelang, quoted above, goes on to explain in the article Machine Learning: 10 Facts Everyone Needs to Understand, “you can have machine learning without sophisticated algorithms, but not without good data.”

So what kind of data do you need? It depends.

Structured vs. Unstructured Data

  • Structured Data – Structured data is logically organized and easy for a computer to read and understand. It could be machine-generated transactional data pulled from an ERP or CRM system or simple time-stamped data about actions coming from sensors. It could also be human-generated data input into a spreadsheet. This type of data is most often used in supervised learning and it can typically be processed very quickly, even with incredibly large volumes.
  • Unstructured data – According to industry leaders, more than 80% of the data in the world is unstructured, and the amount of data is growing exponentially. Unstructured data is everywhere. Human-generated unstructured data includes Microsoft Word and other text files, presentations, videos, images, audio, social media posts, and much more. Examples of machine-generated unstructured data include surveillance footage, satellite imagery, and scientific data. Supervised and reinforcement learning are incredible tools that can be applied to gain insights and do more with unstructured data than ever before.


How much data is required for machine learning?

The short answer is: a lot. The best algorithm in the world will struggle to yield the right results with insufficient data.

“AI techniques require models to be retrained to match potential changing conditions, so the training data must be refreshed frequently. In one-third of the cases, the model needs to be refreshed at least monthly, and almost one in four cases requires a daily refresh.”

Why? Greater volume drives greater accuracy.

There are many reasons for that. One reason is that for most machine learning models, you are trying to get a computer to make sense of a dataset with an incredible amount of variation.

As an example, consider voice recognition applications and variation in speech caused by differences in gender, age, dialects, and more. Some experts say that a model needs at least 10,000 hours of audio to deliver outputs with modest accuracy levels. Others say that while the total volume of data required depends on the complexity of the model or the problem, 100,000 instances is a minimum requirement for most models.


Does “quality” matter?

Yes! Maybe even more than quantity.

“More data beats clever algorithms, but better data beats more data.”

  • Peter Norvig, Computer Scientist, Google and Industry Leader

What makes data “bad”? It could be irrelevant to your problem, inaccurately annotated, misleading, or incomplete. In these cases, it will require some data cleaning or preparation.

If your model is tasked with classifying data, your training data may have to be properly labeled first. Sometimes formatting is an issue. For example, if you are working with image data those images may need to be resized so the model analyzes vectors of the same length.
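The resizing step mentioned above can be sketched as follows. This is a simplified illustration that assumes grayscale images stored as nested lists of pixel values and uses nearest-neighbor sampling. Real projects would normally rely on an image library, and the function names here are hypothetical.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a grayscale image (list of rows) by nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def to_vector(image, out_h=2, out_w=2):
    """Resize an image to a fixed shape, then flatten it into one vector."""
    resized = resize_nearest(image, out_h, out_w)
    return [pixel for row in resized for pixel in row]


# Three made-up images with different shapes.
small = [[1, 2],
         [3, 4]]
big = [[0, 0, 9, 9],
       [0, 0, 9, 9],
       [5, 5, 7, 7],
       [5, 5, 7, 7]]
wide = [[1, 1, 1, 2, 2],
        [3, 3, 3, 4, 4],
        [5, 5, 5, 6, 6]]

vectors = [to_vector(img) for img in (small, big, wide)]
print(vectors)
```

Every image ends up as a vector of the same length (four values here), so a model can treat them as rows of a single training table regardless of their original size.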

Any data that you use will require some clean-up. Experts report that the work does not end with the extracting, transforming, and loading (ETL) of data. Even after that, the clean-up required to make data suitable for data science is often reported to account for around 80% of the total workload in a machine learning project.


Machine Learning FAQ Additional Resources

As use cases continue to expand, you will want to stay up to speed on all the ways you can improve your models and create better products for your customers.


Machine Learning Glossary of Terms

  • Artificial intelligence (AI) – The ability of machines to operate independently to perform tasks and activities that typically require the intelligence of humans.
  • Chatbot – A chatbot is a virtual assistant that communicates with humans by simulating typical conversation threads. It is typically delivered over the internet and embedded into a website or mobile app.
  • Data classification – Data classification is the process of allocating specific categories to data that exhibit the same characteristics, e.g. date, source, type, etc. It can be performed by humans or machines. The goal is to make the data easier to understand and analyze or use.
  • Data labeling – Data labeling is performed by humans and it is the process of adding labels that provide machines with targets used in supervised machine learning models.
  • Machine learning – Machine learning is the process of teaching machines how to learn by providing them with guidance that helps them develop logic on their own and access to data you want them to explore.
  • Reinforcement learning – Reinforcement learning is when a machine or agent is given an environment to explore, a set of rules for how to explore it, and a clear understanding of when it will be rewarded for its performance. As it explores its “environment,” the machine learns through trial and error the most efficient and effective methods for earning rewards and achieving its objective.
  • Structured data – Only 20% of the data in the world is considered to be “structured.” Structured data is organized in a fashion that makes it easy for computers to analyze and interpret. It is typically found in relational databases, spreadsheets, and enterprise systems such as CRM, ERP, and financial applications.
  • Supervised learning – Supervised learning models are some of the simplest and most accurate instances of machine learning in use today. With supervised learning a machine is provided a structured set of data that includes inputs and data that has been labeled as the “target” data or desired output. The machine learns from these examples what logic was used to transform the inputs to outputs so eventually it only needs the inputs and it can create the target outputs independently.
  • Training data – Training data is the data used within a machine learning project to begin the process of teaching the machine the logic, behaviors, or other forms of intelligence targeted for the project. Once the model has been trained, it is tuned and checked against validation data, and before the project is declared a success it is evaluated on held-out test data.
  • Unstructured data – Unstructured data makes up roughly 80% of the data in the world and is not organized in a fashion that makes it easy to interpret or analyze. Examples include text and chat messages, recorded audio, videos, and social media posts.
  • Unsupervised learning – Unsupervised learning is when the data the machine is given has not been labeled. It is the job of the machine and the model to find the correlations, patterns, or relationships among the data and deliver those insights as output.