- Training budgets are growing: Organizations are using machine learning/AI methods in more and more places, which means more training data. For companies with fewer than 5,000 employees, the amount of training data more than doubled from 2015 to 2016. For companies with more than 5,000 employees, it grew fivefold.
- Changes in your business usually require getting new training data: A machine learning system only knows what it’s trained on. So if you are launching new products/services or entering new markets, you’ll want to plan for more training data in the two months after launch. If you can figure out a way to get relevant data before you launch, that’s even better.
- Plan for 63,000 training items per month: Remember how I started with a bunch of caveats? This is the main one. Five of the companies I’m reporting on get more than 121,000 training items per month. The lower bound is more like 14,000 items per month.
- Get a commitment to have in-house experts review categories and examples once a quarter: Businesses change over time, so you want to make sure that stakeholders still agree on which categories are important to track and that you’re defining them consistently. This is also a good chance to show them both exemplars of the categories and some of the most difficult items.
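To turn the monthly figures above into something you can put in a budget request, here is a minimal sketch that scales them to a quarter and a year. The only inputs are the three numbers quoted in the list; the names `MONTHLY_ITEMS` and `project_items` are made up for illustration, and the "high" figure is a threshold that five companies exceed, not a ceiling.

```python
# Monthly training-item figures quoted in the list above.
# "high" is the level that five of the reported companies exceed,
# so treat it as a floor for heavy users rather than a maximum.
MONTHLY_ITEMS = {"low": 14_000, "typical": 63_000, "high": 121_000}


def project_items(monthly, months):
    """Scale each monthly figure to a planning horizon of `months` months.

    This assumes a flat monthly volume, which is a simplification:
    launches and new markets will push some months higher.
    """
    return {name: value * months for name, value in monthly.items()}


quarterly = project_items(MONTHLY_ITEMS, 3)
annual = project_items(MONTHLY_ITEMS, 12)

for name in MONTHLY_ITEMS:
    print(f"{name:>8}: {quarterly[name]:>9,} per quarter, {annual[name]:>10,} per year")
```

A flat projection like this is only a starting point. If you know a launch is coming, add the extra volume for the two months after it on top.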
How to pilot
Brand new machine learning projects usually create about 131,000 training items in the first quarter after launch (top quartile: 309,000, bottom quartile: 12,000). Those are the numbers, but what matters more is how you get meaningful results. The three things to keep in mind are:
- Plan for pilots to be iterative: you almost certainly won’t get things right on your first try. Launch a small subset and analyze what you get back. You’ll probably need to adjust the instructions or other parts of the experimental design, so plan for a couple of iterations.
- Make sure the data fits the problem: there’s a business problem to be solved, and the data has to be appropriate for it. I know one company that wanted to mine YouTube comments for sales leads for their very, very high-tech equipment. There are some interesting techniques for finding needles in haystacks, but there still have to be needles there to find.
- As soon as you can, schedule annotation lunches: once you understand what the project, data, and categories are, grab a conference room and in-house experts to annotate the data. Get three people to judge each item, so you can report on their inter-annotator agreement. If your experts can’t do a task, how can machines or other people do it?
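If you want a number to report after an annotation lunch, one common choice for three or more annotators is Fleiss' kappa. The post doesn't name a specific agreement measure, so treat this as one reasonable option rather than the method used for the figures above. The sketch below assumes every item was judged by the same number of people; the function name and the `lead` / `not_lead` labels are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per annotator).

    Assumes every item has the same number of labels and at least two annotators.
    Returns 1.0 for perfect agreement and about 0.0 for chance-level agreement.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of labels")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: for each item, the share of annotator pairs that agree.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    observed = sum(per_item) / n_items

    # Expected agreement by chance, from the overall share of each category.
    total = n_items * n_raters
    shares = [sum(cnt[cat] for cnt in counts) / total for cat in categories]
    expected = sum(p * p for p in shares)

    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (observed - expected) / (1 - expected)


# Hypothetical lunch session: four comments, three experts each.
session = [
    ["lead", "lead", "lead"],
    ["not_lead", "not_lead", "not_lead"],
    ["lead", "not_lead", "lead"],
    ["not_lead", "not_lead", "lead"],
]
kappa = fleiss_kappa(session)
print(f"Fleiss' kappa: {kappa:.2f}")
```

A low score from your own experts is useful information in itself: it usually means the category definitions or the instructions need another pass before you send the task to anyone else.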