Good AI Starts With Good Data

This post originally appeared on App Developer Magazine.

Nowadays, it seems like every company is doing something with AI - or if they’re not, they’d like to be. The technology promises to improve the way we work and live, and industries ranging from manufacturing to retail to inspections, and everything in between, are scrambling to build their own AI solutions. But where to begin?

I like to say that AI is like cooking - it’s all about the ingredients. Without good ingredients, even the best recipes fall flat. The same goes for AI, but here the ingredients are your data. If organizations don’t take a close look at the data they need to develop an AI solution and ensure it’s prepared and organized effectively, the result will be riddled with problems - biased algorithms, ineffective models, or AI that simply doesn’t work.

High-functioning AI begins and ends with good data.

Data: The Good, The Bad And The Ugly

One of the biggest challenges of building deep neural networks (DNNs) is the sheer amount of data required to train them - AI systems don’t just need data to learn about the world, they need hundreds of thousands of times more data than humans do.

Luckily, we humans are currently producing 2.5 quintillion bytes of data each day. The internet is an absolute data gold mine. Unluckily, most of it isn’t fair game, because people are generally unwilling to share their personal data, even if it does mean building better AI systems.

And, if you’re lucky enough to gather enough data, there is still the question of quality. Not all data is created equal. To recognize an object or behavior, AI must be trained on data captured under all kinds of conditions - different lighting, different angles, and so on. Otherwise, algorithmic bias is inevitable.

As data scientist Daniel Shapiro details in a recent post, there are many different data quality pitfalls, including data sparsity, data corruption, irrelevant data, missing important patterns, wrong patterns, and bad labeling.

The Right Data For Computer Vision Solutions

The most successful companies are the ones that are able to break down data silos across their organizations and gather a holistic view of the data they have available. Once they have done this, they can create processes for augmenting that data until it reaches the volume and variety a productized solution requires.
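To make that concrete, here is a minimal augmentation sketch in Python. It assumes Pillow is installed and a local folder of JPEGs; the folder names are purely illustrative, and simple mirroring and brightness jitter are only a stand-in for genuinely varied source imagery.

```python
# Minimal augmentation sketch (assumes Pillow; paths are illustrative).
from pathlib import Path
from PIL import Image, ImageEnhance, ImageOps

SRC = Path("images/raw")        # hypothetical source folder
DST = Path("images/augmented")  # hypothetical output folder
DST.mkdir(parents=True, exist_ok=True)

for img_path in SRC.glob("*.jpg"):
    img = Image.open(img_path)

    # Horizontal mirror approximates seeing the object from the other side.
    ImageOps.mirror(img).save(DST / f"{img_path.stem}_flip.jpg")

    # Brightness jitter approximates different lighting conditions.
    for factor in (0.6, 1.4):
        ImageEnhance.Brightness(img).enhance(factor).save(
            DST / f"{img_path.stem}_bright{factor}.jpg"
        )
```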

This is where the good data lives: they own it, and it is perfectly suited to their specific use case.

People often ask me how much data is needed to create a meaningful solution. Our rule of thumb for a given use case is that 1,000 images/class is the barrier to entry, and, in order to reach production-level accuracy (90%+), 5,000–10,000 images/class are required.
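As a quick sanity check against that rule of thumb, a few lines of Python can count what you actually have. This is just a sketch: it assumes the dataset is laid out as one sub-folder per class, and the thresholds simply echo the figures above.

```python
# Per-class image counts vs. the rule-of-thumb thresholds (illustrative).
from pathlib import Path

MIN_VIABLE = 1_000       # rough barrier to entry per class
MIN_PRODUCTION = 5_000   # lower bound for production-level accuracy

dataset = Path("dataset")  # hypothetical layout: dataset/<class_name>/*.jpg
for class_dir in sorted(p for p in dataset.iterdir() if p.is_dir()):
    count = sum(1 for _ in class_dir.glob("*.jpg"))
    if count < MIN_VIABLE:
        status = "below the barrier to entry"
    elif count < MIN_PRODUCTION:
        status = "enough to prototype, thin for production"
    else:
        status = "production-scale"
    print(f"{class_dir.name}: {count} images ({status})")
```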

However, the issue of quality - even when it seems like there’s an ample quantity of data - prevails. I’ve seen examples of this in the inspection industry, where I’ve been amazed at how many companies’ images capture only one angle of an object or are taken under a single lighting condition. Photos like these aren’t going to give their AI-powered drones the information they need to do their jobs.

In other words, bad photos equal bad drones.

When Good Images Go Wrong

But it’s not just the quality of the photos themselves that matters; there is ample opportunity for good photos to be botched during the tagging process.

Because AI applications require thousands of images to be tagged, human taggers inevitably work imprecisely or introduce outright errors - especially when the tools at hand are simple picture-editing tools, like Microsoft Paint, that weren’t built for this purpose. Even small imprecisions, compounded over thousands of images, can have a large impact on the accuracy of a computer vision model. And if you think about a production-grade product or solution, every percentage point increase in accuracy can have a big impact on the organization.
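To get a feel for how little slop it takes, here is a toy calculation. It compares a hand-drawn bounding box that is off by a handful of pixels against the intended box using intersection-over-union (IoU), the overlap measure detection models are typically scored against; the coordinates are made up.

```python
# Toy example: a few pixels of tagging slop vs. the intended box.
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

intended = (100, 100, 200, 200)   # the box the tagger meant to draw
sloppy = (108, 108, 208, 208)     # the same box, dragged ~8 pixels off
print(f"IoU: {iou(intended, sloppy):.2f}")  # ~0.73 - far from a perfect 1.0
```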

It’s also worth mentioning that, since data tagging cost is proportional to the time spent tagging, this step alone often costs tens to hundreds of thousands of dollars per project.
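The arithmetic behind that figure is straightforward. The sketch below plugs in made-up but plausible numbers - image count, seconds per annotation, hourly rate - purely to show how quickly the total grows.

```python
# Back-of-the-envelope tagging cost; every number here is an assumption.
images = 50_000            # images to be labeled
seconds_per_image = 45     # time to draw and double-check each annotation
hourly_rate = 75.0         # assumed labeler rate in dollars

hours = images * seconds_per_image / 3600
cost = hours * hourly_rate
print(f"{hours:,.0f} hours of tagging, roughly ${cost:,.0f}")
# -> 625 hours of tagging, roughly $46,875
```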

A Good Tagging Tool Is The Key Ingredient

I recently attended a webinar about implementing AI for inspection services. The host spoke about how they are paying fifty to one hundred dollars an hour to have civil engineers do annotation and classification work. They felt they needed industry experts tagging the images, but it was costing them a huge amount of money, and it was their biggest bottleneck.

Data-labeling services like Scale API, Mighty AI, and CloudFactory, which contract with hundreds of labelers, often overseas, are a much more efficient and cost-effective alternative. Companies looking to handle their tagging internally, meanwhile, need a precise, automated, purpose-built annotation tool.

A(I) Recipe for Success

Engineers often refer to AI development as a “sprint,” striving to rapidly test, iterate, and deploy AI. But AI is deeply rooted in research, and the reality is that, traditionally, there’s a long road to production. With the right data tagging tools, though, rapid testing is within reach - in turn enabling faster iteration and deployment.

Investing in the best tools and the right people to accurately and efficiently annotate your data will make a huge difference in the success of a production-grade AI solution. And - with any luck - this “recipe” for data tagging and AI app development success will keep your customers coming back for seconds.