How to train your AI algorithm

Successful AI algorithms are built on a foundation of training data, but sourcing data that fits your needs and meets volume requirements is harder than you might think, particularly when it comes to developing AI-driven applications and smart voice assistants. By Richard Downs, Director Northern Europe, Applause.


Businesses face several challenges when it comes to training their algorithms to respond to real-world scenarios. Sourcing data at scale is extremely challenging: businesses need to be able to leverage large and diverse samples, or crowds, of people representative of their target market. Delivering projects of this scale takes a dedicated resource; in effect, a crowdtesting (or distributed testing) solution, which provides businesses with access to a global community of skilled testers who work remotely. This model provides an embedded infrastructure that can be scaled up or down to meet requirements.

Enterprises and consumer brands have been using crowdtesting services for over a decade. Crowdtesting has become a well-established model that operates in tandem with in-house teams to complement integrated QA testing. Traditionally used to test apps, websites and other digital properties, it is now integral to sourcing the data needed to train AI algorithms, providing businesses with the scope and scale they need to bring new AI applications to market.

Despite the advantages this model offers, there are still a number of challenges businesses need to address. Here we explore three of the most significant when sourcing training data.

1. Quantity of data sources

Enormous amounts of data are required to develop an effective algorithm. In the case of training a smart voice assistant developed for the UK market, the algorithm required over 100,000 voice utterances. This eventually required utterances from 972 unique people who were sourced from almost every corner of the UK.

In another example, a business needed to train its AI algorithm to read handwritten documents. The brief was to deliver thousands of unique handwriting samples. The quantity of individuals was a critical factor, because the algorithm needed unique samples from a broad spectrum of people. More than 1,000 individuals had to be sourced to provide handwritten documents that met the requirements. The size of the crowd was critical to the success of both projects.

The majority of businesses don’t have access to the large number of participants needed to contribute data. A business could ask its employees to get samples from friends and family, but that would be ineffective and extremely difficult to project manage.

2. Quality of data

So, how do you produce quality training data? Let’s return to the handwriting samples. In that instance, the artifacts had to be legible and easily accessible, and had to meet a host of other requirements based on the individual project goals. More specifically, there couldn’t be any defects on the page, nor even a single folded margin in the middle of the page. When users scanned the documents, they needed good light conditions or the ability to use flash in dark settings. During any project there are always specific requirements that need to be tracked and monitored very carefully.

Every individual artifact needs to be tested for quality to ensure the algorithm will work as intended. Again, this process takes up a considerable amount of time and resources. While businesses could do this internally, it would prove costly and inefficient.

In the case of the handwriting samples, if the business had taken responsibility for analysing every single document to confirm its quality, the review would have taken months and created a logistical nightmare. Instead, the process was completed in a matter of weeks.
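Parts of that screening can also be automated before human review. Below is a minimal sketch of such a pre-screen, assuming the samples arrive as image files and using OpenCV; the checks and thresholds are illustrative assumptions, not the actual criteria used in the project above.

```python
# Illustrative pre-screen for scanned handwriting samples. The thresholds
# are assumptions for this sketch, not values from any real project.
import cv2

MIN_BRIGHTNESS = 80.0   # mean grey level below this suggests poor lighting
MIN_SHARPNESS = 100.0   # Laplacian variance below this suggests blur

def prescreen_scan(path: str) -> list[str]:
    """Return a list of quality issues found in a scanned sample."""
    issues = []
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        return ["file could not be read as an image"]
    if image.mean() < MIN_BRIGHTNESS:
        issues.append("image too dark - rescan in better light or with flash")
    # Variance of the Laplacian is a common focus/blur heuristic.
    if cv2.Laplacian(image, cv2.CV_64F).var() < MIN_SHARPNESS:
        issues.append("image appears blurred or out of focus")
    return issues
```

A gate like this only catches mechanical defects such as darkness or blur; legibility and content requirements still need human reviewers, which is where a managed crowd comes in.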

3. Diversity of data

Besides producing reams of quality data, your team must also have a diverse range of artifacts to develop an accurate algorithm. Without diversity in the training data, the algorithm won’t be able to recognise a broad range of possibilities, which will render it ineffective.

When building an AI algorithm, you shouldn’t rely on a single person to provide the artifacts used to train it. To train an algorithm properly, you need different types of data and inputs, including geographical data, demographic information, types of documents and so on. Otherwise, the process will not produce an output strong enough to serve the needs of a diverse customer base.

A crowdsourced community provides businesses with access to a global pool of participants. This model enables businesses to filter for hyper-specific demographics, including gender, race, native language, location, skill set and many other criteria.
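To make that concrete, here is a minimal sketch of what demographic filtering over a participant pool might look like; the Participant fields and the select_crowd helper are hypothetical, not any platform’s actual schema or API.

```python
# Hypothetical participant records and filtering; field names are
# illustrative and do not reflect any real crowdtesting platform's schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Participant:
    participant_id: str
    country: str
    region: str
    native_language: str
    gender: str

def select_crowd(pool: list[Participant], **criteria: str) -> list[Participant]:
    """Keep participants whose attributes match every given criterion."""
    return [
        p for p in pool
        if all(getattr(p, field) == value for field, value in criteria.items())
    ]

# e.g. sourcing native English speakers across the UK for a voice project:
# uk_crowd = select_crowd(pool, country="UK", native_language="English")
```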

Evolve with Your Project

Unfortunately, no project ends up exactly how it started. Needs shift over time, and you have to change footing, gather new data points, and source new testers or resources to input the information as the project evolves. When embarking on a project like this, always consider how you’re going to manage the data input and data quality process.

By continually ingesting new data, algorithms identify new trends and patterns and automatically adjust their predictions and outputs to better reflect the current landscape. However, AI algorithms are only as good as the quality of data they receive. Crowdtesting provides the diversity, connectivity and scale required to meet the demands of training AI algorithms and testing AI applications.
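As a minimal sketch of that continual-ingestion loop, the snippet below uses scikit-learn’s partial_fit to fold vetted batches into an existing model incrementally; the batch source and feature pipeline are stand-ins, not a production design.

```python
# Sketch of incrementally updating a model as new labelled batches arrive.
# The batch source is a stand-in; any stream of (features, labels) works.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # all possible labels must be declared up front

def ingest_batch(features: np.ndarray, labels: np.ndarray) -> None:
    """Fold a fresh, quality-checked batch into the existing model."""
    model.partial_fit(features, labels, classes=classes)

# As new artifacts pass the quality checks, each vetted batch nudges the
# model toward current trends without retraining from scratch:
# ingest_batch(X_new, y_new)
```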
