Snips Data Generation as a service: fixing the cold-start problem for natural language interfaces



By David Leroy, Alice Coucke, Thibault Gisselbrecht and Joseph Dureau, illustrations by Quentin Le Bras

"Sorry dat ik niet begreep wat je zei"

In a fairly short time span, natural language has become the next frontier for human-machine interactions: it lets customers use devices in an intuitive way, without having to learn yet another unfriendly interface. This new paradigm has opened up a whole new world of experiences for brands, and the hype is real: the chatbot world has gone wild in recent years, and the same is currently happening with voice interfaces.

Note to the reader: natural language covers both text and voice. The examples chosen here focus on voice, since this is our current product offering, but most of what is said below also applies to chatbots.

It's all about quality and ownership of experiences

As a brand, you're probably considering adding natural language support to the devices you're selling. Let's assume, for instance, that your company is building a connected speaker that supports voice-enabled song navigation and volume management. Wanting to deliver a great experience to your customers, you'll have two priorities:

  • You’ll want to deliver an experience that feels intuitive, where your customer can talk in any natural way and be understood, without having to guess the boundaries of your product
  • You’ll also want to provide a fully unique experience, that perfectly exposes your product’s features, and matches your brand identity

You might have considered embedding a generic GAFA assistant on your device, favouring quality over ownership and customer privacy. You may also have considered Alexa's custom skills, or Google's Dialogflow custom agents. These solutions will allow you to fully customize the experience you want to provide. Unfortunately, for a custom assistant to reach good performance, you will need data.

Voice assistant solutions with respect to quality and ownership

At Snips, we don't believe brands should have to sacrifice quality for ownership, or vice versa. That's why we've created a service that enables any company to build their own voice assistants, matching the big players' built-in standards without having to make any compromise on their product experience.

How do we do that? We address the quality issue by providing state-of-the-art algorithms for speech recognition and natural language understanding. But algorithms aren't enough: unsurprisingly, to build a solid voice experience, you need training examples.

NB: We've built our service in such a way that its output (i.e. training examples) can be downloaded and used on any other voice or chatbot platform such as Alexa, Dialogflow, Wit.ai, etc.

Before we dive into how we generate those training examples, let’s define how you can build your own voice assistant, and consider why data is so critical.

Building a new voice experience

1. Define your objective: understand the intention of a query, and its parameters

The first thing you will have to do when creating your own custom assistant is to define what we call an ontology: in the smart speaker example, this basically means defining the scope of intentions it should support.

An ontology is a set of intents (user intentions) with their slots. A slot is a key piece of information to be extracted from the query, such as an artist name. Each intent is likely to be matched with an action: an instruction to be executed directly on the device, an external API call, and so on.

Ontology for the smart speaker example
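To make this more concrete, here is a minimal sketch of what such an ontology could look like once written down. The "playArtist" intent and "artistName" slot reappear later in this article; the volume intent, slot and action names are purely illustrative assumptions, not a specific platform's schema:

```python
# Minimal sketch of an ontology for the connected speaker example.
# "playArtist" / "artistName" come from this article; the volume intent,
# slot and action names are illustrative assumptions, not a specific API.
ontology = {
    "intents": {
        "playArtist": {
            "slots": ["artistName"],      # e.g. "Daft Punk"
            "action": "start_playback",   # instruction executed on the device
        },
        "setVolume": {
            "slots": ["volumeLevel"],     # e.g. "fifty percent"
            "action": "adjust_volume",    # could also be an external API call
        },
    }
}
```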

2. Check what data you have in store

You have defined the scope of what your assistant should cover. Now you need training examples to train the NLU (natural language understanding) component. This component will be in charge of detecting an intent and its slots in a written utterance.
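For reference, here is a rough sketch of what "detecting an intent and its slots" amounts to; the output format below is hypothetical and shown only to illustrate the idea:

```python
utterance = "play me something by Daft Punk"

# Hypothetical output of an NLU component for this utterance: the detected
# intent plus the slot values extracted from the text (character offsets
# are those of "Daft Punk" in the sentence above).
parsed = {
    "intent": "playArtist",
    "slots": [
        {"name": "artistName", "value": "Daft Punk", "range": (21, 30)},
    ],
}
```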

Since adding a conversational interface to your speaker is a new feature, it’s very likely that you do not have any existing data matching the intents in your ontology.

If you do, it probably comes unlabelled: the intention behind each query isn’t explicit in your dataset, and the slots of each query aren’t tagged as such. Making sense of this raw data will take massive, error-prone efforts.

In addition, you should think of whether this data is representative of the way you want your users to talk to your assistant. Historical logs of a search bar, for example, won’t do a very good job because users generally don’t form sentences the same way they would do by voice. All in all, you probably don’t have the data you need to train a voice assistant that matches your expectations.

3. Feed your assistant with training data

You're left with no data to train your first assistant: this is referred to as the cold-start problem.

Until today, you could tackle this problem in two different ways: either narrow the coverage of your assistant and focus on simple keyword spotting, or start with a dozen training examples and count on your first users to pick up the pieces. Let's review these two approaches.

The keyword spotting approach

Keyword spotting is the easy approach and a short-term win: matched keywords will trigger actions, and you’ll get a very predictable behaviour on a limited scope.

But it ties your assistant to a "keyword budget": the size of the vocabulary you can handle. As you enrich your natural language experience, your available keyword budget will shrink in no time. For instance, supporting conjugation, gender and plural forms increases vocabulary size by an order of magnitude. If you want your assistant to be robust to wording diversity, you'll rapidly hit the limits of such an approach.

To illustrate this, let's just consider that one of your intents has a slot related to a date or a time. You would be able to detect a date such as "tomorrow" without any trouble, but catching and interpreting expressions like "two hours from now" brings you to a whole new level of complexity.

Back to the connected speaker, these are the kinds of commands you could expect to handle with such an approach:

Examples of voice commands that can be handled by keyword spotting

On the other hand, you would have trouble catching more advanced formulations:

Examples of voice commands that need a more advanced speech-to-meaning engine
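To make the contrast concrete, a toy keyword spotter could look like the sketch below (purely illustrative code, not how any particular product works). It covers simple commands like the ones above, but has no way to handle the richer formulations:

```python
from typing import Optional

# Toy keyword spotter: exact keywords mapped to device actions.
# Keywords and action names are illustrative assumptions.
KEYWORD_ACTIONS = {
    "next": "next_track",
    "previous": "previous_track",
    "louder": "volume_up",
    "quieter": "volume_down",
}

def spot_keyword(utterance: str) -> Optional[str]:
    """Return the action of the first keyword found in the utterance, if any."""
    words = utterance.lower().split()
    for keyword, action in KEYWORD_ACTIONS.items():
        if keyword in words:
            return action
    return None

print(spot_keyword("next"))                      # "next_track"
print(spot_keyword("skip this one, I hate it"))  # None: no keyword matched
```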

The “dozen training examples" data regime

There are of course alternative approaches to keyword spotting. Some conversational assistant platforms out there (including Snips) will provide you with the right tools to train a more advanced, machine-learning-based NLU model from a small set of training examples.

Providing sentences for the playArtist intent — artistName highlighted in blue
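In code, such labelled sentences could be represented along the following lines; the chunk-based format is a hypothetical illustration of what the annotation amounts to, not a specific platform's schema:

```python
# Hypothetical representation of labelled training examples for playArtist:
# each sentence is split into plain text chunks and tagged slot chunks.
training_examples = [
    [{"text": "play some "}, {"text": "Daft Punk", "slot": "artistName"}],
    [{"text": "I want to listen to "}, {"text": "Miles Davis", "slot": "artistName"}],
    [{"text": "put on something by "}, {"text": "Bowie", "slot": "artistName"}],
]
```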

Writing down these training examples yourself is cumbersome work: your creativity will collapse after a dozen different formulations. Unfortunately, in most cases this won't be enough to reach good performance, as we suggest in this benchmark. In some cases, you would have to gather up to 2000 training examples per intent to reach satisfactory performance.

One way to avoid collecting this substantial amount of sentences is to rely on user data to improve quality over time. This is the common approach to building conversational UIs: deploy a fairly simple first version and leverage the infamous feedback loop. Unfortunately, this approach has some major drawbacks:

  • First impressions matter: Relying on early customers to pick up the pieces is a risky move. People expect experiences that at least match the Google/Amazon built-in standard, and poor first impressions can rapidly backfire. In the best-case scenario, people will learn by trial and error which kinds of sentences are properly parsed and then stick to them, and that's not what voice should be about.
  • Measuring performance: Starting with (very) little data also means having no test set to evaluate your assistant on, which makes progress very hard to track. You need to define a valid quality assessment strategy from the beginning, and this strategy involves collecting a test set that is representative of your product usage in real conditions (see the evaluation sketch right after this list).
  • Building the feedback pipeline is heavy work: Collecting user data means collecting an audio or text stream straight from your customers' devices, which involves dealing with some very sensitive data. Collecting is one thing; you then need this data to be labelled with the right intent and slots, which involves serious crowdsourcing work, and most platforms out there will not support it out of the box.
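As a minimal sketch of what such a quality assessment can look like, assuming you have collected a small representative test set, intent classification can be scored with a standard metric such as scikit-learn's f1_score. The test utterances, intent names and the parse_intent placeholder below are all illustrative:

```python
from sklearn.metrics import f1_score

# Hypothetical held-out test set: utterances annotated with their true intent.
test_set = [
    ("play some Daft Punk", "playArtist"),
    ("turn the volume down", "setVolume"),
    ("skip this song", "nextTrack"),
]

def parse_intent(utterance: str) -> str:
    # Placeholder: replace with your trained NLU component's prediction.
    return "playArtist"

y_true = [intent for _, intent in test_set]
y_pred = [parse_intent(text) for text, _ in test_set]

# Macro-averaged F1 over intents; track this metric as your training set grows.
print(f1_score(y_true, y_pred, average="macro"))
```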

Towards a richer experience

Unsurprisingly, for a better experience, you need to gather more data. This data has to be representative of the way your end users will interact with your device: the most natural formulations should be oversampled, while the least natural ones should also be covered, but in smaller proportions. Gathering this balanced and comprehensive dataset is critical to the success of your product, and most voice assistant solutions do not address this issue.
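One simple way to picture this balance is to sample sentences from a set of formulation templates, weighting the most natural ones more heavily. The templates, weights and artist names below are illustrative only, not how the Snips engine works:

```python
import random

# Illustrative weighted sampling: natural formulations dominate, rarer
# phrasings are still covered in smaller proportions.
templates = {
    "play {artist}": 0.6,                   # most natural, oversampled
    "put on some {artist}": 0.3,
    "I would like to hear {artist}": 0.1,   # least natural, still covered
}
artists = ["Daft Punk", "Miles Davis", "Bowie"]

def sample_sentence() -> str:
    template = random.choices(list(templates), weights=list(templates.values()))[0]
    return template.format(artist=random.choice(artists))

print([sample_sentence() for _ in range(5)])
```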

Wouldn't it be easier to build a first version that just works: one you've been able to evaluate before shipping, that provides your users with an amazing first experience, handles fully natural interactions, and needs no more than classic analytics to make sure everything's on track?

This is all about collecting usage data upfront and ensuring quality before launch.

Introducing the Snips Data Generation service

At Snips, we believe that we can build great AIs without compromising user privacy, that is, without collecting any data from our users. When we started working on voice assistants, we had to figure out a way to generate training data for our NLU (Natural Language Understanding) and ASR (Automatic Speech Recognition) algorithms. That's why we created a data generation service that we are now opening to the public. This service allows you to generate any volume of training examples, which come pre-tagged for a minimal labelling effort.

You're now able to build a robust assistant for any use case in no time, saving you the cost of a bad first impression. And once again, the generated training examples can be downloaded and used on any other voice or chatbot platform.

This service has been used internally at Snips for over a year; it is the process behind our benchmark data. We've shown that, on the set of intents considered for the benchmark, you could expect up to a 50% gain in performance, reaching a 93% F1-score, when training our intent parser on 2000 sentences instead of 10, as shown below.

Performance of Snips' NLU with respect to the number of training examples provided

This of course does not mean that every intent needs 2000 training examples. What you need to figure out is the right data regime, where your assistant's performance reaches a plateau: it has learnt to generalize from the data it has seen to the point where it rarely makes mistakes on unseen data.
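A practical way to find that plateau is to train on increasingly large subsets of your data and watch the metric flatten. The sketch below outlines that procedure; train_fn and eval_fn stand for whatever training and evaluation functions your own pipeline provides, and the subset sizes are illustrative:

```python
# Sketch of a learning-curve check to locate the data-regime plateau.
def learning_curve(examples, test_set, train_fn, eval_fn,
                   sizes=(10, 50, 200, 500, 1000, 2000)):
    """Train on growing subsets of `examples` and report the score at each size."""
    curve = []
    for size in sizes:
        model = train_fn(examples[:size])          # train on the first `size` examples
        curve.append((size, eval_fn(model, test_set)))
    # Stop adding data once consecutive scores stop improving noticeably.
    return curve
```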

Our data generation engine involves a mix of machine learning algorithms and humans in the loop. We've put an emphasis on ensuring wording diversity to maximize coverage. Quality is guaranteed by human validation paired with a semi-supervised learning approach that gets rid of most of the ambiguous and ill-formatted training examples.

Sample training examples produced by data generation

We're just releasing this feature, so your feedback is more than welcome. You can reach out to us by email at support-data-generation@snips.ai.

If you want to start using it, head to our console and start building awesome assistants! We're pretty excited to see what you'll be building with it!

If you liked this article and want to support Snips, please share it :)

Follow us on Twitter

If you want to work on AI + Privacy, check our jobs page!