Sentence Generator: An example of dataset design
The Problem:
We needed to produce a machine learning dataset consisting of thousands of unique utterances, each expressing the same intent. Examples of such intents include asking a voice assistant for the weather forecast, requesting news updates, playing music, making a phone call, and setting a timer.
Challenges:
We needed to produce high-quality data, meaning it sounded natural and contained a lot of variety.
The dataset needed to be delivered with an extremely fast turnaround.
Hiring and onboarding freelancers is both an operational cost and a time-consuming endeavor.
Quality assurance of the data produced by freelancers is a further challenge on top of data production.
The Solution:
I built sample sets of data to analyze and understand the problem. In doing so, I recognized that breaking sentences down into syntactic and semantic components would yield several templates that sentences of the same intent could follow. For each component, I would find words or short phrases that could be used interchangeably, so that each template could yield potentially hundreds of unique sentences.
Ensuring that the solution produced high-quality data would then become a matter of introducing enough variety, which would require linguistic analysis and data management.
To illustrate:
Intent: Tell a voice assistant to play a song or record.
Sample Templates for this intent, composed of Components (in brackets):
Template 1: [play] [name of record]
Template 2: [play] [name of record] [on] [my] [speaker]
Template 3: [play] [name of record] [on] [my] [living room] [speaker]
Wordlists: for each Component of a template, a list of sample words that can be used interchangeably:
[play] = [play, stream, put on, air, run, start, fire up, turn on, start playing]
[on] = [on, over, using]
[my] = [my, our, the, this]
[speaker] = [speaker, sound system, stereo, soundbar, audio system]
[living room] = [living room, kitchen, dining room, bookshelf, bedroom]
Below I demonstrate what it looks like to go from templates and wordlists to a dataset of unique utterances:
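Template 1: [play] [name of record]
[play]: play, put on, air, run, start, fire up, turn on, start playing (8 options)
[name of record]: a single placeholder for the record name
Template 1 yields 8 unique sentences.
Template 2: [play] [name of record] [on] [my] [speaker]
[play]: the same 8 options
[on]: on, over, using (3 options)
[my]: my, our, the, this, that (5 options)
[speaker]: speaker, sound system, stereo, soundbar, audio system (5 options)
Template 2 yields 8 × 3 × 5 × 5 = 600 unique sentences.
Template 3: [play] [name of record] [on] [my] [living room] [speaker]
[living room]: 5 options, combined with the Template 2 components
Template 3 yields 600 × 5 = 3,000 unique sentences.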
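The expansion from templates and wordlists into sentences is a Cartesian product over each component's wordlist. Here is a minimal Python sketch of that idea, not the tooling used on the project. The names `WORDLISTS`, `TEMPLATES` and `expand` are hypothetical, the record name is kept as a single placeholder, and the [play] and [my] lists are the 8-word and 5-word versions that the sentence counts are based on.

```python
from itertools import product

# Wordlists keyed by component name. "<name of record>" is a placeholder,
# so that component contributes exactly one option per sentence.
WORDLISTS = {
    "play": ["play", "put on", "air", "run", "start", "fire up",
             "turn on", "start playing"],
    "name of record": ["<name of record>"],
    "on": ["on", "over", "using"],
    "my": ["my", "our", "the", "this", "that"],
    "speaker": ["speaker", "sound system", "stereo", "soundbar",
                "audio system"],
    "living room": ["living room", "kitchen", "dining room", "bookshelf",
                    "bedroom"],
}

# Each template is an ordered sequence of component names.
TEMPLATES = {
    "Template 1": ["play", "name of record"],
    "Template 2": ["play", "name of record", "on", "my", "speaker"],
    "Template 3": ["play", "name of record", "on", "my", "living room",
                   "speaker"],
}


def expand(template, wordlists):
    """Return every sentence obtained by picking one word per component."""
    options = [wordlists[component] for component in template]
    return [" ".join(words) for words in product(*options)]


for name, template in TEMPLATES.items():
    sentences = expand(template, WORDLISTS)
    # Prints 8, 600 and 3000 for Templates 1, 2 and 3 respectively.
    print(name, len(sentences), "e.g.", sentences[-1])
```

Because the count is simply the product of the wordlist sizes, adding one word to any list or one component to a template multiplies the output accordingly.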
We can build many more templates using variations on the one sampled above, ensuring greater variety and ultimately higher quality data:
[could you] [play] [name of record] [on] [my] [living room] [speaker]
[play] [name of record] [on] [my] [living room] [speaker] [for me]
[play] [name of record] [on] [my] [living room] [speaker] [please]
[could you] [play] [name of record] [on] [my] [living room] [speaker] [please]
Each variation keeps the Template 3 components and adds optional components such as [could you], [for me] and [please].
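Template variations of this kind can themselves be generated rather than written by hand. The sketch below is one possible approach under stated assumptions: it treats [could you] as an optional prefix and [for me] and [please] as optional suffixes around the Template 3 components. The function name `template_variants` and the prefix/suffix split are illustrative choices, not the process described in this article, and no wordlists are assumed for the new components.

```python
from itertools import product

# The Template 3 components, used as the fixed core of every variation.
BASE = ("play", "name of record", "on", "my", "living room", "speaker")


def template_variants(base, prefixes, suffixes):
    """Build every template with an optional prefix and an optional suffix."""
    variants = []
    # None stands for "leave this slot empty".
    for prefix, suffix in product([None, *prefixes], [None, *suffixes]):
        template = tuple(c for c in (prefix, *base, suffix) if c is not None)
        variants.append(template)
    return variants


VARIANTS = template_variants(
    BASE, prefixes=["could you"], suffixes=["for me", "please"]
)

for template in VARIANTS:
    print(" ".join(f"[{component}]" for component in template))
```

With one optional prefix and two optional suffixes this produces six templates, including the unchanged base. Each of them can then be expanded against its wordlists in the same way as Templates 1 to 3.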
With this fundamental system in place, the focus shifted to ensuring enough syntactic and semantic variety through unique sentence structures. People often use this type of sentence structure when telling a voice assistant to play a song, but there are other natural ways to utter this type of command as well, and we needed to produce natural, varied data.
We modeled this through deep linguistic analysis, to ensure that we had enough unique templates to capture the breadth and variety of ways people might express this intent.
From there, the task became an effort in data management, quality assurance, and timely delivery.
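One simple form the data management and quality assurance step could take is checking that every delivered utterance is unique and that no single template dominates the set. The following is a hypothetical sketch only: `balanced_sample`, its parameters and the toy input are illustrative, and nothing here is taken from the project's actual tooling.

```python
import random
from collections import Counter


def balanced_sample(sentences_by_template, total, seed=0):
    """Pick up to `total` unique sentences, taking turns across templates."""
    rng = random.Random(seed)
    pools = {}
    for name, sentences in sentences_by_template.items():
        pool = sorted(set(sentences))  # drop duplicates within a template
        rng.shuffle(pool)
        pools[name] = pool

    picked, seen = [], set()
    # Round-robin over templates until the target is met or pools run dry.
    while len(picked) < total and any(pools.values()):
        for name, pool in pools.items():
            if len(picked) >= total:
                break
            while pool:
                sentence = pool.pop()
                if sentence not in seen:  # skip duplicates across templates
                    seen.add(sentence)
                    picked.append((name, sentence))
                    break
    return picked


# Toy input: two templates that share one sentence.
toy = {
    "Template A": ["play jazz", "put on jazz", "play jazz"],
    "Template B": ["play jazz", "play jazz please"],
}
sample = balanced_sample(toy, total=10)
print(len(sample), Counter(name for name, _ in sample))
```

Taking turns across templates keeps a large template, such as one that yields thousands of sentences, from crowding out smaller ones in the final dataset.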
Figure: wordlist tables for the sample templates. Template 1 yields 8 unique sentences; Template 2 yields 600 unique sentences.