Acquiring and Formatting Data for Deep Learning Applications

This is the first article in an eight-part series: a practical guide to using neural networks to solve real-world problems. In this guide, we’ll cover several neural network architectures designed for sentence classification.

Specifically, we’ll be using neural networks to solve a problem we faced at Metacortex: we needed our bots to understand when a question was being asked, the goal being to create an intuitive query interface for an organization’s institutional knowledge. Today, we are utilizing a model very similar to the example(s) for sentence type classification that we will work through in this series.

What are sentence types, you ask?

In English, we have four main sentence types:

  • Declarative Sentences (Statements)
  • Imperative Sentences (Commands)
  • Exclamatory Sentences (Exclamations)
  • Interrogative Sentences (Questions)

Every sentence we speak or write falls into one of the categories above. Identifying the sentence type can be very useful for a bot: it can inform the bot whether it is being “commanded”, “asked a question”, or “receiving positive / negative feedback”. Thus, our goal is to classify written sentences, phrases, or comments (multiple-sentence, single-purpose comments) into one of the categories above.

What this guide covers:

  1. Acquiring & formatting data for deep learning applications
  2. Word embedding and data splitting
  3. Bag-of-words to classify sentence types (Dictionary)
  4. Classify sentences via a multilayer perceptron (MLP)
  5. Classify sentences via a recurrent neural network (LSTM)
  6. Convolutional neural networks to classify sentences (CNN)
  7. FastText for sentence classification (FastText)
  8. Hyperparameter tuning for sentence classification

What we will use:

  1. Github (repo)
  2. Python (3.6+)
  3. TensorFlow (1.12) + Keras (2.2.4) + NLTK (3.4)
  4. (Optional) CUDA + CuDNN

Obtain Data, Step One

The first step for all machine learning related work is obtaining data, and lots of it; the more data you can get, the better. You may even go so far as to generate data via simulations or bootstrap your real data with synthetic data. The way I think of machine learning is that we create a model to learn the contours of our data. The more data, the more robust the model can be (not saying it will be), because the data captures the problem at higher resolution and with more variation.

Where I tend to start is researching the literature, i.e. searching for “sentence type datasets”, “sentence type examples”, and so on. Usually, I research for a few hours; during that time, I’ll also see if I can find any examples I can build off of. I recommend trying this yourself: search for a labeled dataset of sentences, where each sentence is labeled as a question, statement, exclamation, or command.

….

I find the best way to learn is to fumble through, so I really do recommend searching for some labeled datasets.

Identified Datasets

Obviously, this can change, but below are the datasets / papers I found that may provide insights into our problem:

  • SQuAD 2.0 (a question answering dataset)
  • SPAADIA (a speech-act annotated dialogue corpus)

Unfortunately, that’s pretty much all I could find after a couple of hours of searching and reading, which surprised me. I expected more to be out there, as this is a fairly standard grade-school English problem / lesson. If anyone reading this finds more, please submit a PR to the GitHub repo (or provide a link in the comments).

With the above two datasets (SQuAD 2.0 and SPAADIA), we get a fairly large number of statements (declarative sentences) and questions (interrogative sentences); however, we have very few commands (imperative sentences) and no labeled exclamations.

Identify Data Limitations, Step Two

The next step is asking:

Is there enough data to train a neural network for this problem?

This may sound like a simple question, but the answer is complicated. Today, we don’t have a good way of defining how much data we need to solve a problem.

Although this may have changed by the time you are reading this, generally you consider one main factor:

Variance in the data: the more variance a problem set has, the more data you need.

In our case, our problem set is sentence type classification.

Sentences such as questions will almost always contain one of the words “who, what, where, when, why, how, or which“. Similarly, there are words that almost always signal the other sentence types, so our variance is relatively low; thus, a smaller dataset (~100,000 examples) is likely acceptable. From the SQuAD 2.0 and SPAADIA datasets we have roughly ~100k questions plus ~80k statements, which is probably more than enough to get started. However, for commands we have only 264 total samples, and we have no labeled exclamations.

This is where the interesting part begins:

Do we make our own dataset?

This can be very tricky and very time consuming, and it can potentially cost thousands of dollars.

Luckily, exclamations are not the most important type for a bot, so let’s skip them for now. However, classifying commands is still going to be important, so we need to create or find a dataset. To that end, I added around 1,000 examples of commands to the GitHub repo. Is that enough? Probably not, but it’s a start, and feel free to contribute examples to the dataset. In addition, in a later section we’ll cover creating a more robust dataset, which will expand those examples.

In summary, this leaves us with the following:

  • ~80k samples – Declarative Sentences (Statements)
  • 1,264 samples – Imperative Sentences (Commands)
  • 0 samples – Exclamatory Sentences (Exclamations)
  • ~100k samples – Interrogative Sentences (Questions)

Merging Datasets, Step Three

The next step is fairly self-explanatory: load the individual datasets, format them, and output a single unified dataset.

For our example, we load all the samples along with a label into a dict in Python (SPAADIA and command examples):
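A minimal sketch of this loading step, assuming each source is a one-sentence-per-row CSV file; the file names and the helper `load_labeled_sentences` are illustrative, not the repo’s actual code:

```python
import csv

def load_labeled_sentences(path, label, data=None):
    """Read one sentence per CSV row and map each to the given label."""
    data = {} if data is None else data
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row:  # skip blank rows
                data[row[0]] = label
    return data

# Merging several sources into one unified dict:
#   dataset = load_labeled_sentences("spaadia_statements.csv", "statement")
#   dataset = load_labeled_sentences("commands.csv", "command", dataset)
```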

Simple enough: the data structure looks like { “sentence to be classified”: “label” }.

Generate Variations in Sample Data, Step Four

One of the key aspects of developing a dataset for neural networks is making it as robust as possible. This can mean several things, but the general goal is to add noise or alterations to the data while ensuring your neural network can still classify each sentence appropriately.

In our case, that means applying alterations such as the following.

Take, for instance, this example:

Statement: This is a sentence about France. France is located in Europe.

Question: Where is France located?

The first thing we can do is remove punctuation:

Statement: This is a sentence about France France is located in Europe

Question: Where is France located

Another thing we can do is move a single statement to just before the question:

Statement: This is a sentence about France. France is located in Europe.

Question: France is located in Europe Where is France located?

We can also remove words at random (a method similar to the token masking BERT employs):

Statement: This is a sentence ___ France. France is located in _____.

Question: Where is _____ located?

This seems to massively increase our dataset, but each variation should still be considered “one sample”: just a different variation of the same sample. This is because the general content is still the same, and as such it does not teach the neural network content changes, just syntactic changes. Our neural network will still need tens of thousands of unique samples, but this ensures that different formats will have a minimal impact on performance.
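The alterations above can be sketched in Python; the helper names and the default masking rate are my own, not from the repo:

```python
import random
import string

def strip_punctuation(sentence):
    """Drop punctuation: 'Where is France located?' -> 'Where is France located'."""
    return sentence.translate(str.maketrans("", "", string.punctuation))

def prepend_statement(statement, question):
    """Move a (de-punctuated) statement in front of the question."""
    return strip_punctuation(statement) + " " + question

def mask_random_words(sentence, rate=0.15, mask="_____"):
    """Replace a random subset of words with a mask token."""
    return " ".join(mask if random.random() < rate else w
                    for w in sentence.split())
```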

Data Augmentation, Step Five

The final (and optional) step in the process of creating a dataset can be data augmentation. This typically means adding parts-of-speech (POS) tags, such as “noun, verb, adjective, etc.” (we accomplish this via the Python package NLTK).

If we wished to apply that, we could do something such as the following (using the same example as in the prior section):

Statement: France Proper Noun is Verb located Verb in Preposition Europe Proper Noun.

Question: Where Adverb is Verb France Proper Noun located Verb?
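A sketch of how the interleaved tags above could be produced. In practice, NLTK’s `nltk.pos_tag` would supply the (word, tag) pairs; they are hardcoded here so the sketch runs without the tagger model. `TAG_NAMES` is a small illustrative subset of the Penn Treebank tagset, and `interleave_pos` is my own helper:

```python
# Map a few Penn Treebank tags to readable names (partial, for illustration).
TAG_NAMES = {
    "NNP": "Proper Noun", "VBZ": "Verb", "VBN": "Verb",
    "IN": "Preposition", "WRB": "Adverb",
}

def interleave_pos(tagged):
    """Insert a readable POS label after each word we have a name for."""
    parts = []
    for word, tag in tagged:
        parts.append(word)
        if tag in TAG_NAMES:
            parts.append(TAG_NAMES[tag])
    return " ".join(parts)

# e.g. tagged = nltk.pos_tag(nltk.word_tokenize("France is located in Europe"))
tagged = [("France", "NNP"), ("is", "VBZ"), ("located", "VBN"),
          ("in", "IN"), ("Europe", "NNP")]
print(interleave_pos(tagged))
# -> France Proper Noun is Verb located Verb in Preposition Europe Proper Noun
```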

The additional words may make the sentence seem funky. However, the neural network can learn from the structure; for example, commands typically lead with a verb. Neural networks learn to weight each of the words accordingly, so if the second word being “Verb” correlates with the sentence being a “command”, it’ll help improve the network’s stability and accuracy.

Parts-of-speech tags will also assist the neural network’s understanding of words it hasn’t seen before, which it may encounter during testing.

How much this particular data augmentation will help a network is something that would have to be determined experimentally (which we will test later in this series of articles).

Note: there are other methods of data augmentation; however, parts-of-speech tagging is common and generally improves robustness and accuracy.

Data Shuffling, Step Six

The final step is to take the sentences in the form { “sentence(s)”: “label” } and shuffle them into a random order.

The following code will split the data from a python dict to two correlated lists, which are randomly sorted:
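A minimal sketch of that split-and-shuffle step (the function name is mine, not necessarily the repo’s):

```python
import random

def shuffle_dataset(data):
    """Turn {sentence: label} into two correlated, randomly ordered lists."""
    items = list(data.items())
    random.shuffle(items)  # in-place shuffle keeps each pair aligned
    sentences = [sentence for sentence, _ in items]
    labels = [label for _, label in items]
    return sentences, labels

data = {"Where is France located": "question",
        "France is located in Europe": "statement",
        "Find France on the map": "command"}
sentences, labels = shuffle_dataset(data)
```

Because the sentence and its label are shuffled as a pair, `sentences[i]` always still corresponds to `labels[i]`.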

This ensures that any ordering of the data samples is mixed up, which helps ensure testing is valid. It will also improve the robustness of neural networks such as LSTMs (or other models) which have “memory” associated with them (more on that later).

Up Next…

The next article will cover taking the sentences and labels we output here and converting them into what our neural networks will ingest, train, and test on.

Full Guide:

  1. Acquiring & formatting data for deep learning applications
  2. Word embedding and data splitting (Next Article)
  3. Bag-of-words to classify sentence types (Dictionary)
  4. Classify sentences via a multilayer perceptron (MLP)
  5. Classify sentences via a recurrent neural network (LSTM)
  6. Convolutional neural networks to classify sentences (CNN)
  7. FastText for sentence classification (FastText)
  8. Hyperparameter tuning for sentence classification
