Word Embedding and Data Splitting

This is the second article in an eight-part series: a practical guide to applying neural networks to real-world problems.

Specifically, a problem we faced at Metacortex: we needed our bots to understand when a question was being asked or a command was given, with the goal of querying the institutional knowledge base to provide answers. Today, we use a model very similar to these example(s) for sentence type classification, which we will work through here.

Note: the prior entry in this series, Acquiring and Formatting Data for Deep Learning Applications, covers formatting the data, and we will pick up right where it left off!

What this guide covers:

  1. Acquiring & formatting data for deep learning applications
  2. Word embedding and data splitting
  3. Bag-of-words to classify sentence types (Dictionary)
  4. Classify sentences via a multilayer perceptron (MLP)
  5. Classify sentences via a recurrent neural network (LSTM)
  6. Convolutional neural networks to classify sentences (CNN)
  7. FastText for sentence classification (FastText)
  8. Hyperparameter tuning for sentence classification

What is a Word Embedding?

To the point,

A word embedding is a numerical representation of a word, typically a vector.

In other words, a word embedding is a vector that represents the features of a word. One example of a feature could be the length of the word; another could be the number of vowels. Often these features are “learned”, but for this guide we will keep it simple.

In our case, we will be using one-hot encodings, basically mapping every word to an index in the vocabulary. Arguably this is a 1D “embedding”, where the “feature” is its place in the mapping of word -> value:

Sentence(s): This is a sentence.

Embedding: [ 1, 2, 3, 4 ] -> [This, is, a, sentence]

Mapping: { “This”: 1, “is”: 2, “a”: 3, “sentence”: 4 }

Note that in the example above punctuation is removed. This is not necessarily recommended, so we will keep the punctuation in our case:

Sentence(s): This is a sentence.

Embedding: [ 1, 2, 3, 4, 5]

Mapping: { “This”: 1, “is”: 2, “a”: 3, “sentence”: 4, “.”: 5 }

With these embeddings, if we then add a second sentence, the same values representing the existing words are reused:

Sentence(s): This is a sentence. This is a second sentence.

Embedding: [ 1, 2, 3, 4, 5, 1, 2, 3, 6, 4, 5 ]

Mapping: { “This”: 1, “is”: 2, “a”: 3, “sentence”: 4, “.”: 5, “second”: 6 }

Notice that the same mapping is used for each word, with the new value 6 assigned to the word “second.”
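
To make this concrete, here is a minimal sketch in Python of building such a mapping (the function name build_mapping is illustrative):

```python
def build_mapping(tokens, mapping=None):
    """Assign each previously unseen token the next unused index, starting at 1."""
    if mapping is None:
        mapping = {}
    for token in tokens:
        if token not in mapping:
            mapping[token] = len(mapping) + 1
    return mapping


tokens = ["This", "is", "a", "sentence", ".", "This", "is", "a", "second", "sentence", "."]
mapping = build_mapping(tokens)
embedding = [mapping[token] for token in tokens]
# embedding -> [1, 2, 3, 4, 5, 1, 2, 3, 6, 4, 5]
```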

What is a Sentence Embedding?

Sentence embeddings are a fair bit more complicated, but follow the same principle:

A sentence embedding is a numerical representation of a sentence.

Typically, the requirement is that all sentences that relay the same information are represented by the same value.

However, this can be difficult to discern, even for humans — consider our previous example:

Sentence(s): This is a sentence. This is a second sentence.

In the case above, do they both relay the same information?

Personally, I would consider both sentences to relay the same information, as the term “second” does not meaningfully change the meaning:

Sentence(s): This is a sentence. This is a second sentence.

Embedding: [ 1, 1 ]

Mapping: { “This is a ______ sentence.” }

Of course, if we were to swap the word “second” for the word “not”, as in:

Sentence(s): This is a sentence. This is not a sentence.

This would change the meaning, at least in my view (and probably others'). This is why sentence embedding is difficult and still a very active area of research. Luckily, we don't really need it to solve the problem at hand! With that in mind, I will not attempt to cover it further.

I'd recommend checking out the sentence embedding Wikipedia entry for more information.

Word + Punctuation + POS Tags Embedding

For our case, we will use a combination of embedded words, punctuation, and parts-of-speech (POS) tags, as mentioned in the data augmentation step of the prior article.

This means, using the same example as above, we'll get something like:

Sentence(s): This is a sentence. This is a second sentence.

Sentence(s) + POS Tags: This DT is VBZ a DT sentence NN . This DT is VBZ a DT second JJ sentence NN .

Embedding: [ 1, 2, 3, 4, 5, 2, 6, 7, 8, 1, 2, 3, 4, 5, 2, 9, 10, 6, 7, 8 ]

Mapping: { “This”: 1, “DT“: 2, “is”: 3, “VBZ“: 4, “a”: 5, “sentence”: 6, “NN“: 7, “.”: 8, “second”: 9, “JJ“: 10 }

We can accomplish this style of data embedding + augmentation on the whole dataset with the following Python (3.6+) code (assuming our data is in a list of lists):
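
A minimal sketch of one way to do this, assuming NLTK for tokenization and POS tagging (the function name encode_comments is illustrative, and comments is the list of comment strings pulled from our dataset):

```python
import nltk
from nltk import pos_tag, word_tokenize

# One-time downloads for the tokenizer and POS tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")


def encode_comments(comments, mapping=None):
    """Convert each comment (a string) into a vector of integers,
    interleaving every word/punctuation token with its POS tag."""
    if mapping is None:
        mapping = {}
    encoded = []
    for comment in comments:
        vector = []
        for word, tag in pos_tag(word_tokenize(comment)):
            # Punctuation is tagged as itself; avoid adding it twice
            tokens = (word,) if word == tag else (word, tag)
            for token in tokens:
                if token not in mapping:
                    mapping[token] = len(mapping) + 1
                vector.append(mapping[token])
        encoded.append(vector)
    return encoded, mapping


encoded_comments, mapping = encode_comments(
    ["This is a sentence.", "This is a second sentence."]
)
# encoded_comments -> [[1, 2, 3, 4, 5, 2, 6, 7, 8],
#                      [1, 2, 3, 4, 5, 2, 9, 10, 6, 7, 8]]
```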

That's actually it; not super complicated. The code above converts each of our sample comments to a vector of encoded words + punctuation + POS, as seen in the examples.

Word Embedding – Sentence Type Categories

For training purposes, we would have to do the same thing for our classification categories, so:

  • statements -> 0
  • commands -> 1
  • questions -> 2

We could use the same function as above; the categories could either share the same “embedding space” as the input or have their own.

For instance, the word “statement” would then map to the same value in both the input and the output. However, it's often easier [for organization purposes] to keep the input word embedding values separate and create a new mapping for the classification categories. This is mostly personal preference.
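
For example, a small sketch of a separate mapping for the categories (the exact label strings and the helper name encode_categories are illustrative):

```python
# Hypothetical category mapping, kept separate from the input word embedding
CATEGORY_MAPPING = {"statements": 0, "commands": 1, "questions": 2}


def encode_categories(categories):
    """Convert each category label (a string) into its integer value."""
    return [CATEGORY_MAPPING[category] for category in categories]
```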

Splitting the Data, Training vs Testing Sets

The next step after completing the word embedding(s) is splitting the dataset into a “training” set and a “testing” set. We withhold some data so we can verify the model (neural network) is not simply memorizing the dataset. The typical split is 80% of the data samples for training and 20% withheld for testing. The ratio of withheld data can be adjusted, but 20% is a common recommendation.

With that in mind, we can finally start putting this all together!

Emphasizing the point made in the prior article: the data first needs to be shuffled (very important). Once shuffled, we split it into correlated lists, such that the index of a comment in the comments list matches the index of the associated label in the categories list:
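
A minimal sketch, assuming each sample in dataset is a [comment, category] pair:

```python
import random

random.shuffle(dataset)  # shuffle first (very important)

# Correlated lists: comments[i] and categories[i] come from the same sample
comments = [sample[0] for sample in dataset]
categories = [sample[1] for sample in dataset]
```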

Once the data is shuffled and split into correlated lists, we can put the pieces together and create training and testing sets of encoded comments & labels (i.e. word-embedded vectors):
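
A minimal sketch of one way to do this, reusing the encoding helpers sketched above and the 80/20 ratio (the function name split_data is illustrative):

```python
def split_data(comments, categories, test_ratio=0.2):
    """Encode the correlated lists, then split them into training and testing sets."""
    encoded_comments, _ = encode_comments(comments)
    encoded_categories = encode_categories(categories)

    split_index = int(len(encoded_comments) * (1 - test_ratio))
    train_x = encoded_comments[:split_index]
    train_y = encoded_categories[:split_index]
    test_x = encoded_comments[split_index:]
    test_y = encoded_categories[split_index:]
    return train_x, train_y, test_x, test_y


train_x, train_y, test_x, test_y = split_data(comments, categories)
```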

That's it! After we call the above function, we should have the training and testing data samples & associated labels, and we can start training neural networks.

Up Next…

The next article will cover building a very basic model, essentially a keyword search, to classify the sentences. Hopefully our neural networks can handily beat it, but we need a baseline.

Full Guide:

  1. Acquiring & formatting data for deep learning applications
  2. Word embedding and data splitting
  3. Bag-of-words to classify sentence types (Dictionary) (Next Article)
  4. Classify sentences via a multilayer perceptron (MLP)
  5. Classify sentences via a recurrent neural network (LSTM)
  6. Convolutional neural networks to classify sentences (CNN)
  7. FastText for sentence classification (FastText)
  8. Hyperparameter tuning for sentence classification
