FastText for Sentence Classification

This is the seventh article in an eight-part series: a practical guide to applying neural networks to real-world problems.

Specifically, this is a problem we faced at Metacortex, where we needed our bots to understand whether a question, statement, or command was being sent to them. The goal is to query the institutional knowledge base to provide answers.

This article covers applying the FastText algorithm to sentence type classification (statement, question, command).

Full Guide:

  1. Acquiring & formatting data for deep learning applications
  2. Word embedding and data splitting
  3. Bag-of-words to classify sentence types (Dictionary)
  4. Classify sentences via a multilayer perceptron (MLP)
  5. Classify sentences via a recurrent neural network (LSTM)
  6. Convolutional neural networks to classify sentences (CNN)
  7. FastText for Sentence Classification (FastText)
  8. Hyperparameter tuning for sentence classification

Introduction to FastText

FastText is an algorithm developed by Facebook Research, designed to extend word2vec (word embedding) to use n-grams. This improves the accuracy of NLP-related tasks while maintaining speed. An n-gram is a contiguous sequence of N tokens treated as a single feature, which gives the model a sense of local word order. This provides context to the input, similar to the way the RNN captures the time-series aspect and the CNN encodes the spatial aspect of the data.

For example:

Sentence: This is a test phrase.

1-Gram (Unigram):  [ This, is, a, test, phrase, . ]

2-Gram (Bigram): [ This is, is a, a test, test phrase, phrase . ]

3-Gram (Trigram): [ This is a, is a test, a test phrase, test phrase . ]
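To make the sliding window concrete, here is a minimal sketch that reproduces the lists above (the word_ngrams helper is purely illustrative, not part of the guide's code):

# Minimal sketch: word-level n-grams from a pre-tokenized sentence
# (word_ngrams is a hypothetical helper, for illustration only).
def word_ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["This", "is", "a", "test", "phrase", "."]
print(word_ngrams(tokens, 1))  # ['This', 'is', 'a', 'test', 'phrase', '.']
print(word_ngrams(tokens, 2))  # ['This is', 'is a', 'a test', 'test phrase', 'phrase .']
print(word_ngrams(tokens, 3))  # ['This is a', 'is a test', 'a test phrase', 'test phrase .']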

To clarify, we are using word embeddings as opposed to character embeddings.

Character embeddings for the above would look something like this:

Sentence: This is a test phrase.

3-Gram (Trigram): [ Thi, his, is␣, s␣i, ␣is, is␣, s␣a, ␣a␣, a␣t, ␣te, tes, est, st␣, t␣p, ␣ph, phr, hra, ras, ase, se. ]  (␣ denotes a space)

Using character n-grams can create a more robust network, as partial components of words are often shared between words. However, in our case we use a combination of words and punctuation, leaving out parts-of-speech and not using characters.
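The same sliding window works at the character level; a quick sketch (again, char_ngrams is just an illustrative helper), where spaces count as characters:

# Minimal sketch: character-level n-grams slide over the raw string, spaces included.
def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("This is a test phrase.", 3))
# ['Thi', 'his', 'is ', 's i', ' is', 'is ', 's a', ' a ', 'a t', ' te',
#  'tes', 'est', 'st ', 't p', ' ph', 'phr', 'hra', 'ras', 'ase', 'se.']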

After the n-grams have been created, the features are averaged (pooling) and sent to the hidden layer. Then we apply a softmax activation function to the output, and that's it!

In terms of architecture, it's relatively simple: the only change from a standard multilayer perceptron (MLP) is using the n-grams as the input and average pooling for the next layer.

For further reading, I recommend the original publication: Bag of Tricks for Efficient Text Classification

Classify Sentence Types with FastText

With an understanding of what FastText is, here is our configuration:

  • Input Layer (I)
    • Size: Max words (set to 10000 unique words + punctuation)
    • n-grams (bigrams)
  • Hidden Layer(s) (H)
    • Average pooling
  • Output Layer (O)
    • Size: 3 (sentence types: statement, question, command)
    • Activation Function: Softmax
  • Optimizer: Adaptive Moment Estimation (Adam)
    • Loss Function: Categorical Cross-Entropy
    • Metric: Accuracy

In terms of Python (3.6+) & Keras (2.2.4) code:

import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, GlobalAveragePooling1D

model = Sequential()
# Create the Embedding (input) layer (max_words) --> pooling layer
model.add(Embedding(max_words, embedding_dims, input_length=maxlen))
# Create the average pooling layer
model.add(GlobalAveragePooling1D())
# Create the output layer (num_classes)
model.add(Dense(num_classes, activation='softmax'))
# Add the optimization method, loss function and metric
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
# "Fit the model" (train the model), using the training data (80% of the dataset)
model.fit(x_train, y_train, batch_size=batch_size,
          epochs=epochs, validation_data=(x_test, y_test))
# Evaluate the trained model, using the test data (20% of the dataset)
score = model.evaluate(x_test, y_test, batch_size=batch_size)
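As a quick sanity check (not part of the original listing), model.summary() should show the three-stage flow; assuming maxlen = 500, embedding_dims = 50 and num_classes = 3 from the hyperparameters below, the output shapes are:

# Sanity check on the architecture (shapes assume maxlen=500, embedding_dims=50, num_classes=3)
model.summary()
# Embedding              -> (None, 500, 50)   each n-gram id mapped to a 50-d vector
# GlobalAveragePooling1D -> (None, 50)        average of the embedded sequence
# Dense (softmax)        -> (None, 3)         statement / question / command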

Data Input Formatting

The key to FastText is the n-gram creation, so, as you may have guessed, quite a bit of data formatting is required. Luckily, the idea behind n-grams is fairly well known and is even used in common databases such as PostgreSQL (which has built-in trigram searching).

In terms of the n-gram creation, I ended up using the examples from the Keras repository:

"""
Taken from: https://github.com/keras-team/keras
Based on Joulin et al's paper:
Bags of Tricks for Efficient Text Classification
https://arxiv.org/abs/1607.01759
"""
def create_ngram_set(input_list, ngram_value=2):
"""
Extract a set of n-grams from a list of integers.
>>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=2)
{(4, 9), (4, 1), (1, 4), (9, 4)}
"""
return set(zip(*[input_list[i:] for i in range(ngram_value)]))
def add_ngram(sequences, token_indice, ngram_range=2):
"""
Augment the input list of list (sequences) by appending n-grams values (example: bi-gram)
>>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]]
>>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017}
>>> add_ngram(sequences, token_indice, ngram_range=2)
[[1, 3, 4, 5, 1337, 2017], [1, 3, 7, 9, 2, 1337, 42]]
"""
new_sequences = []
for input_list in sequences:
new_list = input_list[:]
for ngram_value in range(2, ngram_range + 1):
for i in range(len(new_list) - ngram_value + 1):
ngram = tuple(new_list[i:i + ngram_value])
if ngram in token_indice:
new_list.append(token_indice[ngram])
new_sequences.append(new_list)
return new_sequences
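To tie these helpers back to the earlier example, here is a toy run (the integer ids are made up purely for illustration):

# Toy illustration: "This is a test phrase ." encoded word-by-word with arbitrary ids
encoded = [12, 7, 3, 48, 91, 2]
bigrams = create_ngram_set(encoded, ngram_value=2)
# bigrams contains (12, 7), (7, 3), (3, 48), (48, 91), (91, 2)
token_indice = {bigram: 10000 + i for i, bigram in enumerate(bigrams)}
augmented = add_ngram([encoded], token_indice, ngram_range=2)
# augmented[0] is the original six ids followed by the five bigram ids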

With the n-gram creation functions in place, we can create the n-grams for the training / testing data. We first convert the output values to categories, then convert the embedded input words + punctuation into n-grams (bi-grams in our case). Once we have the bi-grams, we pad the sequences to a consistent length and we are good to go:

import numpy as np

max_words, batch_size, maxlen, epochs, ngram_range = 10000, 32, 500, 5, 2

# Determine the number of categories + default (i.e. sentence types)
num_classes = np.max(y_train) + 1

# Vectorize the output sentence type classifications to a Keras-readable format
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

if ngram_range > 1:
    # Create a set of unique n-grams from the training set
    ngram_set = set()
    for input_list in x_train:
        for i in range(2, ngram_range + 1):
            set_of_ngram = create_ngram_set(input_list, ngram_value=i)
            ngram_set.update(set_of_ngram)

    # Dictionary mapping n-gram tokens to unique integers.
    # Integer values are greater than max_words in order
    # to avoid collisions with existing features.
    start_index = max_words + 1
    token_indice = {v: k + start_index for k, v in enumerate(ngram_set)}
    indice_token = {token_indice[k]: k for k in token_indice}

    # max_words is now the highest integer that could be found in the dataset
    max_words = np.max(list(indice_token.keys())) + 1

    # Augment x_train and x_test with the n-gram features
    x_train = add_ngram(x_train, token_indice, ngram_range)
    x_test = add_ngram(x_test, token_indice, ngram_range)

# Pad the input vectors to ensure a consistent length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

There are a couple of caveats with FastText at this point: compared to the other models, it is relatively memory intensive. This is especially true as the n-gram size increases, which increases memory usage significantly (going from uni-grams to bi-grams roughly doubles the memory). In addition, padding is dictated by maxlen and should be as short as practical; a high amount of padding will dramatically slow down training and classification and increase the memory requirements.
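One way to keep maxlen as small as practical is to measure the augmented sequence lengths before padding and pad to a high percentile rather than the worst case. A sketch (this would run just before the pad_sequences calls above; the 95th percentile is an arbitrary choice):

import numpy as np

# Measure sequence lengths after the n-gram augmentation, before padding
lengths = [len(seq) for seq in x_train]
print("mean length:", np.mean(lengths))
print("95th percentile:", np.percentile(lengths, 95))
# Padding to roughly the 95th percentile keeps most sequences intact
# while avoiding the memory and speed cost of padding to the maximum.
maxlen = int(np.percentile(lengths, 95))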

FastText to Classify Sentence Types

With data formatting complete, we can assemble the final components.

We will be using the following hyperparameters:

  • ngram_range = 2
  • max_words = 10000
  • maxlen = 500
  • batch_size = 32
  • embedding_dims = 50
  • epochs = 5

Putting the data formatting and the model together:

import numpy as np
import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, GlobalAveragePooling1D

max_words, batch_size, maxlen, epochs, ngram_range, embedding_dims = 10000, 32, 500, 5, 2, 50

# Determine the number of categories + default (i.e. sentence types)
num_classes = np.max(y_train) + 1

# Vectorize the output sentence type classifications to a Keras-readable format
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

if ngram_range > 1:
    # Create a set of unique n-grams from the training set
    ngram_set = set()
    for input_list in x_train:
        for i in range(2, ngram_range + 1):
            set_of_ngram = create_ngram_set(input_list, ngram_value=i)
            ngram_set.update(set_of_ngram)

    # Dictionary mapping n-gram tokens to unique integers.
    # Integer values are greater than max_words in order
    # to avoid collisions with existing features.
    start_index = max_words + 1
    token_indice = {v: k + start_index for k, v in enumerate(ngram_set)}
    indice_token = {token_indice[k]: k for k in token_indice}

    # max_words is now the highest integer that could be found in the dataset
    max_words = np.max(list(indice_token.keys())) + 1

    # Augment x_train and x_test with the n-gram features
    x_train = add_ngram(x_train, token_indice, ngram_range)
    x_test = add_ngram(x_test, token_indice, ngram_range)

# Pad the input vectors to ensure a consistent length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
# Create the Embedding (input) layer (max_words) --> pooling layer
model.add(Embedding(max_words, embedding_dims, input_length=maxlen))
# Create the average pooling layer
model.add(GlobalAveragePooling1D())
# Create the output layer (num_classes)
model.add(Dense(num_classes, activation='softmax'))
# Add the optimization method, loss function and metric
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
# "Fit the model" (train the model), using the training data (80% of the dataset)
model.fit(x_train, y_train, batch_size=batch_size,
          epochs=epochs, validation_data=(x_test, y_test))
# Evaluate the trained model, using the test data (20% of the dataset)
score = model.evaluate(x_test, y_test, batch_size=batch_size)

If we run the code, along with our testing data (which you can do from the GitHub repo):

The FastText model accurately classifies ~95.59% of sentence types on the withheld test dataset.

Overall, that’s:

  • A 3% reduction in accuracy of classification compared with the RNN
  • A 2% reduction in accuracy of classification compared with CNN
  • A 1% reduction in accuracy of classification compared with MLP
  • A 10% improvement in accuracy, over our baseline keyword search solution

What’s also important is speed, mostly of classification, but also of training. I have the following computer configuration:

  • RAM – 32 GB DDR4 (2400)
  • CPU – AMD Ryzen 7 1800x eight-core processor × 16
  • GPU – GTX 1080
  • OS – Arch Linux

Speed (per step):

The model can train at 196 μs/step and classify at 26 μs/step (μs = microsecond).
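To reproduce a rough per-sample classification time on your own hardware, something like the following works (a sketch, not part of the original code):

import time

# Time a full pass over the test set and divide by the number of samples
start = time.perf_counter()
model.predict(x_test, batch_size=batch_size)
elapsed = time.perf_counter() - start
print("~{:.1f} us per sample".format(1e6 * elapsed / len(x_test)))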

In comparison, FastText is:

  • 38x faster than the RNN
  • 2x faster than MLP
  • 1.3x faster than CNN

That makes FastText the fastest of the neural network models; the dictionary check is still faster overall, but with horrible accuracy.

Up Next…

The next article in the series is a comparison of the models and a brief introduction to hyperparameter tuning.

Full Guide:

  1. Acquiring & formatting data for deep learning applications
  2. Word embedding and data splitting
  3. Bag-of-words to classify sentence types (Dictionary)
  4. Classify sentences via a multilayer perceptron (MLP)
  5. Classify sentences via a recurrent neural network (LSTM)
  6. Convolutional neural networks to classify sentences (CNN)
  7. FastText for Sentence Classification (FastText)
  8. Hyperparameter tuning for sentence classification (Next Article)

3 thoughts on “FastText for Sentence Classification”

  1. Hi Austin,
    Nice post! I have a question. Usually, when I have implemented a text classifier before, as a first step I've trained an embedding algorithm on a corpus (w2v, ft, glove… whatever), then passed it to the Keras embedding layer as a weights matrix and set the trainable parameter to False, in order to leave the embedding weights as they were learned by the native algorithm.
    In this case no weights are passed as an argument, so the layer takes a random initialization. Is this also the way Facebook Research does it in its own FastText classifier?
    I hesitated about the n-gram meaning, because in unsupervised FastText it denotes subword chunks, but as I can see both in your post and on the official site, when talking about the classifier an n-gram denotes a group of words.
    Does that mean the FastText classifier does not take the subword level into account?

    Hope you can clarify these doubts for me.

    Thanks in advance!

    1. This FastText model does not take subwords into account, as it uses word-level embeddings.

      In this case, we are creating our own encoding, such that each word is represented by a single value. This would make it impossible to use n-grams on subparts of words[1]. However, n-grams can refer to n words, n characters, or even n sentences. For this guide, the “gram” represents “words and/or punctuation”. You could convert the code to one-hot encode at the character level as well, and then the “gram” in “n-grams” would represent n characters.

      [1] https://austingwalters.com/word-embedding-and-data-splitting/
