Hyperparameter Tuning for Sentence Classification

This is the eighth and final article in an eight-part series: a practical guide to applying neural networks to real-world problems.

Specifically, this is a problem we faced at Metacortex. We needed our bots to understand whether a question, statement, or command was being sent to them, with the goal of querying the institutional knowledge base to provide answers.

This article is specifically an introduction to hyperparameter tuning, utilizing the most performant model for sentence classification as an example.

Full Guide:

  1. Acquiring & formatting data for deep learning applications
  2. Word embedding and data splitting
  3. Bag-of-words to classify sentence types (Dictionary)
  4. Classify sentences via a multilayer perceptron (MLP)
  5. Classify sentences via a recurrent neural network (LSTM)
  6. Convolutional neural networks to classify sentences (CNN)
  7. FastText for sentence classification (FastText)
  8. Hyperparameter Tuning for Sentence Classification

What are Hyperparameters?

Before we get started, it’s important to define hyperparameters. In short:

Hyperparameters are the parameters fixed before the model starts training

In the case of basic statistical models, perhaps all of the parameters are hyperparameters. However, for neural networks there are often hundreds, thousands, or even millions of variables constantly changing (the weights). The hyperparameters are the knobs we as engineers / data scientists control to influence the output of our model(s).

A good summary of hyperparameters can be found on this answer on Quora:

Hyperparameters:

  • Define higher level concepts about the model such as complexity, or capacity to learn.
  • Cannot be learned directly from the data in the standard model training process and need to be predefined.
  • Can be decided by setting different values, training different models, and choosing the values that test better

In our case, some examples of hyperparameters include (see the sketch after this list for where they appear in code):

  • Epochs
  • Batch Size
  • Max Length of Input
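
To make the distinction concrete, here is a minimal sketch of where such hyperparameters appear in a Keras workflow. The values, the dummy data, and the toy model are illustrative only, not the settings or code used in this series.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, GlobalAveragePooling1D, Dense
from keras.preprocessing.sequence import pad_sequences

# Hyperparameters: fixed before training begins (values here are illustrative only)
max_words = 1000   # vocabulary size
maxlen = 100       # every sentence is padded / truncated to this length
batch_size = 64    # sentences per gradient update
epochs = 5         # full passes over the training data

# Dummy data standing in for the tokenized sentences from the earlier articles
x_train = pad_sequences(np.random.randint(1, max_words, size=(500, 120)), maxlen=maxlen)
y_train = np.random.randint(0, 2, size=(500,))

# A toy model; the weights it learns during fit() are parameters, not hyperparameters
model = Sequential([
    Embedding(max_words, 50, input_length=maxlen),
    GlobalAveragePooling1D(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# epochs and batch_size are passed in from outside; they are never learned
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)
```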

Why are hyperparameters important?

[Figures: one model trained with too many epochs, one with too few epochs, and one with just the right number]

First, hyperparameters can have a dramatic impact on accuracy. As an example, if you took any of the prior examples in this series and set the epoch count to one, the accuracy of the models would be dramatically reduced. Set the epochs parameter too high and the model will overtrain, and its accuracy on the test data will again drop dramatically.

Second, hyperparameters can impact model stability. In almost all of our deep learning models there is a significant amount of random noise, from dropout to the data selected for training / testing. This can cause models to collapse or fail to converge (i.e. fail to produce accurate results), even when seemingly nothing has changed. This is why I often run a model with a given configuration five to ten times to see the variance in the results.

Having an accurate model is always the goal, but when attempting to form a general solution, low variance between trainings is also desired.
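
As a rough sketch of that practice (build_and_train is a hypothetical helper, not code from this series, that builds a fresh model, trains it, and returns its test-set accuracy):

```python
import numpy as np

def run_repeated(build_and_train, n_runs=5):
    """Train the same configuration several times and report the spread.

    build_and_train is a hypothetical helper that builds a fresh model,
    trains it, and returns the test-set accuracy as a float.
    """
    accuracies = [build_and_train() for _ in range(n_runs)]
    print("mean: %.4f  std: %.4f  min: %.4f  max: %.4f" % (
        np.mean(accuracies), np.std(accuracies),
        np.min(accuracies), np.max(accuracies)))
    return accuracies
```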

Best Performing Model for Sentence Classification

Without hyperparameter tuning (i.e. attempting to find the best model parameters), the current performance of our models is as follows:

Model               Accuracy   Train Speed                Classification Speed
Dict                91%        Fastest                    Fastest
CNN                 97.80%     Fast (200 μs/step)         Very Fast (35 μs/step)
MLP                 95.5%      Very Fast (60 μs/step)     Very Fast (42 μs/step)
FastText (1-gram)   94.44%     Fast (83 μs/step)          Very Fast (26 μs/step)
FastText (2-gram)   95.59%     Fast (196 μs/step)         Very Fast (26 μs/step)
RNN (LSTM)          98.49%     Very Slow (7000 μs/step)   Very Slow (1000 μs/step)

Overall, the LSTM is slightly ahead in accuracy, but dramatically slower than the other methods. The CNN has the second-highest accuracy and is among the fastest models.

In other words, the convolutional neural network (CNN) is overall the most performant model. In terms of accuracy, it should be possible, with hyperparameter tuning, to improve it enough to beat the LSTM.

Hyperparameter Tuning the CNN

Certainly, the convolutional neural network (CNN) already provides the best overall performance (from our prior articles). Thus, it makes sense to focus our efforts on further improving its accuracy with hyperparameter tuning. Of course, there are a few different ways to accomplish this.

In fact, hyperparameter optimization is an open area of research that I have been somewhat involved with, and it is definitely worthy of its own series. As a result, I will not be covering the more advanced methods here, but will cover the basic steps.

The first step is to select which parameters you can optimize.

In our case:

  • max_words
  • maxlen
  • batch_size
  • embedding_dims
  • filters
  • kernel_size
  • hidden_dims
  • epochs

Then, fix any you don’t intend to optimize over. In our case, max_words and maxlen stay fixed, leaving:

  • batch_size
  • embedding_dims
  • filters
  • kernel_size
  • hidden_dims
  • epochs

Hyperparameter Search

After selecting which parameters to optimize, there are two approaches often used: grid search and random search. Neither is the best possible search strategy, but both are easy to implement.

In grid search, each parameter has a vector of values and we search the grid of possible combinations. Typically, most of the values are fixed and one of the vectors is iterated over at a time. For our CNN, the grid might look like this (a sketch of the sweep follows the list below):

  • batch_size: [ 32, 64, 128 ]
  • embedding_dims: [ 50, 75, 100 ]
  • filters: [ 50, 100, 150, 200, 250, 300, 350 ]
  • kernel_size: [ 3, 5, 7, 10 ]
  • hidden_dims: [ 50, 100, 150, 200, 250, 300, 350 ]
  • epochs: [ 3, 5, 7 ]
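
As a sketch of what sweeping that grid looks like in code (train_and_evaluate is a hypothetical helper, assumed to train the CNN with the given hyperparameters and return its test accuracy):

```python
from itertools import product

# The grid above; a full sweep tries every combination in a fixed order
grid = {
    "batch_size":     [32, 64, 128],
    "embedding_dims": [50, 75, 100],
    "filters":        [50, 100, 150, 200, 250, 300, 350],
    "kernel_size":    [3, 5, 7, 10],
    "hidden_dims":    [50, 100, 150, 200, 250, 300, 350],
    "epochs":         [3, 5, 7],
}

best_accuracy, best_params = 0.0, None
names = list(grid)
for values in product(*(grid[name] for name in names)):
    params = dict(zip(names, values))
    accuracy = train_and_evaluate(**params)  # hypothetical helper wrapping the CNN training code
    if accuracy > best_accuracy:
        best_accuracy, best_params = accuracy, params

print("best accuracy %.4f with %s" % (best_accuracy, best_params))
```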

In random search, each parameter has a range and is ideally treated as a continuous variable, i.e. sampled without a step size unless one is required (in which case the step is typically the smallest possible value, i.e. 1). Values are then drawn at random from each range. In our case, all the parameters are integers, so the “random” nature is rather limited (a sketch of the sampling follows the list below):

  • batch_size: [ 32 – 128 ]
  • embedding_dims: [ 50 – 100 ]
  • filters: [ 50 – 350 ]
  • kernel_size: [ 3 – 10 ]
  • hidden_dims: [ 50 – 350 ]
  • epochs: [ 3 – 7 ]
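
A sketch of that sampling, reusing the same hypothetical train_and_evaluate helper as in the grid search sketch:

```python
import random

# The ranges above; each trial draws an integer uniformly at random from each range
ranges = {
    "batch_size":     (32, 128),
    "embedding_dims": (50, 100),
    "filters":        (50, 350),
    "kernel_size":    (3, 10),
    "hidden_dims":    (50, 350),
    "epochs":         (3, 7),
}

best_accuracy, best_params = 0.0, None
for _ in range(50):  # number of trials is whatever your time budget allows
    params = {name: random.randint(low, high) for name, (low, high) in ranges.items()}
    accuracy = train_and_evaluate(**params)  # hypothetical helper, as in the grid search sketch
    if accuracy > best_accuracy:
        best_accuracy, best_params = accuracy, params

print("best accuracy %.4f with %s" % (best_accuracy, best_params))
```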

Of the two, random search is recommended, as it’s more likely to find a better set of parameters faster. This is important because these models can often take days to train and the search may be stopped early. Grid search is typically implemented as nested for loops through each array in order, which means some parameters are never even adjusted if an early stop occurs.

Hyperparameter Results

In terms of results, I ran the search for an arbitrary number of configurations, repeating each configuration five times and averaging the results.

The top five results (the full results are in the GitHub repo):

Accuracy   Speed        Batch Size   Embedding Dims   Filters   Kernel   Hidden Dims   Epochs
99.40%     26 μs/step   64           75               100       5        350           7
99.36%     40 μs/step   64           50               250       10       150           5
99.33%     25 μs/step   64           75               75        5        350           5
99.31%     59 μs/step   64           100              350       5        300           3
99.29%     25 μs/step   64           50               100       7        350           5

Woo! Finally, we broke 99% accuracy in sentence type classification, with a classification speed matching the fastest performing model (FastText).

Consequently, the CNN is now clearly the best model and meets our >99% accuracy goal, “solving” our sentence type classification problem.

Model               Accuracy   Train Speed                Classification Speed
Dict                85%        Fastest                    Fastest
CNN                 99.40%     Fast (200 μs/step)         Very Fast (26 μs/step)
MLP                 96.5%      Very Fast (60 μs/step)     Very Fast (42 μs/step)
FastText (1-gram)   94.40%     Fast (117 μs/step)         Very Fast (26 μs/step)
FastText (2-gram)   95.59%     Fast (196 μs/step)         Very Fast (26 μs/step)
RNN (LSTM)          98.49%     Very Slow (7000 μs/step)   Very Slow (1000 μs/step)

Most importantly, hyperparameter tuning was minimal work.

Setting up the tuning only requires a few lines of code; then go get some coffee, go to bed, etc. When you come back, the model will have improved accuracy. In this case, the tuning cut classification time by 50% and increased classification accuracy by 2%!

Clearly, a very large return on investment.

Thus, hyperparameter tuning is always recommended, especially when using neural networks, as they can be very sensitive to the input parameters.

Saving a Model (and Word Embeddings) to Disk

Most importantly, we also need to save the most accurate models for later use!

As a result, I have added an example of saving a model to the GitHub repo. It’s important to note the word embeddings must also be exported and imported; otherwise the model will have a different mapping for the words, and its results will be no better than random.

In terms of saving the model, Keras (2.2.4) makes this easy:
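
The full version lives in the repo script referenced below; as a rough sketch (assuming model is the trained CNN from the earlier article, and the file name is illustrative):

```python
from keras.models import load_model

# Export: writes the architecture, weights, and optimizer state to a single HDF5 file
model.save("sentence_cnn.h5")

# Import: rebuilds the compiled model from that file
model = load_model("sentence_cnn.h5")
```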

That’s it: the code above will export and import the model, and the full version is in the script sentence_cnn_model_saving.py in the GitHub repo.

However, that’s only half of the required data. We also need to export and import the embeddings (which match the model).

It is also not difficult, provided your word embeddings are in a dictionary mapping of the form:

{word => embedding }

After that, the code below should be able to import or export the embeddings as a JSON object:
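
A minimal sketch of that export/import, assuming the embedding values are numpy vectors (the helper names are illustrative):

```python
import json
import numpy as np

def save_embeddings(embeddings, path):
    """Export a {word => embedding} dictionary to JSON.

    Numpy vectors are converted to plain lists so they can be serialized.
    """
    serializable = {word: np.asarray(vector).tolist() for word, vector in embeddings.items()}
    with open(path, "w") as f:
        json.dump(serializable, f)

def load_embeddings(path):
    """Import the {word => embedding} dictionary back from JSON."""
    with open(path) as f:
        return {word: np.asarray(vector) for word, vector in json.load(f).items()}
```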

Closing Remarks

Certainly, there is a ton of content related to neural networks on the internet, and I hope you found my take insightful. I do this for a living and wanted to share the bare bones of what I do. Hopefully, you can start using neural networks yourself.

Please, let me know if you have any questions or suggestions. I’m happy to assist and always looking to improve!

Full Guide:

  1. Acquiring & formatting data for deep learning applications
  2. Word embedding and data splitting
  3. Bag-of-words to classify sentence types (Dictionary)
  4. Classify sentences via a multilayer perceptron (MLP)
  5. Classify sentences via a recurrent neural network (LSTM)
  6. Convolutional neural networks to classify sentences (CNN)
  7. FastText for sentence classification (FastText)
  8. Hyperparameter Tuning for Sentence Classification
