This is the third article in an eight-part series on a practical guide to using neural networks, applied to real-world problems.
This post is specifically about developing a naive solution using keyword search (bag-of-words) to classify sentence types.
Full Guide:
- Acquiring & formatting data for deep learning applications
- Word embedding and data splitting
- Bag-of-words to classify sentence types (Dictionary)
- Classify sentences via a multilayer perceptron (MLP)
- Classify sentences via a recurrent neural network (LSTM)
- Convolutional neural networks to classify sentences (CNN)
- FastText for sentence classification (FastText)
- Hyperparameter tuning for sentence classification
Fastest Potential Solution
Our goal in this article is to build the simplest and fastest possible solution for classifying sentence types.
This accomplishes two things:
- It creates a baseline against which we can compare future models
- It ensures we don’t unnecessarily build complex models (i.e. if a simple approach works, use it)
Often, for text-based problems, the simplest solution is to take a bucket of words associated with a given classification and check whether any of that “bucket of words” appears in the comment.
For our case, there are actually some clear giveaways, i.e. words to search for in each data sample / comment:
Questions: who, what, where, when, why, how, which, can, ?
Commands: please, don’t, shut, fold, open, close, mix, turn, pour, fill, put, add, chop, slice, serve, spread, get, heat, grill, hold, swim, swing, listen, pick, take, fetch, roll, jump, stand, crouch, hide, crack, write, use, order, draw, paint, set, eat, drink, stick, cook, bring, sit, stop, play, buy, shop, explain, tidy, move, switch, improve, behave, sort, go, fly, flip
It is also possible to go a bit further and inspect the data. For instance, we could list the most frequently used words for each sentence classification and use those. In practice, though, that list is dominated by “stop words”, i.e. words so common that false classifications become very likely.
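As a quick illustration of why that is, a per-category frequency count surfaces those stop words immediately. The snippet below is a minimal sketch, assuming the labelled comments are available as (text, label) pairs; the variable names and the inline data are illustrative, not the series’ actual dataset:

```python
from collections import Counter

# Illustrative stand-in data; in the real pipeline these pairs come from the
# labelled dataset built in the earlier articles.
labelled_comments = [
    ("how do i fold the towels", "question"),
    ("please fold the towels", "command"),
    ("the towels are on the shelf", "statement"),
]

# Tally word frequencies separately for each category.
counts_per_label = {}
for text, label in labelled_comments:
    counts_per_label.setdefault(label, Counter()).update(text.lower().split())

# The top of each list is dominated by stop words ("the", "are", ...),
# which is why a raw most-common list makes a poor keyword bucket.
for label, counter in counts_per_label.items():
    print(label, counter.most_common(5))
```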
Bag of Words
Because our case has a fairly predictable list of keywords, we should start with the simplest solution: keyword search.
The algorithm for this is basic and is referred to as bag-of-words:
- Obtain comments & associated categorical label(s)
- Form a Python dictionary of keywords for each category
- Search each comment for keywords
- If a keyword is found, apply the associated label
- If no keyword is found, label as “statement”, i.e. the base category
The Python (3.6+) code for this is below:
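A minimal sketch of that keyword search, assuming the keyword buckets listed earlier; the classify helper and the inline sample sentences are illustrative rather than the exact production script:

```python
# Bag-of-words (keyword search) classifier: a minimal sketch.
# "?" is handled separately because it is punctuation, not a word token.
QUESTION_WORDS = {
    "who", "what", "where", "when", "why", "how", "which", "can",
}
COMMAND_WORDS = {
    "please", "don't", "shut", "fold", "open", "close", "mix", "turn", "pour",
    "fill", "put", "add", "chop", "slice", "serve", "spread", "get", "heat",
    "grill", "hold", "swim", "swing", "listen", "pick", "take", "fetch", "roll",
    "jump", "stand", "crouch", "hide", "crack", "write", "use", "order", "draw",
    "paint", "set", "eat", "drink", "stick", "cook", "bring", "sit", "stop",
    "play", "buy", "shop", "explain", "tidy", "move", "switch", "improve",
    "behave", "sort", "go", "fly", "flip",
}


def classify(comment: str) -> str:
    """Label a comment as question, command, or statement via keyword search."""
    # Split "?" off the preceding word so it can be matched like any keyword.
    tokens = set(comment.lower().replace("?", " ?").split())
    if "?" in tokens or tokens & QUESTION_WORDS:
        return "question"
    if tokens & COMMAND_WORDS:
        return "command"
    # No keyword found: fall back to the base category.
    return "statement"


if __name__ == "__main__":
    for sample in ("Where is the spatula?", "Fold the towels", "The oven is hot"):
        print(f"{sample!r} -> {classify(sample)}")
```

Questions are checked before commands here; that ordering is an arbitrary but convenient tie-break when a comment contains keywords from both buckets.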
The output of our attempted solution (ratio of correct classifications):
Questions: 0.91
Statements: 0.75
Commands: 0.38
Accuracy (overall): 0.85
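As a reference for how those ratios are computed, the sketch below scores predictions against the true labels. It assumes the gold labels and the classifier’s predictions sit in parallel lists; the function and variable names are illustrative:

```python
from collections import defaultdict


def score(true_labels, predicted_labels):
    """Return (per-class accuracy, overall accuracy) for parallel label lists."""
    correct, total = defaultdict(int), defaultdict(int)
    for true, pred in zip(true_labels, predicted_labels):
        total[true] += 1
        if pred == true:
            correct[true] += 1
    per_class = {label: correct[label] / total[label] for label in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_class, overall


# Example usage with toy labels:
print(score(["question", "command", "statement", "statement"],
            ["question", "statement", "statement", "statement"]))
```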
This means our algorithm labels a comment correctly 85% of the time. Honestly, maybe that is good enough!
What’s Good Enough?
One thing I always do before sitting down to solve a problem is to set clearly defined goals. They could be customer goals, my goals, team goals, etc. Without a goal we don’t know what “success” is, so let’s define it for our use case:
We need our bot(s) to understand when a question, statement, or command is sent to them
We need near 100% accuracy. This ensures our bot doesn’t break immersion, because we as humans really cannot tolerate failure.
With that in mind, 85% accuracy at classifying sentences is fairly good. However, it would definitely break immersion for our users, so it’s nowhere near good enough. The next question is:
Should we continue to try to improve the bag of words method?
Short answer: no.
The purpose of this model is to test how close we can get with the bare minimum of effort.
We succeeded in that goal and determined that a basic word search solves our problem 85% of the time. This indicates that any kind of machine learning model should have a very high probability of success, potentially pushing toward 99+% accuracy (which is what we are interested in).
Up Next…
The next article will cover our first step into neural network development using a multilayer perceptron (MLP), one of the more basic neural network designs.
Full Guide:
- Acquiring & formatting data for deep learning applications
- Word embedding and data splitting
- Naive solution to tackle classifying sentence types (Dictionary)
- Classify sentences via a multilayer perceptron (MLP) (Next Article)
- Classify sentences via a recurrent neural network (LSTM)
- Convolutional neural networks to classify sentences (CNN)
- FastText for sentence classification (FastText)
- Hyperparameter tuning for sentence classification