The bag-of-words model is a simple way to represent a document in numerical form so that it can be fed into a machine learning algorithm. Machine learning algorithms can’t operate on raw text, so for any natural language processing task, we first need to convert the text into some sort of numerical representation. This process is also known as embedding the text.
What Is the Bag-of-Words Model in NLP?
The bag-of-words model is a simple way to convert text to a numerical representation by conceptualizing a document as a “bag” of words and noting the frequency of each word. Documents can then be embedded and fed into machine learning algorithms.
What Is a Bag-of-Words Model?
A bag-of-words model is a simple document embedding technique based on word frequency. Through this approach, a model conceptualizes text as a bag of words and tracks the frequency of each word. These frequencies are then assembled into a numerical vector, which machine learning algorithms can process and use to extract features from the text.
How Bag-of-Words Models Work
Conceptually, we think of the whole document as a “bag” of words, rather than a sequence. We represent the document simply by the frequency of each word. For example, if we have a vocabulary of 1,000 words, then the whole document will be represented by a 1,000-dimensional vector, where the vector’s ith entry represents the frequency of the ith vocabulary word in the document.
Using this technique, we can embed a whole set of documents and feed them into a variety of different machine learning algorithms. Since this embedding is so basic, it doesn’t work very well for complex tasks, but it does work well for simple classification problems, and its simplicity and ease of use make it an attractive choice.
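As a quick illustration of the idea, Python’s built-in collections.Counter performs exactly this kind of frequency counting (the full, vector-based implementation appears later in this article):
import collections
# Treat the document as an unordered collection of words and count each one
doc = "the cat sat on the mat".split()
print(collections.Counter(doc))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})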
TF-IDF
A key issue with bag-of-words is that simply tracking the frequency of words can lead to meaningless words — words like “a,” “some” and “the” — gaining too much influence over a model. That’s where Term Frequency-Inverse Document Frequency (TF-IDF) comes into play. This approach consists of two components:
- Term Frequency: Notes the frequency of a word in one document.
- Inverse Document Frequency: Notes the rareness of a word across all documents and downplays words that occur frequently across all documents.
In short, TF-IDF reduces the importance of a word the more widely it appears across all documents. This prevents empty words from accidentally swaying a model, making TF-IDF a more focused variation of bag-of-words.
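Here’s a minimal sketch of how TF-IDF could be computed, assuming the documents have already been tokenized into lists of words and a vocabulary list is available. The smoothed IDF formula below is just one common variant; libraries differ in the exact weighting they use:
import math

def get_tf_idf(docs, vocab):
    # docs: list of documents, each a list of words; vocab: list of vocab words
    n_docs = len(docs)
    # Document frequency: the number of documents containing each vocabulary word
    doc_freq = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    # Smoothed inverse document frequency (rarer words get larger weights)
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    # Each document becomes a vector of term frequency times IDF
    return [[doc.count(w) * idf[w] for w in vocab] for doc in docs]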
Bag-of-Words Steps With Example
Let’s suppose our documents have a small vocabulary. For instance, Dr. Seuss’ book Green Eggs and Ham has only fifty unique words. In alphabetical order, they are: a, am, and, anywhere, are, be, boat, box, car, could, dark, do, eat, eggs, fox, goat, good, green, ham, here, house, I, if, in, let, like, may, me, mouse, not, on, or, rain, Sam, say, see, so, thank, that, the, them, there, they, train, tree, try, will, with, would and you.
If we treat each page of the book as a single document, we can embed each of them as a 50-dimensional vector. Consider the page that reads:
I would not like them here or there.
I would not like them anywhere.
I do not like green eggs and ham.
I do not like them, Sam-I-am.
The first step is to count the frequency of each vocabulary word. “am” appears one time, “and” one time, “anywhere” one time, “do” two times, “eggs” one time, “green” one time, “ham” one time, “here” one time, “I” five times, “like” four times, “not” four times, “or” one time, “Sam” one time, “them” three times, “there” one time, and “would” two times. Every other vocabulary word appears zero times.
To turn this into a 50-dimensional vector, we set the ith entry equal to the frequency of the ith vocabulary word. For instance, “am” is the second vocabulary word and appears once in the document, so the second entry will be one. On the other hand, “a” is the first vocabulary word, and it doesn’t appear at all here, so the first entry will be zero. The whole page in vector form becomes: [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 1, 0, 5, 0, 0, 0, 4, 0, 0, 0, 4, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 2, 0].
Now let’s see how we can get Python to do all this work for us automatically.
Implementing Bag-of-Words in Python
The bag-of-words model is simple to implement in Python. See the following code:
# Assumes that 'doc' is a list of strings and 'vocab' is some iterable of vocab
# words (e.g., a list or set)
def get_bag_of_words(doc, vocab):
    # Create initial dictionary which maps each vocabulary word to a count of 0
    word_count_dict = dict.fromkeys(vocab, 0)
    # For each word in the doc, increment its count (ignoring out-of-vocabulary words)
    for word in doc:
        if word in word_count_dict:
            word_count_dict[word] += 1
    # Now, initialize the vector to a list of zeros
    bag = [0] * len(vocab)
    # For every vocab word, set its index equal to its count
    for i, word in enumerate(vocab):
        bag[i] = word_count_dict[word]
    return bag
As the first comment indicates, this code assumes we’ve already managed to represent our document as a list of separated strings. If it’s given to us as one big string instead, we’ll have to do a bit of preprocessing first. Let’s see how we’d do this with the Dr. Seuss example:
import re
# Define the vocabulary
vocab = ['a', 'am', 'and', 'anywhere', 'are', 'be', 'boat', 'box', 'car',
         'could', 'dark', 'do', 'eat', 'eggs', 'fox', 'goat', 'good', 'green',
         'ham', 'here', 'house', 'i', 'if', 'in', 'let', 'like', 'may', 'me',
         'mouse', 'not', 'on', 'or', 'rain', 'sam', 'say', 'see', 'so', 'thank',
         'that', 'the', 'them', 'there', 'they', 'train', 'tree', 'try', 'will',
         'with', 'would', 'you']
# Define the document
doc = ("I would not like them here or there.\n"
       "I would not like them anywhere.\n"
       "I do not like green eggs and ham.\n"
       "I do not like them, Sam-I-am.")
# Convert to lowercase
doc = doc.lower()
# Split on all non-word characters (i.e., whitespace and punctuation)
doc = re.split(r"\W", doc)
# Drop empty strings that arise from splitting
doc = [s for s in doc if len(s) > 0]
bag_of_words = get_bag_of_words(doc, vocab)
We use Python’s regular expression library here to split the document into just words; the pattern “\W” matches any non-word character. Splitting this way can produce empty strings (for example, between a period and a newline), which is why we drop them afterward.
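If everything worked, printing the result reproduces the vector we computed by hand earlier:
print(bag_of_words)
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 1, 0, 5, 0, 0, 0,
#  4, 0, 0, 0, 4, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 2, 0]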
Benefits of Bag-of-Words Models
The primary benefits of this model are its simplicity and efficiency; this method is easy to implement and fast to run. Other embedding methods often require special domain knowledge or extensive pretraining.
Consider an example from section 5.2.1 of the textbook Speech and Language Processing that covers sentiment analysis, which involves deciding whether a document expresses a positive or negative attitude. Each document is represented by a feature vector made up of specialized, hand-engineered features, such as the number of positive words in the document (as defined by some preexisting lexicon of positive words) and the number of first- and second-person pronouns. This kind of representation only works for sentiment analysis.
At the other end of the spectrum, there are representations like word2vec. With this method, a neural network learns its own embedding for each word. This doesn’t require any hand engineering, but it does need a decent amount of data and computing power.
The bag-of-words model avoids both of these difficulties. There’s no manual feature engineering or expensive pretraining. It basically works right out of the box. The drawback is that it only works well on fairly simple tasks that don’t depend on understanding the surrounding context of words.
Applications of Bag-of-Words Models
The bag-of-words model is typically used to embed documents to train a classifier. Classification is a machine learning task that categorizes a document as belonging to one of multiple types. The model doesn’t produce features that are particularly useful for other tasks like question answering or summarization, because those tasks require a more semantic understanding of the document and must account for context.
For many classification tasks, on the other hand, the mere presence and frequency of certain words is strongly indicative of which category a document belongs to. Some common uses of the bag-of-words method include:
- Spam filtering: We can often decide whether an email is spam or not by the frequency of certain key phrases, such as “act now” and “urgent reply.”
- Sentiment analysis: A similar situation, with terms like “boring” and “awful” clearly suggesting a negative tone, while “beautiful” and “spectacular” are clearly positive.
- Language identification: If your vocabulary includes words drawn from many languages, it’s pretty easy to see when all the words in one document come from a particular language without a deep contextual understanding of the text itself.
Once your documents have been embedded, you’ll want to feed them into a classification algorithm. Essentially, any algorithm you’re familiar with should work. If you’re looking for a simple place to start, you might try the naive Bayes classifier, logistic regression, or decision trees and random forests. All are relatively easy to implement and understand compared to more complex neural network solutions.
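As a minimal sketch of that last step, assuming scikit-learn is installed and reusing the get_bag_of_words function from above, a naive Bayes classifier could be trained on bag-of-words vectors like this (the tiny documents and labels here are made up purely for illustration):
from sklearn.naive_bayes import MultinomialNB

# Toy training data: tokenized documents and their labels
docs = [
    ["act", "now", "urgent", "reply", "now"],
    ["meeting", "notes", "attached", "see", "you", "tomorrow"],
]
labels = ["spam", "not spam"]

# Build a vocabulary from the training documents and embed each one
vocab = sorted({word for doc in docs for word in doc})
X = [get_bag_of_words(doc, vocab) for doc in docs]

# Train the classifier and classify a new document
clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict([get_bag_of_words(["urgent", "reply", "now"], vocab)]))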
Frequently Asked Questions
What is the difference between TF-IDF and bag-of-words?
A bag-of-words model helps machine learning algorithms extract features from text by conceptualizing the text as a bag of words and simply counting the frequency of each word. TF-IDF goes a step further, reducing the importance of words that show up more across all texts. This is to account for the flaw in bag-of-words where meaningless words like “a” and “the” can gain too much influence over the model.
What are the advantages of bag-of-words?
Bag-of-words models don’t require intensive domain knowledge or pretraining, making them easy to implement and use. As a result, bag-of-words is ideal for simple tasks like sentiment analysis, spam filtering and language identification.
What are the four steps of bag-of-words?
Implement a bag-of-words model by following these four steps:
- Compile data.
- Preprocess the data, if needed.
- Develop a list of all words in the model’s vocabulary.
- Score the words in each document and create document vectors.