Machine learning algorithms can’t operate on raw text; for any natural language processing task, we first need to convert the text to some sort of numerical representation. This process is known as embedding the text. The bag of words model is one particularly simple way to embed a document in numerical form before feeding it into a machine learning algorithm.
There are two basic approaches to embedding a text: word vectors and document vectors. With word vectors, we represent each individual word in the text as a vector (i.e., a sequence of numbers) and convert the whole document into a sequence of these word vectors. Document vectors, on the other hand, embed the entire document as a single vector. This is much easier than embedding every word individually, and it embeds every document as a vector of the same size, which is convenient since many machine learning algorithms require a fixed-size input.
What Is the Bag of Words Model in NLP?
The bag of words model is a simple way to convert words to a numerical representation in natural language processing: it embeds each document based purely on word frequency.
Conceptually, we think of the whole document as a “bag” of words, rather than a sequence, and we represent the document simply by the frequency of each word. For example, if we have a vocabulary of 1,000 words, then the whole document will be represented by a 1,000-dimensional vector, where the vector’s ith entry is the frequency of the ith vocabulary word in the document.
Using this technique, we can embed a whole set of documents and feed them into a variety of different machine learning algorithms. Since the embedding is so basic, it doesn’t work very well for complex tasks, but it works well enough for simple classification problems, and its simplicity and ease of use make it an attractive choice. Let’s look into the specifics.
Bag of Words Steps With Example
As a toy example, let’s suppose our documents have a small vocabulary. For instance, Dr. Seuss’ book Green Eggs and Ham has only fifty unique words. In alphabetical order, they are: a, am, and, anywhere, are, be, boat, box, car, could, dark, do, eat, eggs, fox, goat, good, green, ham, here, house, I, if, in, let, like, may, me, mouse, not, on, or, rain, Sam, say, see, so, thank, that, the, them, there, they, train, tree, try, will, with, would, and you.
If we treat each page of the book as a single document, we can embed each of them as a 50-dimensional vector. Consider the page that reads:
I would not like them here or there.
I would not like them anywhere.
I do not like green eggs and ham.
I do not like them, Sam-I-am.
The first step is to count the frequency of each vocabulary word. “am” appears one time, “and” one time, “anywhere” one time, “do” two times, “eggs” one time, “green” one time, “ham” one time, “here” one time, “I” five times, “like” four times, “not” four times, “or” one time, “Sam” one time, “them” three times, “there” one time, and “would” two times. Every other vocabulary word appears zero times.
To turn this into a 50-dimensional vector, we set the ith entry equal to the frequency of the ith vocabulary word. For instance, “am” is the second vocabulary word and appears once in the document, so the second entry will be one. On the other hand, “a” is the first vocabulary word, and it doesn’t appear at all here, so the first entry will be zero. The whole page in vector form becomes: [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 1, 1, 0, 5, 0, 0, 0, 4, 0, 0, 0, 4, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 2, 0].
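As a quick cross-check of the hand count above, here is a short sketch using Python’s collections.Counter. Note that I’ve pre-tokenized the page by hand (lowercased, punctuation removed, and “Sam-I-am” split apart); the implementation in the next section automates that step:

```python
from collections import Counter

# The 50-word vocabulary, in alphabetical order
vocab = ['a', 'am', 'and', 'anywhere', 'are', 'be', 'boat', 'box', 'car',
         'could', 'dark', 'do', 'eat', 'eggs', 'fox', 'goat', 'good', 'green',
         'ham', 'here', 'house', 'i', 'if', 'in', 'let', 'like', 'may', 'me',
         'mouse', 'not', 'on', 'or', 'rain', 'sam', 'say', 'see', 'so', 'thank',
         'that', 'the', 'them', 'there', 'they', 'train', 'tree', 'try', 'will',
         'with', 'would', 'you']

# The page, already lowercased and stripped of punctuation by hand
page = ("i would not like them here or there "
        "i would not like them anywhere "
        "i do not like green eggs and ham "
        "i do not like them sam i am")

# Count each word, then read the counts off in vocabulary order
counts = Counter(page.split())
vector = [counts[word] for word in vocab]
```

Reading `vector` back out reproduces the 50-dimensional embedding given above.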
Now let’s see how we can get Python to do all this work for us automatically.
Implementing Bag of Words in Python
The bag of words model is simple to implement in Python. See the following code:
```python
# Assumes that 'doc' is a list of strings and 'vocab' is some iterable of
# vocab words (e.g., a list or set)
def get_bag_of_words(doc, vocab):
    # Create an initial dictionary that maps each vocabulary word to a count of 0
    word_count_dict = dict.fromkeys(vocab, 0)
    # For each word in the doc, increment its count
    # (ignoring any word that isn't in the vocabulary)
    for word in doc:
        if word in word_count_dict:
            word_count_dict[word] += 1
    # Now, initialize the vector to a list of zeros
    bag = [0] * len(vocab)
    # For every vocab word, set its index equal to its count
    for i, word in enumerate(vocab):
        bag[i] = word_count_dict[word]
    return bag
```
As the first comment indicates, this code assumes we’ve already managed to represent our document as a list of separated strings. If it’s given to us as one big string instead, we’ll have to do a bit of preprocessing first. Let’s see how we’d do this with the Dr. Seuss example:
```python
import re

# Define the vocabulary
vocab = ['a', 'am', 'and', 'anywhere', 'are', 'be', 'boat', 'box', 'car',
         'could', 'dark', 'do', 'eat', 'eggs', 'fox', 'goat', 'good', 'green',
         'ham', 'here', 'house', 'i', 'if', 'in', 'let', 'like', 'may', 'me',
         'mouse', 'not', 'on', 'or', 'rain', 'sam', 'say', 'see', 'so', 'thank',
         'that', 'the', 'them', 'there', 'they', 'train', 'tree', 'try', 'will',
         'with', 'would', 'you']

# Define the document
doc = ("I would not like them here or there.\n"
       "I would not like them anywhere.\n"
       "I do not like green eggs and ham.\n"
       "I do not like them, Sam-I-am.")

# Convert to lowercase
doc = doc.lower()

# Split on all non-word characters (i.e., whitespace and punctuation)
doc = re.split(r"\W", doc)

# Drop the empty strings that arise from splitting
doc = [s for s in doc if len(s) > 0]

bag_of_words = get_bag_of_words(doc, vocab)
```
We use the Python regular expression library here to split the document into individual words; the pattern \W matches any single non-word character, such as whitespace and punctuation.
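To make the effect of that split concrete, here is what it does to one line of the page, and to the hyphenated “Sam-I-am”:

```python
import re

# Splitting on \W breaks the line at spaces and at the final period;
# filtering removes the empty string left behind by the period
line = "I would not like them here or there.".lower()
tokens = [s for s in re.split(r"\W", line) if len(s) > 0]

# Hyphens also count as non-word characters, so the compound splits apart
name = [s for s in re.split(r"\W", "Sam-I-am.".lower()) if len(s) > 0]
```

After this step, `tokens` is a clean list of lowercase words, and “Sam-I-am” has been broken into the separate vocabulary words “sam,” “i” and “am.”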
Benefits of Bag of Words Model
The primary benefits of this model are its simplicity and efficiency; this method is easy to implement and fast to run. Other embedding methods often require special domain knowledge or extensive pretraining.
For instance, consider an example from section 5.2.1 of the textbook Speech and Language Processing. There, the authors consider the task of sentiment analysis: deciding whether a document expresses a positive or negative attitude. They represent each document with a feature vector made up of specialized features, such as the number of positive words that occur in the document (as defined by some preexisting lexicon of positive words), as well as the number of first- and second-person pronouns. It takes a lot of work to engineer a representation like this, and it only works for sentiment analysis!
At the other extreme, there are representations like word2vec, where a neural network learns its own embedding for each word. This doesn’t require any hand engineering, but it does need a decent amount of data and computing power.
The bag of words model avoids both of these difficulties. There’s no manual feature engineering or expensive pretraining. It basically works right out of the box. The drawback is that it only works well on fairly simple tasks that don’t depend on understanding the surrounding context of words. Now, let’s look at where we can apply it.
Applications of Bag of Words Model
The bag of words model is typically used to embed documents in order to train a classifier. Classification is a machine learning task that assigns a document to one of several predefined categories. The model doesn’t produce features that are particularly useful for other tasks like, say, question answering or summarization, because those tasks require a more semantic understanding of the document and must take context into account.
For many classification tasks, on the other hand, the mere presence and frequency of certain words is strongly indicative of which category a document belongs to. Some common uses of the bag of words method include spam filtering, sentiment analysis, and language identification.
We can often decide whether an email is spam or not by the frequency of certain key phrases, such as “act now” and “urgent reply.” Sentiment analysis is a similar situation, with terms like “boring” and “awful” clearly suggesting a negative tone, while “beautiful” and “spectacular” are clearly positive. And if your vocabulary includes words drawn from many languages, it’s pretty easy to see when all the words in one document are coming from one language in particular without needing a deep contextual understanding of the text itself.
Once your documents have been embedded, you’ll want to feed them into a classification algorithm. Essentially, any algorithm you’re familiar with should work. If you’re looking for a simple place to start, you might try the naive Bayes classifier, logistic regression, or decision trees/random forests. All are relatively easy to implement and to understand compared to more complex neural network solutions. Good luck!
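To round this out, here is a minimal sketch of that last step: a tiny multinomial naive Bayes classifier trained directly on bag-of-words vectors. The vocabulary, toy documents and helper names here are all invented for illustration, and in practice you would more likely reach for a library such as scikit-learn:

```python
import math
from collections import Counter

def bag_of_words(doc, vocab):
    # Simple whitespace tokenization; fine for these punctuation-free toy docs
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

def train_naive_bayes(bags, labels, vocab_size, alpha=1.0):
    # Returns per-class log priors and per-word log likelihoods,
    # with add-alpha (Laplace) smoothing to avoid log(0)
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_bags = [b for b, y in zip(bags, labels) if y == c]
        log_prior[c] = math.log(len(class_bags) / len(bags))
        word_totals = [sum(col) for col in zip(*class_bags)]
        total = sum(word_totals) + alpha * vocab_size
        log_likelihood[c] = [math.log((t + alpha) / total) for t in word_totals]
    return log_prior, log_likelihood

def predict(bag, log_prior, log_likelihood):
    # Score each class by its prior plus the weighted word log likelihoods
    scores = {
        c: log_prior[c] + sum(n * ll for n, ll in zip(bag, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Invented toy training data for a spam filter
vocab = ['act', 'now', 'urgent', 'reply', 'meeting', 'notes', 'attached', 'thanks']
train_docs = [
    ("act now urgent reply", "spam"),
    ("urgent act now now", "spam"),
    ("meeting notes attached thanks", "ham"),
    ("thanks for the notes", "ham"),
]
bags = [bag_of_words(d, vocab) for d, _ in train_docs]
labels = [y for _, y in train_docs]
log_prior, log_likelihood = train_naive_bayes(bags, labels, len(vocab))

label = predict(bag_of_words("urgent reply now", vocab), log_prior, log_likelihood)
```

The classifier never looks at word order; it decides purely from which words appear and how often, which is exactly the signal the bag of words embedding preserves.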