Term frequency-inverse document frequency (TF-IDF) is a natural language processing (NLP) technique that’s used to measure the importance of different words in a sentence. It cancels out the incapabilities of the bag of words technique, which is good for text classification or for helping a machine learning model read words.

Term Frequency-Inverse Document Frequency (TF-IDF) Explained

TF-IDF is a natural language processing (NLP) technique that’s used to evaluate the importance of different words in a sentence. It’s useful in text classification and for helping a machine learning model read words.   

In this article, we’ll explore how to create a TF-IDF model in Python from scratch. Here’s what we’ll cover in the article:

  • Terminology
  • Term frequency(TF)
  • Document frequency (DF)
  • Inverse document frequency (IDF)
  • TF-IDF Implementation in Python

 

TF-IDF Terms to Know

Below are the terms you’ll need to understand to create a TF-IDF model.

  • t — term (word).
  • d — document (set of words).
  • N — count of corpus.
  • corpus — the total document set.

More on AIA Guide to Image Captioning in Deep Learning

 

What Is Term Frequency (TF) in TF-IDF?

Suppose we have a set of English text documents and wish to rank which document is most relevant to the query, “Data Science is awesome!” One way to start out is to eliminate documents that don’t contain all three words: “Data,” “is,” “Science” and “awesome,” but this still leaves too many documents. To further distinguish them, we might count the number of times each term occurs in each document. The number of times a term occurs in a document is called its term frequency.

The weight of a term that occurs in a document is simply proportional to the term frequency.

 

TF Formula

tf(t,d) = count of t in d / number of words in d

 

What Is Document Frequency in TF-IDF?

Document frequency (DF) measures the importance of a document in a whole set of corpus. This is very similar to TF. The only difference is that TF is a frequency counter for a term t in document d, whereas DF is the count of occurrences of term t in the document set N. In other words, DF is the number of documents in which the word is present. We consider one occurrence if the term consists in the document at least once. We don’t need to know the number of times the term is present.

 

DF Formula

df(t) = occurrence of t in documents

 

What Is Inverse Document Frequency (IDF) in TF-IDF?

While computing TF, all terms are considered equally important. However, certain terms, such as “is,” “of,” and “that,” may appear a lot of times but have little importance. We need to weigh down the frequent terms while scaling up the rare ones. When we compute IDF, an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that rarely occur.

IDF is the inverse of the document frequency, which measures the informativeness of term t. When we calculate IDF, it will be very low for the most occurring words, such as stop words like “is.” That’s because those words are present in almost all of the documents, and N/df will give a very low value to words like that.

This finally gives us what we want, a relative weightage:

idf(t) = N/df

Now, there are few other problems with the IDF. If you have a large corpus, say 100,000,000, the IDF value explodes. To avoid this, we take the log of IDF.

During the query time, when a word that’s not in the vocabulary occurs, the DF will be 0. Since we can’t divide by 0, we smoothen the value by adding 1 to the denominator.

And that’s the final formula.

 

IDF Formula

idf(t) = log(N/(df + 1))

 

How Does TF-IDF Work?

TF-IDF is a measure used to evaluate how important a word is to a document in a collection or corpus. There are many different variations of TF-IDF, but for now, let’s concentrate on the basic version.

 

TF-IDF Formula

tf-idf(t, d) = tf(t, d) * log(N/(df + 1))

 

How to Create a TF-IDF Model in Python

To make a TF-IDF model from scratch in Python, let’s imagine these two sentences from a different document:

  • first_sentence: “Data Science is the sexiest job of the 21st century”.
  • second_sentence: “machine learning is the key for data science”.

 

1. Create the TF Function

First, we have to create the TF function to calculate total word frequency for all documents. 

As usual, we should import the necessary libraries:

import pandas as pd
import sklearn as sk
import math

Next, let’s load our sentences and combine them together in a single set:

first_sentence = "Data Science is the sexiest job of the 21st century"
second_sentence = "machine learning is the key for data science"

#split so each word have their own string

first_sentence = first_sentence.split(" ")
second_sentence = second_sentence.split(" ")#join them to remove common duplicate words
total= set(first_sentence).union(set(second_sentence))

print(total)

Output:

{'data', 'Science', 'job', 'sexiest', 'the', 'for', 'science', 'machine', 'of', 'is', 'learning', '21st', 'key', 'Data', 'century'}

Now, lets add a way to count the words using a dictionary key-value pairing for both sentences:

wordDictA = dict.fromkeys(total, 0) 
wordDictB = dict.fromkeys(total, 0)

for word in first_sentence:
    wordDictA[word]+=1
    
for word in second_sentence:
    wordDictB[word]+=1

Then, we put them in a dataframe and view the result:

pd.DataFrame([wordDictA, wordDictB])
Dataframe for words in the sentence
Dataframe for words in the sentence. | Image: Yassine Hamdaoui

Now, let’s write the TF function:

def computeTF(wordDict, doc):
    tfDict = {}
    corpusCount = len(doc)
    for word, count in wordDict.items():
        tfDict[word] = count/float(corpusCount)
    return(tfDict)

#running our sentences through the tf function:

tfFirst = computeTF(wordDictA, first_sentence)
tfSecond = computeTF(wordDictB, second_sentence)

#Converting to dataframe for visualization

tf = pd.DataFrame([tfFirst, tfSecond])

And this is the expected output:

Term frequency dataframe results.
Term frequency dataframe results. | Image: Yassine Hamdaoui

That’s all for the TF formula.

 

2. Eliminate the Stop Words

Now, let’s talk about how to eliminate stop words, which are the most commonly occuring words that don’t give any additional value to the document vector. Removing these will increase computation and space efficiency.

nltk library has a method to download the stop words. So, instead of explicitly mentioning all the stop words ourselves, we can just use the nltk library and iterate over all the words and remove the stop words. There are many efficient ways to do this, but I’ll just give a simple method.

Here’s a sample of a stop words in the English language:

Stop words list in the English language.
Stop words list in the English language. | Image: Yassine Hamdaoui

This is the code you can use to download stop words and remove them.

import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

filtered_sentence = [w for w in wordDictA if not w in stop_words]

print(filtered_sentence)

Output:

['data', 'Science', 'job', 'sexiest', 'science', 'machine', 'learning', '21st', 'key', 'Data', 'century']

 

3. Create the IDF Function

Now that we’ve finished the TF section, let’s move onto the IDF part:

def computeIDF(docList):
    idfDict = {}
    N = len(docList)
    
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / (float(val) + 1))
        
    return(idfDict)
#inputing our sentences in the log file
idfs = computeIDF([wordDictA, wordDictB])

 

4. Calculate TF-IDF

And now that we’ve implemented the idf formula, let’s finish with calculating the TF-IDF:

def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val*idfs[word]
    return(tfidf)
#running our two sentences through the IDF:

idfFirst = computeTFIDF(tfFirst, idfs)
idfSecond = computeTFIDF(tfSecond, idfs)
#putting it in a dataframe
idf= pd.DataFrame([idfFirst, idfSecond])
print(idf)

Output:

TF-IDF model output dataframe for the example sentence.
TF-IDF model output dataframe for the example sentence. | Image: Yassine Hamdaoui

That was a lot of work. But it’s handy to know if you’re asked to code TF-IDF from scratch in the future. 

A tutorial on TF-IDF. | Video: Krish Naik

More on AILemmatization in Natural Language Processing (NLP) and Machine Learning

 

How to Create a TF-IDF Model Using Sklearn?

It’s a lot easier to create a TF-IDF model using the Sklearn library than doing it from scratch. Let’s look at the example below:

#first step is to import the library

from sklearn.feature_extraction.text import TfidfVectorizer
#for the sentence, make sure all words are lowercase or you will run #into error. for simplicity, I just made the same sentence all #lowercase

firstV= "Data Science is the sexiest job of the 21st century"
secondV= "machine learning is the key for data science"

#calling the TfidfVectorizer
vectorize= TfidfVectorizer()
#fitting the model and passing our sentences right away:

response= vectorize.fit_transform([firstV, secondV])

And this is the expected output:

TF-IDF results using Sklearn.
TF-IDF results using Sklearn. | Image: Yassine Hamdaoui

In this post, we’ve covered how to use Python and a NLP technique known as term frequency-inverse document frequency (TF-IDF) to summarize documents. We used Sklearn along with nltk to accomplish this task.

Expert Contributors

Built In’s expert contributor network publishes thoughtful, solutions-oriented stories written by innovative tech professionals. It is the tech industry’s definitive destination for sharing compelling, first-person accounts of problem-solving on the road to innovation.

Learn More

Great Companies Need Great People. That's Where We Come In.

Recruit With Us