Updated by Hal Koss | Jul 19, 2022

Natural Language Processing (NLP) is a subfield of machine learning that makes it possible for computers to understand, analyze, manipulate and generate human language. You encounter NLP machine learning in your everyday life, from spam detection to autocorrect to your digital assistant (“Hey, Siri?”). You may even encounter NLP without realizing it. In this article, I’ll show you how to develop your own NLP projects with the Natural Language Toolkit (NLTK), but before we dive into the tutorial, let’s look at some everyday examples of NLP.

Examples of NLP Machine Learning

  • Email spam filters
  • Auto-correct
  • Predictive text
  • Speech recognition
  • Information retrieval
  • Information extraction
  • Machine translation
  • Text simplification
  • Sentiment analysis
  • Text summarization
  • Query response
  • Natural language generation


 

Get Started With NLP

NLTK is a popular open-source suite of Python libraries. Rather than building all of your NLP tools from scratch, NLTK provides ready-made functions for the most common NLP tasks so you can jump right in. In this tutorial, I’ll show you how to perform basic NLP tasks and use a machine learning classifier to predict whether an SMS is spam (a harmful, malicious or unwanted message) or ham (something you might actually want to read). You can find all the code below in this GitHub repo.

First things first, you’ll want to install NLTK.

Type !pip install nltk in a Jupyter Notebook cell. If that doesn’t work, run pip install nltk from the command line, or conda install -c conda-forge nltk if you use Anaconda. You shouldn’t have to do much troubleshooting beyond that.

Importing NLTK Library

import nltk

# Opens the NLTK downloader so you can fetch corpora and models (e.g. the stopwords corpus).
nltk.download()

This code gives us an NLTK downloader application, which is helpful in all NLP tasks.

[Screenshot: the NLTK downloader application window.]

As you can see, I’ve already installed the Stopwords Corpus on my system, which helps remove redundant words. You’ll be able to install whatever packages are most useful for your project.


Prepare Your Data for NLP

Reading in Text Data

Our data comes to us in a structured or unstructured format. A structured format has a well-defined pattern: Excel and Google Sheets files, for example, are structured data. Unstructured data, by contrast, has no discernible pattern (e.g. images, audio files, social media posts). In between these two data types we find semi-structured formats, and language is a great example of semi-structured data.

[Screenshot: the raw text data read in as one long string.]
Access raw code here.

As we can see from the code above, semi-structured data read in as raw text is hard for a computer (and a human!) to interpret. We can use Pandas to help us understand our data.

[Screenshot: the same data loaded into a Pandas DataFrame.]
Access raw code here.

With the help of Pandas we can now see and interpret our semi-structured data more clearly.
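If you want to run these steps yourself, here is a minimal sketch of both reads. The file name SMSSpamCollection.tsv and the column names label and body_text are assumptions based on the repo's conventions; adjust them to match your copy of the data.

import pandas as pd

# Read the raw file first: each line holds a label and a message separated by a tab.
with open('SMSSpamCollection.tsv', encoding='utf-8') as f:
    raw_data = f.read()
print(raw_data[:250])  # one long string that's hard to interpret

# Pandas gives the same data a tabular structure that's much easier to work with.
data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None, names=['label', 'body_text'])
print(data.head())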

 

How to Clean Your Data

Cleaning up your text data is necessary to highlight the attributes you want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of the following steps.

How to Clean Your Data for NLP

  1. Remove punctuation
  2. Tokenize
  3. Remove stop words
  4. Stem
  5. Lemmatize

1. Remove Punctuation

Punctuation can provide grammatical context to a sentence which supports human understanding. But for our vectorizer, which counts the number of words and not the context, punctuation does not add value. So we need to remove all special characters. For example, “How are you?” becomes: How are you

Here’s how to do it:

[Screenshot: removing punctuation to create body_text_clean.]

In body_text_clean, you can see we’ve removed all punctuation. I’ve becomes Ive and WILL!! becomes WILL.
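In code, this step might look like the following minimal sketch (the column names follow the repo's convention; string.punctuation covers the standard ASCII punctuation characters):

import string

def remove_punct(text):
    # Keep only the characters that are not punctuation marks.
    return ''.join(ch for ch in text if ch not in string.punctuation)

data['body_text_clean'] = data['body_text'].apply(remove_punct)
print(data[['body_text', 'body_text_clean']].head())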

2. Tokenize

Tokenizing separates text into units such as sentences or words. In other words, this function gives structure to previously unstructured text. For example: Plata o Plomo becomes ‘Plata’, ‘o’, ‘Plomo’.

[Screenshot: tokenizing to create body_text_tokenized.]
Access raw code here.

In body_text_tokenized, we’ve generated all the words as tokens.
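A minimal sketch of this step, using a simple regular-expression split rather than NLTK's tokenizers (the helper name tokenize is just for illustration):

import re

def tokenize(text):
    # Split on runs of non-word characters and lowercase each token.
    return [tok for tok in re.split(r'\W+', text.lower()) if tok]

data['body_text_tokenized'] = data['body_text_clean'].apply(tokenize)
print(data[['body_text_clean', 'body_text_tokenized']].head())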

3. Remove Stop Words

Stop words are common words that will likely appear in any text. They don’t tell us much about our data so we remove them. Again, these are words that are great for human understanding, but will confuse your machine learning program. For example: silver or lead is fine for me becomes silver, lead, fine.

[Screenshot: removing stop words to create body_text_nostop.]
Access raw code here.

In body_text_nostop, we remove all unnecessary words like “been,” “for,” and “the.”
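Here is a minimal sketch of the step, using NLTK's built-in English stop word list:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop word lists
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # Drop any token that appears in the English stop word list.
    return [word for word in tokens if word not in stop_words]

data['body_text_nostop'] = data['body_text_tokenized'].apply(remove_stopwords)
print(data[['body_text_tokenized', 'body_text_nostop']].head())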

4. Stem

Stemming helps reduce a word to its stem form. It often makes sense to treat related words in the same way. Stemming removes suffixes like “ing,” “ly” and “s” with a simple rule-based approach. It reduces the corpus of words, but often the actual words are lost, in a sense. For example: “Entitling” or “Entitled” become “Entitl.”

Note: Some search engines treat words with the same stem as synonyms. 

[Screenshot: stemming to create body_text_stemmed.]
Access raw code here.

In body_text_stemmed, words like entry and goes are stemmed to entri and goe even though they don’t mean anything in English.
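A minimal sketch of stemming with NLTK's PorterStemmer (the repo may use a different stemmer, so treat this as illustrative):

from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stemming(tokens):
    # Reduce each token to its stem, e.g. "entry" -> "entri", "goes" -> "goe".
    return [ps.stem(word) for word in tokens]

data['body_text_stemmed'] = data['body_text_nostop'].apply(stemming)
print(data[['body_text_nostop', 'body_text_stemmed']].head())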

5. Lemmatize

Lemmatizing derives the root form (“lemma”) of a word. This practice is more robust than stemming because it uses a dictionary-based approach (i.e. a morphological analysis) to find the root word. For example, “Entitling” or “Entitled” become “Entitle.”

In short, stemming is typically faster as it simply chops off the end of the word, but without understanding the word’s context. Lemmatizing is slower but more accurate because it takes an informed analysis with the word’s context in mind.

[Screenshot: lemmatizing to create body_text_lemmatized.]
Access raw code here.

Comparing the stemmed and lemmatized columns, we can see that words like “chances” are lemmatized to “chance” but stemmed to “chanc.”
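A minimal sketch of lemmatization with NLTK's WordNetLemmatizer (the wordnet download is a one-time step):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # the lemmatizer looks words up in WordNet
wn = WordNetLemmatizer()

def lemmatizing(tokens):
    # Look each token up in WordNet and return its lemma, e.g. "chances" -> "chance".
    return [wn.lemmatize(word) for word in tokens]

data['body_text_lemmatized'] = data['body_text_nostop'].apply(lemmatizing)
print(data[['body_text_nostop', 'body_text_lemmatized']].head())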


 

Vectorize Data

Vectorizing is the process of encoding text as numbers to create feature vectors so that machine learning algorithms can understand language.

Methods of Vectorizing Data for NLP

  1. Bag-of-Words
  2. N-Grams
  3. TF-IDF

1. Bag-Of-Words

Bag-of-Words (BoW), implemented in scikit-learn as CountVectorizer, describes the presence of words within the text data. Each document becomes a row in a document-term matrix whose entries record how often each word occurs (or simply one if the word is present and zero if it is absent, in the binary variant).
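The snippets below assume a clean_text helper that chains the pre-processing steps from the previous section and returns a list of cleaned tokens. A minimal version might look like this (the use of PorterStemmer here is an assumption; check the repo for the exact function):

import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def clean_text(text):
    # Remove punctuation, tokenize, drop stop words and stem, all in one pass.
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    tokens = [tok for tok in re.split(r'\W+', text.lower()) if tok]
    return [ps.stem(tok) for tok in tokens if tok not in stop_words]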

from sklearn.feature_extraction.text import CountVectorizer

# clean_text is the pre-processing function defined above; it returns a list of cleaned tokens.
count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(data['body_text'])
print(X_counts.shape)
print(count_vect.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0

We apply BoW to the body_text so the count of each word is stored in the document matrix. (Check the repo).

2. N-Grams

N-grams are simply all combinations of adjacent words or letters of length n that we find in our source text. N-grams with n=1 are called unigrams, n=2 are bigrams, and so on. 

[Screenshot: example n-grams extracted from a sentence.]
Access raw code here.

Unigrams usually don’t contain much information as compared to bigrams or trigrams. The basic principle behind N-grams is that they capture which letter or word is likely to follow a given word. The longer the N-gram (higher n), the more context you have to work with.

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) keeps only bigrams. N-gram extraction is performed by the built-in
# word analyzer, so the text should be cleaned beforehand; passing a custom analyzer
# callable would bypass ngram_range entirely.
ngram_vect = CountVectorizer(ngram_range=(2, 2))
X_counts = ngram_vect.fit_transform(data['body_text'])
print(X_counts.shape)
print(ngram_vect.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0

We’ve applied N-Gram to the body_text, so the count of each group of words in a sentence is stored in the document matrix. (Check the repo).

3. TF-IDF

TF-IDF computes the relative frequency with which a word appears in a document compared to its frequency across all documents. It’s more useful than term frequency for identifying key words in each document (high frequency in that document, low frequency in other documents).

Note: We use TF-IDF for search engine scoring, text summarization and document clustering. Check my article on recommender systems to learn more about TF-IDF.

from sklearn.feature_extraction.text import TfidfVectorizer

# clean_text is the same pre-processing function used for the CountVectorizer above.
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])
print(X_tfidf.shape)
print(tfidf_vect.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0

We’ve applied TF-IDF to the body_text, so the relative weight of each word in each message is stored in the document matrix. (Check the repo).

Note: Vectorizers output sparse matrices, in which most entries are zero. To store them efficiently, only the locations and values of the non-zero elements are kept.


 

Feature Engineering

Feature Creation

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Because feature engineering requires domain knowledge, features can be tough to create, but they’re certainly worth your time.

[Screenshot: creating the body_len and punct% features.]
Access raw code here.
  • body_len shows the length of a message body in characters, excluding whitespace.
  • punct% shows the percentage of punctuation marks in a message body.
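A minimal sketch of how these two features might be computed (the column names and the rounding follow common convention rather than the exact repo code):

import string

# Character count of the message body, ignoring whitespace.
data['body_len'] = data['body_text'].apply(lambda text: len(text) - text.count(' '))

def count_punct(text):
    # Percentage of non-whitespace characters that are punctuation marks.
    total = len(text) - text.count(' ')
    count = sum(1 for ch in text if ch in string.punctuation)
    return round(count / total * 100, 3) if total else 0

data['punct%'] = data['body_text'].apply(count_punct)
print(data[['body_text', 'body_len', 'punct%']].head())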

Is Your Feature Worthwhile? 

[Screenshot: histogram of body_len for spam vs. ham.]
Access raw code here.

We can see clearly that spam messages contain many more words than ham messages, so body_len is a good feature for distinguishing between them.

Now let’s look at punct%.

[Screenshot: histogram of punct% for spam vs. ham.]
Access raw code here.

Spam has a higher percentage of punctuation than ham, but not by a wide margin, which is surprising given that spam messages often contain a lot of punctuation marks. Nevertheless, given the apparent difference, we can still call this a useful feature.
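If you want to reproduce plots like these, a minimal sketch with Matplotlib might look like the following (the label values 'spam'/'ham' and the bin ranges are assumptions):

import matplotlib.pyplot as plt
import numpy as np

# Overlay histograms of body_len for spam and ham messages.
bins = np.linspace(0, 200, 40)
plt.hist(data[data['label'] == 'spam']['body_len'], bins, alpha=0.5, label='spam')
plt.hist(data[data['label'] == 'ham']['body_len'], bins, alpha=0.5, label='ham')
plt.legend(loc='upper right')
plt.title('body_len by label')
plt.show()

# Same idea for punct%.
bins = np.linspace(0, 50, 40)
plt.hist(data[data['label'] == 'spam']['punct%'], bins, alpha=0.5, label='spam')
plt.hist(data[data['label'] == 'ham']['punct%'], bins, alpha=0.5, label='ham')
plt.legend(loc='upper right')
plt.title('punct% by label')
plt.show()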


 

Building Machine Learning Classifiers 

Model Selection

We use an ensemble method of machine learning, in which multiple models working in concert produce more robust results than a single model (e.g. a support vector machine or Naive Bayes) would. Ensemble methods are the first choice for many Kaggle competitions. Here we construct a random forest (i.e. multiple random decision trees) and aggregate the predictions of the individual trees for the final prediction. This approach can be used for classification as well as regression problems and follows a bootstrap aggregating (bagging) strategy.

  • Grid search: exhaustively searches over all parameter combinations in a given grid to determine the best model (see the sketch below).
  • Cross-validation: divides the data set into k subsets, repeats the training k times and uses a different subset as the test set in each iteration.
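Here is a minimal sketch of the grid search. The combined feature matrix X_features (engineered features plus the TF-IDF matrix) and the parameter grid values are illustrative assumptions; the repo's exact setup may differ.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Combine the engineered features with the TF-IDF features.
X_tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vect.get_feature_names_out())
X_features = pd.concat([data[['body_len', 'punct%']].reset_index(drop=True), X_tfidf_df], axis=1)

param_grid = {
    'n_estimators': [10, 150, 300],    # number of trees in the forest
    'max_depth': [30, 60, 90, None],   # maximum number of levels in each tree
}

rf = RandomForestClassifier()
gs = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1)   # 5-fold cross-validation
gs_fit = gs.fit(X_features, data['label'])

# Inspect the mean test score for each parameter combination.
results = pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)
print(results[['param_n_estimators', 'param_max_depth', 'mean_test_score']].head())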
[Screenshot: grid search results (mean_test_score per parameter combination).]
Access raw code here.

The mean_test_score is highest for n_estimators=150 at the best-performing max_depth. Here, n_estimators is the number of trees in the forest (a group of decision trees) and max_depth is the maximum number of levels in each decision tree.

[Screenshot: results of a second search (mean_test_score per parameter combination).]
Access raw code here.

Similarly, the mean_test_score for n_estimators=150 and max_depth=90 gives the best result.

Future Improvements

You could use gradient boosting methods such as GradientBoosting or XGBoost for the classification. Gradient boosting will take a while because it takes an iterative approach, combining weak learners into strong learners by focusing on the mistakes of prior iterations. In short, compared to random forest's parallel, randomized bagging, gradient boosting follows a sequential approach.
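As a rough sketch of what that could look like with scikit-learn's GradientBoostingClassifier (the hyperparameters are illustrative, and X_features is the combined feature matrix from the grid-search sketch above):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Boosted trees are built sequentially, each one correcting the previous trees' mistakes.
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)
scores = cross_val_score(gb, X_features, data['label'], cv=5, n_jobs=-1)
print(scores.mean())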


 

Our NLP Machine Learning Classifier

We combine all the above-discussed sections to build a Spam-Ham Classifier.
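An end-to-end sketch might look like the following. The 80/20 split, the random_state and the reuse of X_features are assumptions; the best parameters (n_estimators=150, max_depth=90) come from the grid search above.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Hold out 20 percent of the messages for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_features, data['label'], test_size=0.2, random_state=42)

# Train a random forest with the best parameters found by the grid search.
rf = RandomForestClassifier(n_estimators=150, max_depth=90, n_jobs=-1)
rf_model = rf.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

precision, recall, fscore, _ = precision_recall_fscore_support(
    y_test, y_pred, pos_label='spam', average='binary')
print('Precision: {:.3f} / Recall: {:.3f} / F1: {:.3f} / Accuracy: {:.3f}'.format(
    precision, recall, fscore, accuracy_score(y_test, y_pred)))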

[Screenshot: classifier results and confusion matrix.]

Random forest provides 97.7 percent accuracy, and we obtain a high F1-score from the model. The confusion matrix tells us that we correctly predicted 965 hams and 123 spams. We incorrectly identified zero hams as spam, while 26 spams were incorrectly predicted as ham. This margin of error is justifiable given that letting a few spams through is preferable to potentially losing important hams to an SMS spam filter.

Spam filters are just one example of NLP you encounter every day. Here are others that influence your life each day; hopefully this tutorial will help you try some of them out for yourself.

  • Email spam filters — your “junk” folder

  • Auto-correct — text messages, word processors

  • Predictive text — search engines, text messages

  • Speech recognition — digital assistants like Siri, Alexa 

  • Information retrieval — Google finds relevant and similar results

  • Information extraction — Gmail suggests events from emails to add on your calendar

  • Machine translation — Google Translate translates language from one language to another

  • Text simplification — Rewordify simplifies the meaning of sentences

  • Sentiment analysis — Hater News gives us the sentiment of the user

  • Text summarization — Reddit’s autotldr gives a summary of a submission

  • Query response — IBM Watson’s answers to a question

  • Natural language generation — generation of text from image or video data
