Lemmatization is one of the most common text pre-processing techniques used in natural language processing (NLP) and machine learning in general. Lemmatization is not very different from stemming. In both, we try to reduce a given word to its root word. The root word is called a stem in the stemming process and a lemma in the lemmatization process. But the differences between the two go further than the name. Let’s see what those are.
What Is Lemmatization?
Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to reduce a word to its root form so that related words can be treated as the same. For example, a lemmatization algorithm would reduce the word better to its root word, or lemma, good.
How Is Lemmatization Different From Stemming?
In stemming, a part of the word is just chopped off at the tail end to arrive at the stem of the word. There are different algorithms used to find out how many characters have to be chopped off, but the algorithms don’t actually know the meaning of the word in the language it belongs to. In lemmatization, the algorithms do have this knowledge. In fact, you can even say that these algorithms refer to a dictionary to understand the meaning of the word before reducing it to its root word, or lemma.
So, a lemmatization algorithm would know that the word better is a form of the word good, and hence, the lemma is good. But a stemming algorithm wouldn’t be able to do the same. There could be over-stemming or under-stemming, and the word better could be reduced to bet or bett, or just retained as better. But stemming has no way to reduce better to its root word good. This is the difference between stemming and lemmatization.
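To make the contrast concrete, here is a minimal sketch in plain Python. It is a toy, not a real stemming or lemmatization algorithm: the suffix list, the tiny lemma dictionary, and the function names toy_stem and toy_lemmatize are all assumptions made up for illustration.

```python
# Toy suffix-stripping stemmer: it only looks at characters, never at meaning.
# The suffix list is illustrative and far smaller than any real rule set.
SUFFIXES = ("ing", "ed", "er", "s")


def toy_stem(word: str) -> str:
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemma dictionary: maps inflected forms to their lemma.
# A real lemmatizer consults a much larger lexical resource.
LEMMA_DICT = {
    "better": "good",
    "best": "good",
    "running": "run",
    "ran": "run",
    "mice": "mouse",
}


def toy_lemmatize(word: str) -> str:
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMA_DICT.get(word, word)


if __name__ == "__main__":
    for word in ("better", "running", "mice"):
        print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word)}")
```

With these made-up rules, the stemmer turns better into bett because it only sees the suffix er, while the dictionary lookup returns good. The lookup only works for words the dictionary contains, which is why real lemmatizers depend on a large lexical resource.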
Advantages and Disadvantages of Lemmatization
As you can probably tell by now, the obvious advantage of lemmatization is that it is more accurate than stemming. So, if you’re dealing with an NLP application such as a chatbot or a virtual assistant, where understanding the meaning of the dialogue is crucial, lemmatization would be useful. But this accuracy comes at a cost.
Because lemmatization involves deriving the meaning of a word from something like a dictionary, it is time-consuming. So most lemmatization algorithms are slower than their stemming counterparts. Lemmatization also adds computational overhead, although in most machine learning problems computational resources are rarely a cause for concern.
Should You Choose Lemmatization Over Stemming?
Well, I can’t answer that question. Lemmatization and stemming are both much more complex than I’ve made them appear here, and there is a lot more to consider about both approaches before making a decision. But I’ve rarely seen any significant improvement in the efficiency or accuracy of a product that uses lemmatization over stemming. In most cases, at least to my knowledge, the overhead that lemmatization demands is not justified. So, it depends on the project in question. But I want to add a disclaimer here. Most of the work I have done in NLP is text classification, and that’s where I haven’t seen a significant difference. There are applications where the overhead of lemmatization is perfectly justified, and where lemmatization is in fact a necessity.