Fuzzy string matching is the process of finding strings that approximately match a pattern. It does this by calculating the number of operations needed to transform one string to the other.
Why is this important? Well, as a data scientist, you often need to to retrieve information from various sources by either leveraging publicly available API’s, asking for data or by simply scraping your own data from a web page. All this information is useful if we are able to combine it and not have any duplicates in the data. But how do we make sure that there are no duplicates?
What Is Fuzzy String Matching?
Fuzzy string matching is the process of finding strings that match a pattern. It’s built using the Levenshtein distance, in which it calculates the number of operations needed to transform one string into another. For example, fuzzy string matching will be able to recognize that the names Barack Obama and Barack H. Obama refer to the same person.
I know, “You can just use a function that retrieves all the unique information thus removing duplicates.” Well, that’s one way, but if we’re retrieving the names of the most famous people in the world, our function probably can’t tell that a name like “Barack Obama” is the same as “Barack H. Obama.” While these names are different, they’re likely referring to the same person. So, how do we match these names?
This is where fuzzy string matching comes in. This post will explain what fuzzy string matching is together with its use cases and give examples using Python’s Library FuzzyWuzzy.
Fuzzy Logic vs. Boolean Logic
Fuzzy logic is a form of multi-valued logic that deals with reasoning that is approximate rather than fixed and exact. Fuzzy logic values range between 1 and 0, i.e. the value may range from completely true to completely false. In contrast, Boolean logic is a two-valued logic, true or false, that’s usually denoted as 1 and 0, respectively. It deals with reasoning that is fixed and exact. Fuzzy logic tends to reflect how people think and attempts to model our decision making, hence it’s now leading to new intelligent systems (expert systems).
So, if we are comparing two strings using fuzzy logic, we would be trying to answer the question, “How similar are string A and string B?” In Boolean logic, we’d rephrase it as, “Are string A and string B the same?”
Fuzzy String Matching Explained
Fuzzy string matching, also known as approximate string matching, is the process of finding strings that approximately match a pattern. The process has various applications, such as spell checking, DNA analysis and detection, spam detection and plagiarism detection, etc.
How to Do Fuzzy String Matching in Python With FuzzyWuzzy
FuzzyWuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences and patterns. It was developed and also open-sourced by SeatGeek, a service that finds event tickets from all over the internet and showcases them on one platform. The problem they faced involved same events being given different labels. This is the same as the example I gave at the beginning of the post, in which an entity, such as a person’s name, can be labeled differently on different sources.
How to Install FuzzyWuzzy
To install the library, you can use pip:
pip install fuzzywuzzy pip install python-Levenshtein
First we have to import the
from fuzzywuzzy import fuzz from fuzzywuzzy import process
Now, we can get the similarity score of two strings by using the following methods two
methods ratio() or
fuzz.ratio("Catherine M Gitau","Catherine Gitau") #91 fuzz.partial_ratio("Catherine M. Gitau","Catherine Gitau") #100
You’re probably wondering why the scores are different. This is because the
fuzz.ratio() method just calculates the edit distance between some ordering of the token in both input strings using the difflib.ratio. The
fuzz.partial_ratio() takes in the shortest string, which in this case is “Catherine Gitau” (length 14) , then matches it with all the sub-strings of length (14) in “Catherine M. Gitau.” That means matching with “Catherine Gitau,” which returns 100 percent. You can play around with the strings until you get the gist.
What if we switched up two names in one string? In the following example, I’ve interchanged the name “Catherine Gitau” to “Gitau Catherine.” Let’s see the scores:
fuzz.ratio("Catherine M Gitau","Gitau Catherine") #55 fuzz.partial_ratio("Catherine M. Gitau","Gitau Catherine") #60
We see that both methods are giving out low scores, this can be rectified by using
token_sort_ratio() method. This method attempts to account for similar strings that are out of order. For example, if we used the above strings again but using
token_sort_ratio(), we get the following:
fuzz.token_sort_ratio("Catherine Gitau M.", "Gitau Catherine") #94
As you can see, we get a high score of 94.
This article has introduced fuzzy string matching, which is a well known problem that is built on Levenshtein distance. It calculates how similar two strings are. This can also be calculated by finding out the number of operations needed to transform one string to the other, e.g. with the name “Barack,” one might spell it as “Barac.” Only one operation is needed to correct this, i.e. adding a K at the end. You can try this out using the
stringdist library in R as such:
adist("Barack", "Barac") #