Fuzzy string matching is the process of finding strings that approximately match a pattern. It does this by calculating the number of operations needed to transform one string to the other.
Why is this important? Data scientists often need to retrieve information from various sources by either leveraging publicly available APIs, asking for data or by simply scraping data from a web page. All this information is useful if we can combine it and not have any duplicates in the data. But how do we make sure that there are no duplicates?
What Is Fuzzy String Matching?
Fuzzy string matching is the process of finding strings that approximately match a pattern. It’s built using the Levenshtein distance, in which it calculates the number of operations needed to transform one string into another.
Standard deduplication methods can be used to identify repetitive data, but they only detect exact matches. If we’re retrieving the names of the most famous people in the world, these techniques probably won’t pick up that a name like “Barack Obama” is the same as “Barack H. Obama.” While these names are different, they’re likely referring to the same person. So, how do we match these names?
This is where fuzzy string matching comes in. This post will explain what fuzzy string matching, along with its use cases and examples using the Python library FuzzyWuzzy.
Fuzzy Logic vs. Boolean Logic
Fuzzy logic is a form of multi-valued logic that deals with reasoning that is approximate rather than fixed and exact. Fuzzy logic values range between 1 and 0, i.e. the value may range from completely true to completely false. In contrast, Boolean logic is a two-valued logic, true or false, that’s usually denoted as 1 and 0, respectively. It deals with reasoning that is fixed and exact. Fuzzy logic tends to reflect how people think and attempts to model our decision-making, hence it’s now leading to new intelligent systems (expert systems).
So, if we are comparing two strings using fuzzy logic, we would attempt to answer the question, “How similar are string A and string B?” In Boolean logic, we’d rephrase it as, “Are string A and string B the same?”
Fuzzy String Matching Explained
Fuzzy string matching, also known as approximate string matching, is the process of finding strings that approximately match a pattern. The process has various applications, such as spell checking, DNA analysis and detection, spam detection and plagiarism detection.
How to Do Fuzzy String Matching in Python With FuzzyWuzzy
FuzzyWuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences and patterns. It was developed and also open-sourced by SeatGeek, a service that finds event tickets from all over the internet and showcases them on one platform. The problem they faced involved the same events being given different labels. This is the same as the example I gave at the beginning of the post, in which an entity, such as a person’s name, can be labeled differently on different sources.
How to Install FuzzyWuzzy
To install the library, you can use pip:
pip install fuzzywuzzy
pip install python-Levenshtein
FuzzyWuzzy Examples
First we have to import the fuzzywuzzy modules:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Now, we can get the similarity score of two strings by using the two methods methods ratio() or partial_ratio():
fuzz.ratio("Catherine M Gitau","Catherine Gitau")
#91
fuzz.partial_ratio("Catherine M. Gitau","Catherine Gitau")
#100
You’re probably wondering why the scores are different. This is because the fuzz.ratio() method just calculates the edit distance between some ordering of the token in both input strings using the difflib.ratio. The fuzz.partial_ratio() takes in the shortest string, which in this case is “Catherine Gitau” (length 14) , then matches it with all the sub-strings of length (14) in “Catherine M. Gitau.” That means matching with “Catherine Gitau,” which returns 100 percent. You can play around with the strings until you get the gist.
What if we switched up two names in one string? In the following example, I’ve interchanged the name “Catherine Gitau” to “Gitau Catherine.” Let’s see the scores:
fuzz.ratio("Catherine M Gitau","Gitau Catherine")
#55
fuzz.partial_ratio("Catherine M. Gitau","Gitau Catherine")
#60
We see that both methods are giving out low scores. This can be rectified by using token_sort_ratio() method. This method attempts to account for similar strings that are out of order. For example, if we used the above strings again but using token_sort_ratio(), we get the following:
fuzz.token_sort_ratio("Catherine Gitau M.", "Gitau Catherine")
#94
As you can see, we get a high score of 94.
This article has introduced fuzzy string matching, which is a well-known problem that is built on Levenshtein distance. It calculates how similar two strings are. This can also be calculated by finding out the number of operations needed to transform one string to the other, e.g. with the name “Barack,” one might spell it as “Barac.” Only one operation is needed to correct this, i.e. adding a K at the end. You can try this out using the stringdist library in R as such:
adist("Barack", "Barac")
#[1]
Frequently Asked Questions
What is fuzzy string matching?
Fuzzy string matching is the process of identifying two strings that approximately match each other. It accomplishes this by determining the number of operations needed to transform one string into another.
What is FuzzyWuzzy in Python?
FuzzyWuzzy is a Python library that was created and open-sourced by the event ticket company SeatGeek. It uses the Levenshtein distance to calculate the number of operations needed to transform one string into another.
How do I install FuzzyWuzzy?
You can use pip install fuzzywuzzy and pip install python-Levenshtein to install FuzzyWuzzy.
