What is Fuzzy String Matching?
Fuzzy string matching is the process of finding strings that match a given pattern approximately (rather than exactly), like literally. Hence it is also known as approximate string matching. Usually the pattern that these strings are matched against is another string.
The degree of closeness between two strings is measured using Levenshtein Distance, also known as edit distance which basically is based on counting number of primitive operations required to convert one string to the exact match of the other string.
These primitive operations can consist of:
Insertion (to insert a new character at a given position)
Deletion (to delete a particular character)
Substitution (to replace a character with a new one)
- Transposition (to swap positions of two letters)
Why Fuzzy Match?
Many organizations have been facing the problem of poor data quality that has prevented them from extracting useful customer insights or led to poor business decisions. The primary root cause for poor data quality has been duplicate records for most of the organizations. While it’s fairly straightforward to capture the duplicates which are exact matches, however, spotting the non-exact matches has been a difficult problem to tackle.
Also, using data which is not duplicate-free organizations tend to initiate poorly performing customer- response driven campaigns or waste of resources to manually identify the approximate matches.
For example, customer ‘Lisbeth’ who purchased product A according to store database at Location 1 may be same as ‘Lis’ who bought the same product from a different location of store according to store database at Location 2. So this is a case of same customer buying the same product and not different customers with same buying patterns.
With the advent of fuzzy matching algorithms, it has been possible to identify these hard-to-spot approximate matches.
Origin of FuzzyWuzzy package in Python
FuzzyWuzzy package in python was developed and open-sourced by Seatgeek to tackle the ticket search usecase for their website. The original usecase is discussed in detail on their blog here.
Note that all examples in this blog are tested in Azure ML Jupyter Notebook (Python 3).
The two libraries that we need to install to use fuzzywuzzy in python are:
Four ways of Fuzzy matching
There are four popular types of fuzzy matching logic supported by fuzzywuzzy package:
- Ratio – uses pure Levenshtein Distance based matching
- Partial Ratio – matches based on best substrings
- Token Sort Ratio – tokenizes the strings and sorts them alphabetically before matching
- Token Set Ratio – tokenizes the strings and compared the intersection and remainder
The code snippets below highlights the difference between these four algorithms with some generic use cases:
1. When compared strings differ by punctuation
2. When compared strings have different case
3. When compared strings are in different order
4. When compared strings are subset
Comparing against list of choices
The code snippet below demonstrates how you can get scores against a list of choices for a string with any of the four scorers (ratio, partial_ratio, token_sort_ratio, token_set_ratio). The choice of scorer depends on the nature of data and nature of desired results.
We can also use “score_cutoff” argument to set a threshold for the best match score. If the best match score is below threshold, it will return “None” as shown in code snippet below.
Applying FuzzyMatch to entire dataset
The code snippet below demonstrates how fuzzy match can be applied to an entire column of dataset_1 to return best score against the column of dataset_2 with scorer as ‘token_set_ratio’ and score_cutoff as ‘90’.
This will return the results in the format:
More Fuzzy Match Use Cases
Here’s a list of couple of use cases where fuzzy match can be used:
To match customers for tracking all purchases of a customer to identifying the buying behavior
To match customer addresses for segmenting customers based on location
To find approximate matches for a search key
To match file paths
To detect plagiarism (text re-use)
To match DNA sequences
For spam filtering