FuzzyWuzzy Using Python
What is Fuzzy String Matching?
Fuzzy string matching is the process of finding strings that match a given pattern approximately (rather than exactly), like literally. Hence it is also known as approximate string matching. Usually the pattern that these strings are matched against is another string.
The degree of closeness between two strings is measured using Levenshtein Distance, also known as edit distance which basically is based on counting number of primitive operations required to convert one string to the exact match of the other string. The similarity index, often represented as a score out of 100, is calculated using the Levenshtein distance to quantify the similarity between two strings.
These primitive operations can consist of:
- Insertion (to insert a new character at a given position)
- Deletion (to delete a particular character)
- Substitution (to replace a character with a new one)
- Transposition (to swap positions of two letters)
Why Fuzzy Match?
Many organizations have been facing the problem of poor data quality that has prevented them from extracting useful customer insights or led to poor business decisions. The primary root cause for poor data quality has been duplicate records for most of the organizations. While it’s fairly straightforward to capture the duplicates which are exact matches, however, spotting the non-exact matches has been a difficult problem to tackle.
Also, using data which is not duplicate-free organizations tend to initiate poorly performing customer- response driven campaigns or waste of resources to manually identify the approximate matches.
For example, customer ‘Lisbeth’ who purchased product A according to store database at Location 1 may be same as ‘Lis’ who bought the same product from a different location of store according to store database at Location 2. So this is a case of same customer buying the same product and not different customers with same buying patterns.
With the advent of fuzzy matching algorithms, it has been possible to identify these hard-to-spot approximate matches.
Origin of FuzzyWuzzy package in Python
The FuzzyWuzzy library in Python was developed and open-sourced by Seatgeek to tackle the ticket search use case for their website. The original use case is discussed in detail on their blog here.
Using FuzzyWuzzy
Note that all examples in this blog are tested in Azure ML Jupyter Notebook (Python 3).
The two libraries that we need to install to use fuzzywuzzy in python are:
- fuzzywuzzy
- python-Levenshtein
To use the FuzzyWuzzy library, you need to import the necessary modules: from fuzzywuzzy import fuzz and from fuzzywuzzy import process.
Four ways of Fuzzy matching
There are four popular types of fuzzy matching logic supported by the FuzzyWuzzy Python library:
- Ratio – uses pure Levenshtein Distance based matching
- Partial Ratio – matches based on best substrings
- Token Sort Ratio – tokenizes the strings and sorts them alphabetically before matching
- Token Set Ratio – tokenizes the strings and compares the intersection and remainder
The simple ratio method calculates the similarity ratio between two strings using the Levenshtein distance.
The code snippets below highlight the difference between these four algorithms with some generic use cases:
- When compared strings differ by punctuation
2. When compared strings have different case
3. When compared strings are in different order
4. When compared strings are subset
Comparing against list of choices
The code snippet below demonstrates how you can get scores against a list of choices for a string with any of the four scorers (ratio, partial_ratio, token_sort_ratio, token_set_ratio). The choice of scorer depends on the nature of data and nature of desired results.
We can also use “score_cutoff” argument to set a threshold for the best match score. If the best match score is below threshold, it will return “None” as shown in code snippet below.
The process library in FuzzyWuzzy can be used to find the best possible string match among a list of strings, making it a powerful tool for text similarity assessment.
Applying FuzzyMatch to entire dataset
The code snippet below demonstrates how fuzzy match can be applied to an entire column of dataset_1 to return best score against the column of dataset_2 with scorer as ‘token_set_ratio’ and score_cutoff as ‘90′.
This will return the results in the format:
More Fuzzy Match Use Cases
Here’s a list of couple of use cases where fuzzy match can be used:
- To match customers for tracking all purchases of a customer to identifying the buying behavior
- To match customer addresses for segmenting customers based on location
- To find approximate matches for a search key
- To match file paths
- For spell-checking
- To detect plagiarism (text re-use)
- To match DNA sequences
- For spam filtering