
FuzzyWuzzy Using Python

August 24, 2017

What is Fuzzy String Matching?

Fuzzy string matching is the process of finding strings that match a given pattern approximately rather than exactly. Hence it is also known as approximate string matching. Usually the pattern that these strings are matched against is another string.

The degree of closeness between two strings is measured using the Levenshtein distance, also known as edit distance, which counts the number of primitive operations required to convert one string into an exact match of the other. A similarity index, often represented as a score out of 100, is then calculated from the Levenshtein distance to quantify how similar the two strings are.

These primitive operations can consist of:

Insertion (to insert a new character at a given position)
Deletion (to delete a particular character)
Substitution (to replace a character with a new one)
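Transposition (to swap the positions of two adjacent letters; counted as a single operation only in the Damerau-Levenshtein variant of the distance)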
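The counting described above can be sketched in a few lines of pure Python. This is a textbook dynamic-programming version of plain Levenshtein distance, not the code FuzzyWuzzy runs internally; the similarity helper and its normalisation by the longer string's length are assumptions made here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Illustrative score out of 100 (assumed normalisation, not FuzzyWuzzy's)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Lisbeth", "Lis"))     # 4
print(similarity("kitten", "sitting"))   # 57
```

Plain Levenshtein distance has no transposition operation, so swapping two adjacent characters costs two edits in this sketch.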

Why Fuzzy Match?

Many organizations face poor data quality, which prevents them from extracting useful customer insights or leads to poor business decisions. For most of these organizations, the primary root cause is duplicate records. While it’s fairly straightforward to capture duplicates that are exact matches, spotting the non-exact matches has been a difficult problem to tackle.

Organizations that work with data that is not duplicate-free also tend to launch poorly performing customer-response-driven campaigns or waste resources on manually identifying the approximate matches.

For example, customer ‘Lisbeth’, who purchased product A according to the store database at Location 1, may be the same person as ‘Lis’, who bought the same product at a different store location according to the database at Location 2. This would be a case of one customer buying the same product twice, not of different customers with the same buying pattern.

With the advent of fuzzy matching algorithms, it has become possible to identify these hard-to-spot approximate matches.

Origin of FuzzyWuzzy package in Python

The FuzzyWuzzy library in Python was developed and open-sourced by SeatGeek to tackle the ticket search use case for their website. The original use case is discussed in detail on the SeatGeek blog.

Using FuzzyWuzzy

Note that all examples in this blog were tested in an Azure ML Jupyter Notebook (Python 3).

The two libraries that we need to install to use fuzzywuzzy in Python are:

fuzzywuzzy
python-Levenshtein

To use the FuzzyWuzzy library, you need to import the necessary modules: from fuzzywuzzy import fuzz and from fuzzywuzzy import process.

Four ways of Fuzzy matching

There are four popular types of fuzzy matching logic supported by the FuzzyWuzzy Python library:

Ratio – uses pure Levenshtein Distance based matching
Partial Ratio – matches based on best substrings
Token Sort Ratio – tokenizes the strings and sorts them alphabetically before matching
Token Set Ratio – tokenizes the strings and compares the intersection and remainder

The simple ratio method calculates the similarity ratio between two strings using the Levenshtein distance.
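To make the token-based scorers less of a black box, here is a rough standard-library sketch of the preprocessing they are described as doing. It uses difflib.SequenceMatcher as a stand-in for the edit-distance ratio, and the function names simple_ratio, token_sort and token_set are hypothetical. It is a simplified illustration of the idea, not FuzzyWuzzy's implementation, and its scores may differ from the library's.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity from 0 to 100, using difflib as a stand-in ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort(a: str, b: str) -> int:
    """Lowercase, split into tokens, sort them alphabetically, then compare."""
    def normalise(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return simple_ratio(normalise(a), normalise(b))


def token_set(a: str, b: str) -> int:
    """Compare the shared tokens against shared-plus-remainder strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    rest_a = " ".join(sorted(tokens_a - tokens_b))
    rest_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + rest_a).strip()
    combined_b = (common + " " + rest_b).strip()
    # Keep the best of the three pairwise comparisons.
    return max(
        simple_ratio(common, combined_a),
        simple_ratio(common, combined_b),
        simple_ratio(combined_a, combined_b),
    )


print(token_sort("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
print(token_set("fuzzy was a bear", "fuzzy fuzzy was a bear"))         # 100
```

Sorting removes the effect of word order, while working on sets of tokens also removes the effect of repeated or extra words.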

The difference between these four algorithms shows up in some generic use cases:

1. When compared strings differ by punctuation

2. When compared strings have different case

3. When compared strings are in different order

4. When one compared string is a subset of the other

Comparing against list of choices

You can get scores against a list of choices for a string with any of the four scorers (ratio, partial_ratio, token_sort_ratio, token_set_ratio). The choice of scorer depends on the nature of the data and of the desired results.

We can also use the “score_cutoff” argument to set a threshold for the best match score. If the best match score is below the threshold, “None” is returned.

The process module in FuzzyWuzzy can be used to find the best possible string match among a list of strings, making it a powerful tool for text similarity assessment.

Applying FuzzyMatch to entire dataset

Fuzzy matching can be applied to an entire column of dataset_1 to return, for each value, the best score against the column of dataset_2, with ‘token_set_ratio’ as the scorer and a score_cutoff of 90.

More Fuzzy Match Use Cases

Here are some use cases where fuzzy matching can be used:

To match customer records so that all purchases of a customer can be tracked and buying behavior identified
To match customer addresses for segmenting customers based on location
To find approximate matches for a search key
To match file paths
For spell-checking
To detect plagiarism (text re-use)
To match DNA sequences
For spam filtering

Categories: Data & Analytics

divya

Consultant 1
div.saini@neudesic.com