What is Fuzzy String Matching?

Fuzzy string matching is the process of finding strings that match a given pattern approximately (rather than exactly), like literally. Hence it is also known as approximate string matching. Usually the pattern that these strings are matched against is another string.

The degree of closeness between two strings is measured using Levenshtein Distance, also known as edit distance which basically is based on counting number of primitive operations required to convert one string to the exact match of the other string.

These primitive operations can consist of:

  •  Insertion (to insert a new character at a given position)

  • Deletion (to delete a particular character)

  • Substitution (to replace a character with a new one)

  • Transposition (to swap positions of two letters)

Why Fuzzy Match?

Many organizations have been facing the problem of poor data quality that has prevented them from extracting useful customer insights or led to poor business decisions. The primary root cause for poor data quality has been duplicate records for most of the organizations. While it’s fairly straightforward to capture the duplicates which are exact matches, however, spotting the non-exact matches has been a difficult problem to tackle.

Also, using data which is not duplicate-free organizations tend to initiate poorly performing customer- response driven campaigns or waste of resources to manually identify the approximate matches.

For example, customer ‘Lisbeth’ who purchased product A according to store database at Location 1 may be same as ‘Lis’ who bought the same product from a different location of store according to store database at Location 2. So this is a case of same customer buying the same product and not different customers with same buying patterns.

With the advent of fuzzy matching algorithms, it has been possible to identify these hard-to-spot approximate matches.

Origin of FuzzyWuzzy package in Python

FuzzyWuzzy package in python was developed and open-sourced by Seatgeek to tackle the ticket search usecase for their website. The original usecase is discussed in detail on their blog here

Using FuzzyWuzzy

Note that all examples in this blog are tested in Azure ML Jupyter Notebook (Python 3).

The two libraries that we need to install to use fuzzywuzzy in python are:

  1. fuzzywuzzy
  2. python-Levenshtein

Four ways of Fuzzy matching

There are four popular types of fuzzy matching logic supported by fuzzywuzzy package:

  1. Ratio – uses pure Levenshtein Distance based matching
  2. Partial Ratio – matches based on best substrings
  3. Token Sort Ratio – tokenizes the strings and sorts them alphabetically before matching
  4. Token Set Ratio – tokenizes the strings and compared the intersection and remainder

The code snippets below highlights the difference between these four algorithms with some generic use cases:

1. When compared strings differ by punctuation

2. When compared strings have different case

3. When compared strings are in different order

4. When compared strings are subset

Comparing against list of choices

The code snippet below demonstrates how you can get scores against a list of choices for a string with any of the four scorers (ratio, partial_ratio, token_sort_ratio, token_set_ratio). The choice of scorer depends on the nature of data and nature of desired results.

We can also use “score_cutoff” argument to set a threshold for the best match score. If the best match score is below threshold, it will return “None” as shown in code snippet below.

Applying FuzzyMatch to entire dataset

The code snippet below demonstrates how fuzzy match can be applied to an entire column of dataset_1 to return best score against the column of dataset_2 with scorer as ‘token_set_ratio’ and score_cutoff as ‘90’.

This will return the results in the format:

More Fuzzy Match Use Cases

Here’s a list of couple of use cases where fuzzy match can be used:

  • To match customers for tracking all purchases of a customer to identifying the buying behavior

  • To match customer addresses for segmenting customers based on location

  • To find approximate matches for a search key

  • To match file paths

  • For spell-checking

  • To detect plagiarism (text re-use)

  • To match DNA sequences

  • For spam filtering