Add a header to begin generating the table of contents

    What is Fuzzy String Matching?

    Fuzzy string matching is the process of finding strings that match a given pattern approximately (rather than exactly), like literally. Hence it is also known as approximate string matching. Usually the pattern that these strings are matched against is another string.

    The degree of closeness between two strings is measured using Levenshtein Distance, also known as edit distance which basically is based on counting number of primitive operations required to convert one string to the exact match of the other string. The similarity index, often represented as a score out of 100, is calculated using the Levenshtein distance to quantify the similarity between two strings.

    These primitive operations can consist of:

    • Insertion (to insert a new character at a given position)
    • Deletion (to delete a particular character)
    • Substitution (to replace a character with a new one)
    • Transposition (to swap positions of two letters)

    Why Fuzzy Match?

    Many organizations have been facing the problem of poor data quality that has prevented them from extracting useful customer insights or led to poor business decisions. The primary root cause for poor data quality has been duplicate records for most of the organizations. While it’s fairly straightforward to capture the duplicates which are exact matches, however, spotting the non-exact matches has been a difficult problem to tackle.

    Also, using data which is not duplicate-free organizations tend to initiate poorly performing customer- response driven campaigns or waste of resources to manually identify the approximate matches.

    For example, customer ‘Lisbeth’ who purchased product A according to store database at Location 1 may be same as ‘Lis’ who bought the same product from a different location of store according to store database at Location 2. So this is a case of same customer buying the same product and not different customers with same buying patterns.

    With the advent of fuzzy matching algorithms, it has been possible to identify these hard-to-spot approximate matches.

    Origin of FuzzyWuzzy package in Python

    The FuzzyWuzzy library in Python was developed and open-sourced by Seatgeek to tackle the ticket search use case for their website. The original use case is discussed in detail on their blog here.

    Using FuzzyWuzzy

    Note that all examples in this blog are tested in Azure ML Jupyter Notebook (Python 3).

    The two libraries that we need to install to use fuzzywuzzy in python are:

    1. fuzzywuzzy
    2. python-Levenshtein

    To use the FuzzyWuzzy library, you need to import the necessary modules: from fuzzywuzzy import fuzz and from fuzzywuzzy import process.

    Four ways of Fuzzy matching

    There are four popular types of fuzzy matching logic supported by the FuzzyWuzzy Python library:

    1. Ratio – uses pure Levenshtein Distance based matching
    2. Partial Ratio – matches based on best substrings
    3. Token Sort Ratio – tokenizes the strings and sorts them alphabetically before matching
    4. Token Set Ratio – tokenizes the strings and compares the intersection and remainder

    The simple ratio method calculates the similarity ratio between two strings using the Levenshtein distance.

    The code snippets below highlight the difference between these four algorithms with some generic use cases:

    1. When compared strings differ by punctuation

    2. When compared strings have different case

    3. When compared strings are in different order

    4. When compared strings are subset

    Comparing against list of choices

    The code snippet below demonstrates how you can get scores against a list of choices for a string with any of the four scorers (ratio, partial_ratio, token_sort_ratio, token_set_ratio). The choice of scorer depends on the nature of data and nature of desired results.

    We can also use “score_cutoff” argument to set a threshold for the best match score. If the best match score is below threshold, it will return “None” as shown in code snippet below.

    The process library in FuzzyWuzzy can be used to find the best possible string match among a list of strings, making it a powerful tool for text similarity assessment.

    Applying FuzzyMatch to entire dataset

    The code snippet below demonstrates how fuzzy match can be applied to an entire column of dataset_1 to return best score against the column of dataset_2 with scorer as ‘token_set_ratio’ and score_cutoff as ‘90′.

    This will return the results in the format:

    More Fuzzy Match Use Cases

    Here’s a list of couple of use cases where fuzzy match can be used:

    • To match customers for tracking all purchases of a customer to identifying the buying behavior
    • To match customer addresses for segmenting customers based on location
    • To find approximate matches for a search key
    • To match file paths
    • For spell-checking
    • To detect plagiarism (text re-use)
    • To match DNA sequences
    • For spam filtering

    Related Posts

    Microsoft Fabric is amazing, and Maestro makes it better.

    Managing data is harder than ever. Organizations are dealing with […]

    How does agentic AI and industrial internet of things (IIoT) impact manufacturing?

    Manufacturers are forced to rethink how they operate as a […]

    What happens when you combine Microsoft Fabric, Purview, and Power Platform?

    In today’s fast-paced business landscape, effective data integration is crucial. […]

    Giving Thanks to Improved Labor and Operations Management for Retail with Data and AI

    There is no right way of celebrating the holiday season. […]

    Divya Saini

    Consultant 1

    Subscribe

    Sign up for emails on new digital articles and other news

    Subject to Neudesic's Privacy Policy, you agree to allow Neudesic to use your contact details to keep you informed about products, services, and offers. You can opt-out at any time.