Fuzzy Matching Guide: How to Fix Inconsistent Text Data in Python

LDS Team · Let's Data Science

You have two datasets to merge. One lists a company as "Apple Inc." The other lists "Apple Incorporated." You try a standard SQL JOIN or Pandas merge, and... nothing. Zero matches.

This scenario is the nightmare of data wrangling. In the real world, humans enter data, and humans are inconsistent. We write "N.Y.C.", "New York City", and "NYC" interchangeably. While these variations mean the same thing to a human reader, they are completely distinct to a computer looking for exact binary equality.

Fuzzy matching is the bridge between human inconsistency and machine precision. Instead of asking "Are these strings identical?" fuzzy matching asks "How similar are these strings?"

In this guide, we will move beyond standard cleaning techniques. We will explore the mathematical intuition behind string similarity, implement robust fuzzy matching pipelines in Python using the thefuzz library, and apply these techniques to clean a real-world messy dataset.

What is fuzzy matching?

Fuzzy matching is a technique used to identify non-identical strings that refer to the same entity by calculating a similarity score. Rather than returning a simple True/False based on exact equality, fuzzy matching algorithms quantify how close two text strings are—typically on a scale of 0 to 100—allowing data scientists to set thresholds for "good enough" matches.

When you use a standard equality operator (==) in Python, the string "Data Science" does not equal "Data  Science" (notice the double space in the second string). To the computer, they are as different as "Apple" and "Banana" because the byte sequences differ.

Fuzzy matching solves this by measuring the "distance" between strings. If the distance is small (meaning few changes are needed to make them identical), the similarity score is high.
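
For a quick feel of the difference, here is a tiny sketch using only Python's built-in difflib (we introduce thefuzz itself below); SequenceMatcher reports a score of about 0.96 for this pair:

python
import difflib

a = "Data Science"
b = "Data  Science"   # note the double space

# Exact equality fails because the byte sequences differ
print(a == b)                                       # False

# A similarity measure treats them as nearly identical
print(difflib.SequenceMatcher(None, a, b).ratio())  # 0.96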

How does the Levenshtein Distance algorithm work?

The Levenshtein Distance is the most common metric for calculating string similarity. The metric counts the minimum number of single-character edits required to change one word into another.

An "edit" can be one of three actions:

  1. Insertion (adding a character)
  2. Deletion (removing a character)
  3. Substitution (swapping one character for another)

For example, transforming "kitten" into "sitting":

  1. sitten (Substitute 'k' with 's')
  2. sittin (Substitute 'e' with 'i')
  3. sitting (Insert 'g' at the end)

The distance is 3.

The Mathematical Definition

Formally, the Levenshtein distance between two strings $a$ and $b$ (of lengths $|a|$ and $|b|$) is given by $\operatorname{lev}_{a,b}(|a|, |b|)$, where:

$$\operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j)=0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\ \operatorname{lev}_{a,b}(i,j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$

In Plain English: This formula calculates the cost of transformation.

  • The first case says: "If one string is empty, the cost is just the length of the other string (because you have to insert every character)."
  • The second case finds the cheapest path among three options:
    1. Delete a character from string $a$ (cost + 1).
    2. Insert a character into string $b$ (cost + 1).
    3. Match or Substitute: If the characters are the same, move on (cost + 0). If they differ, swap them (cost + 1).

The algorithm recursively finds the path with the lowest total cost.
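
To make the recurrence concrete, here is a minimal dynamic-programming sketch of the same algorithm (the function name levenshtein is ours, not part of any library):

python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    # dp[i][j] holds the distance between a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]

    for i in range(len(a) + 1):
        dp[i][0] = i  # delete every remaining character of a
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert every remaining character of b

    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitution_cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                      # deletion
                dp[i][j - 1] + 1,                      # insertion
                dp[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dp[len(a)][len(b)]

print(levenshtein("kitten", "sitting"))  # 3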

Similarity Ratio

Data scientists typically prefer a normalized score (0-100) rather than a raw distance number. A common formula for this ratio is:

$$\text{Ratio} = \left( 1 - \frac{\text{Levenshtein Distance}}{\text{Length}(a) + \text{Length}(b)} \right) \times 100$$

In Plain English: This converts the "edit cost" into a percentage. If the distance is 0 (identical strings), the ratio is 100%. If the strings share nothing and require total rewriting, the ratio drops toward 0%.
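
As a quick sanity check, plugging the kitten/sitting distance from earlier into the formula gives roughly 77%. Note that thefuzz's ratio() (via rapidfuzz) treats a substitution as an insertion plus a deletion in this normalization, so the library's number for a given pair can differ somewhat from this hand calculation.

python
# Hand calculation for "kitten" -> "sitting" using the formula above
distance = 3                                   # edits computed earlier
total_length = len("kitten") + len("sitting")  # 6 + 7 = 13

ratio = (1 - distance / total_length) * 100
print(round(ratio, 1))  # 76.9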

How do we implement fuzzy matching in Python?

The Python ecosystem relies heavily on thefuzz (formerly fuzzywuzzy) for this task. thefuzz uses Levenshtein-style distances to calculate similarity ratios. Under the hood, recent versions delegate the string comparisons to rapidfuzz, a fast C++ implementation (the older fuzzywuzzy package fell back to Python's built-in difflib when the optional C speedup was not installed).

First, let's install the library:

bash
pip install thefuzz

Now, let's look at the four main matching logic types. Understanding the difference between them is critical for avoiding false positives.

1. Simple Ratio

The ratio() function calculates the standard Levenshtein similarity. It is strict about string length and order.

python
from thefuzz import fuzz

# Exact match logic
print(fuzz.ratio("New York City", "New York City"))  
# Output: 100

# Minor difference (typo)
print(fuzz.ratio("New York City", "New York Cty"))   
# Output: 96

# Major difference (extra words hurt the score significantly)
print(fuzz.ratio("New York City", "NYC"))            
# Output: 38

2. Partial Ratio

The partial_ratio() function matches the shorter string against substrings of the longer string and keeps the best score. This approach is excellent when one string is a subset of the other.

python
# "New York" is fully contained inside "New York City"
print(fuzz.partial_ratio("New York", "New York City")) 
# Output: 100

# "New York" vs "York New" (Order still matters!)
print(fuzz.partial_ratio("New York", "York New"))      
# Output: 64

When to use: When you are matching "Company Name" vs "Company Name Inc."
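
Tying this back to the opening example, here is a sketch of how partial_ratio handles company-name suffixes (we drop the trailing period here; light punctuation cleanup is typically done first):

python
from thefuzz import fuzz

# "Apple Inc" is a contiguous substring of "Apple Incorporated",
# so the best-matching window is a perfect match
print(fuzz.partial_ratio("Apple Inc", "Apple Incorporated"))
# Output: 100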

3. Token Sort Ratio

The token_sort_ratio() function tokenizes the string (splits it by spaces), sorts the tokens alphabetically, and then calculates the ratio(). This approach ignores word order.

python
# Different order, same words
print(fuzz.token_sort_ratio("New York City", "City New York")) 
# Output: 100

When to use: When word order is inconsistent, e.g., "Smith, John" vs "John Smith".
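
For instance, the reordered-name case mentioned above (the default preprocessing lowercases the text and strips punctuation before sorting the tokens):

python
from thefuzz import fuzz

# Punctuation is stripped and the tokens are sorted before comparison
print(fuzz.token_sort_ratio("Smith, John", "John Smith"))
# Output: 100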

4. Token Set Ratio

The token_set_ratio() function is the most flexible. It splits tokens, removes duplicates, finds the intersection (common words), and then calculates similarity. This method is incredibly powerful for strings with varying levels of detail.

python
# Duplicate words and extra noise
str1 = "New York City"
str2 = "New York City (The Big Apple)"

print(fuzz.token_set_ratio(str1, str2)) 
# Output: 100

In Plain English: The algorithm sees that "New", "York", and "City" appear in both strings. Since the core intersection matches perfectly, token_set_ratio gives a perfect score, ignoring the extra text in parentheses.
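
To see why the set logic matters, compare token_sort_ratio and token_set_ratio on the same pair; the exact lower number varies by version, but the gap is the point:

python
from thefuzz import fuzz

str1 = "New York City"
str2 = "New York City (The Big Apple)"

# token_sort_ratio keeps every token, so the extra words drag the score down
print(fuzz.token_sort_ratio(str1, str2))  # noticeably below 100

# token_set_ratio scores the shared tokens on their own, so it stays perfect
print(fuzz.token_set_ratio(str1, str2))   # 100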

Hands-on: Cleaning a Real-World Dataset

Let's apply these concepts to our lds_data_wrangling.csv dataset. We have inconsistent entries in the City and Category columns.

⚠️ Prerequisite: We will use Pandas for data manipulation. If you are new to cleaning workflows, check out our guide on Data Cleaning: A Complete Workflow.

Step 1: Loading and Inspecting the Mess

First, we load the data and look at the unique values in the City column to understand the scope of the problem.

python
import pandas as pd
from thefuzz import process, fuzz

# Load the dataset
df = pd.read_csv('lds_data_wrangling.csv')

# Inspect unique cities to see the mess
unique_cities = df['City'].unique()
print(f"Unique Cities (First 10):\n{unique_cities[:10]}")

Expected Output:

text
Unique Cities (First 10):
['Boston' 'boston' 'Chicago' 'CHICAGO' 'Seatle' 'Seattle' 'San Fran' 'SF' 'New York' 'NYC']

We see casing issues ("boston"), typos ("Seatle"), and abbreviations ("SF"). A standard groupby would treat "Seattle" and "Seatle" as different cities.
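
Before fixing anything, it helps to quantify the fragmentation. A quick check (continuing with the df loaded above) shows that lowercasing alone is not enough, because typos and abbreviations survive:

python
# How many distinct spellings are hiding in the column?
print("Raw unique cities:     ", df['City'].nunique())

# Lowercasing fixes casing issues, but "Seatle", "SF", etc. remain distinct
print("After lowercasing only:", df['City'].str.lower().nunique())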

Step 2: Creating a Canonical List

To perform fuzzy matching effectively, you generally need a "target" list of correct values. If you don't have an external dictionary of valid cities, you can use the most frequent values in your dataset as the "source of truth."

python
# Get the top 10 most frequent cities to serve as our "Gold Standard"
# In a real project, you might load this list from a verified database
correct_cities = df['City'].value_counts().index[:10].tolist()

print("Target Cities:", correct_cities)

Step 3: The Matching Function

Now we will write a function that takes a messy string, compares it against our correct_cities list, and returns the best match only if the score exceeds a threshold.

We use process.extractOne, which automatically compares the query string against a list of choices and returns the best match.

python
def fuzzy_match_city(messy_city, choices, threshold=80):
    """
    Finds the best match for a messy city name from a list of choices.
    """
    # Convert to string to handle NaN or numbers
    messy_city = str(messy_city)
    
    # Extract the single best match using token_sort_ratio for robustness
    # scorer=fuzz.token_sort_ratio handles casing and reordering
    match, score = process.extractOne(messy_city, choices, scorer=fuzz.token_sort_ratio)
    
    if score >= threshold:
        return match
    else:
        # If no good match, return the original (or mark as "Unknown")
        return messy_city

# Apply the function to the City column
# We create a new column to preserve original data for verification
df['City_Cleaned'] = df['City'].apply(lambda x: fuzzy_match_city(x, correct_cities))

# Check the results
comparison = df[['City', 'City_Cleaned']].drop_duplicates().head(10)
print(comparison)

Expected Output:

text
       City City_Cleaned
0    Boston       Boston
1    boston       Boston
2    Seatle      Seattle
3   Seattle      Seattle
4   Chicago      Chicago
5   CHICAGO      Chicago
...

The function successfully mapped "boston" (lowercase) and "Seatle" (typo) to their correct forms.
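
Before trusting the new column, it is worth auditing exactly which rows the matcher touched. A small check along these lines works well:

python
# Audit step: inspect every row the fuzzy matcher actually changed
changed = df[df['City'] != df['City_Cleaned']]
print(f"Rows modified: {len(changed)}")
print(changed[['City', 'City_Cleaned']].drop_duplicates())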

How do we handle categorical inconsistency?

The Category column often suffers from similar issues: "Cloth.", "Clothing", "Apparel". While fuzzy matching fixes typos, it cannot fix semantic synonyms (words that mean the same thing but look different).

💡 Pro Tip: Fuzzy matching fixes syntax (spelling). It does not fix semantics (meaning). "Laptop" and "Notebook" have a low fuzzy score but identical meaning. For semantic matching, you would need NLP techniques like Word Embeddings.

However, for variations like "Electronics" vs "Elec.", fuzzy matching works well.

python
# Let's clean the Category column
unique_categories = df['Category'].unique()
print(f"Messy Categories: {unique_categories[:5]}")

# Define a manual dictionary for known abbreviations that fuzzy match might miss
# This is a hybrid approach: Dictionary first, then Fuzzy Match
category_mapping = {
    'Elec.': 'Electronics',
    'Cloth.': 'Clothing'
}

# Standardize distinct categories using fuzzy matching
# Assume valid categories are: ['Electronics', 'Clothing', 'Sports', 'Books']
valid_categories = ['Electronics', 'Clothing', 'Sports', 'Books']

def clean_category(cat):
    # Convert to string to guard against NaN or mixed types
    cat = str(cat)

    # 1. Direct dictionary lookup
    if cat in category_mapping:
        return category_mapping[cat]
    
    # 2. Fuzzy match
    match, score = process.extractOne(cat, valid_categories, scorer=fuzz.token_set_ratio)
    if score > 85:
        return match
    return cat

df['Category_Cleaned'] = df['Category'].apply(clean_category)
print(df[['Category', 'Category_Cleaned']].drop_duplicates())

What are the performance limitations?

Fuzzy matching is computationally expensive. The complexity is roughly $O(N \times M)$, where $N$ is the number of rows in your dirty dataset and $M$ is the number of entries in your clean reference list.

If you have 100,000 messy rows and compare them against 1,000 reference cities, that is 100 million operations.

Strategies for Speed

  1. Blocking (Preprocessing): Only compare strings that share the same first letter or sound (using phonetic algorithms like Soundex); a minimal sketch of this idea follows the list.
  2. Vectorization: Use rapidfuzz instead of thefuzz. rapidfuzz is a drop-in replacement written in C++ that is significantly faster.
  3. TF-IDF Filtering: Use TF-IDF (Term Frequency-Inverse Document Frequency) to find a candidate set of "similar looking" strings using cosine similarity (which is fast matrix math) before running the expensive Levenshtein distance on just the top 5 candidates.
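
Here is a minimal sketch of the blocking idea from strategy 1, using thefuzz and plain Python. The first-letter rule and the reference list are illustrative assumptions; production pipelines more often block on phonetic codes or TF-IDF candidates:

python
from collections import defaultdict
from thefuzz import fuzz, process

reference_cities = ["Boston", "Chicago", "Seattle", "San Francisco", "New York"]

# Build blocks keyed by the first letter of each reference value
blocks = defaultdict(list)
for city in reference_cities:
    blocks[city[0].lower()].append(city)

def match_with_blocking(messy_value, threshold=80):
    """Compare only against reference values sharing the same first letter."""
    messy_value = str(messy_value)
    candidates = blocks.get(messy_value[:1].lower(), [])
    if not candidates:
        return messy_value  # no block found: keep original (or flag for review)
    match, score = process.extractOne(messy_value, candidates,
                                      scorer=fuzz.token_sort_ratio)
    return match if score >= threshold else messy_value

print(match_with_blocking("seatle"))   # Seattle
print(match_with_blocking("CHICAGO"))  # Chicago

The trade-off is recall: first-letter blocking would never consider "NYC" against "New York City", which is exactly why blocking keys need to be chosen carefully.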

How do we choose the right threshold?

Choosing the cutoff score (e.g., 80 vs 90) is an art.

  • High Threshold (>90): Safe, high precision. You will miss some matches (False Negatives), but the matches you find will likely be correct.
  • Low Threshold (<70): Risky. You catch more matches (High Recall), but you introduce False Positives.

Example of a False Positive: comparing "Austria" and "Australia".

  • fuzz.ratio("Austria", "Australia") = 88

If your threshold is 85, your script will change "Austria" to "Australia," corrupting your geographical data.

⚠️ Common Pitfall: Never blindly accept fuzzy matches in production without a manual audit step or a very high confidence threshold (90+). Always flag uncertain matches (70-85) for human review.
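
One way to operationalize that advice is a tiered matcher that auto-accepts only very confident matches and routes the grey zone to a review queue. A sketch, with band boundaries that are assumptions to tune per dataset:

python
from thefuzz import fuzz, process

def match_with_review(value, choices, accept_at=90, review_at=70):
    """Return (best_match, score, action), where action is accept/review/reject."""
    match, score = process.extractOne(str(value), choices,
                                      scorer=fuzz.token_sort_ratio)
    if score >= accept_at:
        return match, score, "accept"   # safe to apply automatically
    if score >= review_at:
        return match, score, "review"   # queue for a human decision
    return value, score, "reject"       # keep the original value

print(match_with_review("Austria", ["Australia", "Austria", "Germany"]))
# ('Austria', 100, 'accept')

print(match_with_review("Austria", ["Australia", "Germany"]))
# ('Australia', 88, 'review')  ->  flagged for a human instead of silently corrupted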

Conclusion

Fuzzy matching is an essential tool in the data wrangler's toolkit, bridging the gap between messy human entry and strict machine logic. By quantifying text similarity using Levenshtein distance, we can automate the standardization of inconsistent categories, names, and addresses.

In this guide, we covered:

  • The Intuition: How edit distance measures similarity.
  • The Algorithms: When to use ratio, partial_ratio, and token_set_ratio.
  • The Implementation: Using process.extractOne to clean the City column.
  • The Risks: Why thresholds matter to prevent False Positives.

To master data wrangling, your next step is ensuring your data types are handled correctly before matching. Check out our guide on Mastering Messy Dates in Python to handle temporal data with the same level of robustness.


Hands-On Practice

In the real world, text data is rarely perfect. Humans make typos, use different abbreviations (like 'NYC' vs 'New York City'), and vary casing. This inconsistency makes grouping data or joining tables impossible with standard exact matching. Fuzzy matching solves this by calculating the similarity between strings. In this guide, we will implement a robust fuzzy matching pipeline using Python's standard libraries to clean a messy customer dataset, consolidating inconsistent city names into a canonical format.

Dataset: Customer Data (Data Wrangling). An intentionally messy customer dataset with 1,050 rows designed for data wrangling tutorials. It contains missing values (MCAR, MAR, MNAR patterns), exact/near duplicates, messy date formats, inconsistent categories with typos, mixed data types, and outliers, plus clean reference columns for validation.

Try It Yourself
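
If you want to reproduce the exercise with nothing but the standard library, difflib.get_close_matches is a lighter-weight alternative to thefuzz. A minimal sketch; the canonical list is illustrative, and the file and column names follow the dataset described above:

python
import difflib
import pandas as pd

df = pd.read_csv('lds_data_wrangling.csv')

canonical_cities = ["Boston", "Chicago", "Seattle", "San Francisco", "New York"]

def consolidate_city(raw_city, cutoff=0.8):
    """Map a messy city name onto the canonical list using difflib."""
    normalized = str(raw_city).strip().title()  # handles 'seattle ' and 'CHICAGO'
    candidates = difflib.get_close_matches(normalized, canonical_cities,
                                           n=1, cutoff=cutoff)
    return candidates[0] if candidates else raw_city

df['City_Cleaned'] = df['City'].apply(consolidate_city)
print(df['City_Cleaned'].nunique(), "unique cities after consolidation")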


By applying fuzzy matching, we successfully consolidated fragmented text data. The algorithm identified that 'Seatle', 'seattle', and 'Seattle ' were all the same entity, reducing the number of unique categories and improving data quality for downstream analysis. While we used Python's standard difflib here, libraries like thefuzz (formerly fuzzywuzzy) offer even more optimized algorithms for large-scale production systems.