Using FuzzyWuzzy for Fuzzy String Matching in Python

This article introduces the FuzzyWuzzy Python library, explains its Levenshtein‑based matching functions, shows how to install it, and provides step‑by‑step code examples for fuzzy matching of company and province names using pandas dataframes.

Python Programming Learning Circle

May 21, 2021

When processing data, you often need to match fields that have different formats, such as abbreviated region names ("北京", "广西") versus their full forms ("北京市", "广西壮族自治区"). The article presents a solution using the FuzzyWuzzy library.

FuzzyWuzzy is a simple, easy‑to‑use fuzzy string matching toolkit that relies on the Levenshtein Distance (also called Edit Distance) algorithm to compute similarity between two strings.
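To make the idea of edit distance concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation. The function name levenshtein is chosen for illustration only; it is not FuzzyWuzzy's own implementation, and the library converts such distances into 0 to 100 similarity scores rather than returning raw edit counts.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] = distance between the processed prefix of a and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if the characters match
            ))
        previous = current
    return previous[-1]


print(levenshtein("北京", "北京市"))      # 1 (one insertion)
print(levenshtein("kitten", "sitting"))  # 3
```

A small distance relative to the string lengths means the two strings are similar, which is why an abbreviated name and its full form can still be paired.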

Install the library in an Anaconda Jupyter Notebook environment with:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy

The fuzz module provides four main functions:

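Ratio – similarity score between the two complete strings, from 0 to 100.

Partial Ratio – scores the shorter string against the best‑matching substring of the longer one, so an abbreviation contained in its full form scores 100.

Token Sort Ratio – splits each string into tokens, sorts them, and then compares, so word order is ignored.

Token Set Ratio – compares the sets of tokens in each string, so repeated tokens do not lower the score.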

Example usage:

from fuzzywuzzy import fuzz

fuzz.ratio("河南省", "河南省")  # >>> 100

fuzz.partial_ratio("河南", "河南省")  # >>> 100

fuzz.token_sort_ratio("西藏 自治区", "自治区 西藏")  # >>> 100

fuzz.token_set_ratio("西藏 西藏 自治区", "自治区 西藏")  # >>> 100
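The two token-based results above can be understood through a simplified sketch of the normalization step. The helpers sort_tokens and unique_tokens are hypothetical names used only for illustration; the library itself does more work (it also cleans the strings, and token_set_ratio compares the shared tokens against the remainders), so treat this as an approximation of the idea rather than the actual implementation.

```python
def sort_tokens(s: str) -> str:
    """Split on whitespace, sort the tokens, and join them back together."""
    return " ".join(sorted(s.split()))


def unique_tokens(s: str) -> str:
    """Like sort_tokens, but drop repeated tokens first."""
    return " ".join(sorted(set(s.split())))


# After sorting, word order no longer matters.
print(sort_tokens("西藏 自治区") == sort_tokens("自治区 西藏"))           # True

# After de-duplicating, the repeated token no longer matters either.
print(unique_tokens("西藏 西藏 自治区") == unique_tokens("自治区 西藏"))  # True
```

Once both strings normalize to the same text, any similarity score computed on them is at its maximum, which is consistent with the scores of 100 shown above.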

The process module matches a query against a finite list of candidates and returns the best matches together with their scores. Its main functions are:

process.extract(query, choices, limit=n) – returns a list of the top n matches as (match, score) tuples.

process.extractOne(query, choices) – returns the single best match as a (match, score) tuple.

Example:

from fuzzywuzzy import process

choices = ["河南省", "郑州市", "湖北省", "武汉市"]

process.extract("郑州", choices, limit=2)  # [('郑州市', 90), ('河南省', 0)]

process.extractOne("郑州", choices)  # ('郑州市', 90)
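Conceptually, this kind of extraction scores every candidate, sorts by score, and keeps the top few. The sketch below imitates that flow using only the standard library. The names simple_score and top_matches are hypothetical, and difflib's SequenceMatcher stands in for the scorer, so the numbers differ from the library's: process uses a weighted scorer (fuzz.WRatio) by default, which is why it reports 90 where this sketch reports 80.

```python
from difflib import SequenceMatcher


def simple_score(a: str, b: str) -> int:
    """Stand-in scorer: difflib similarity scaled to 0-100 (not the library's scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def top_matches(query: str, choices: list, limit: int = 5) -> list:
    """Score every choice against the query and return the `limit` best (choice, score) pairs."""
    scored = [(choice, simple_score(query, choice)) for choice in choices]
    # sort() is stable, so candidates with equal scores keep their original order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


choices = ["河南省", "郑州市", "湖北省", "武汉市"]
print(top_matches("郑州", choices, limit=2))  # [('郑州市', 80), ('河南省', 0)]
```

The ranking is the same as in the example above even though the scores differ, which is the behaviour that matters when only the best candidate is kept.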

Practical Application 1: Company Name Matching – The article shows how to wrap the fuzzy matching logic into a reusable function fuzzy_merge that merges two pandas DataFrames on fuzzy‑matched company names, with parameters for the left/right DataFrames, key columns, similarity threshold, and result limit.

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    # Candidate strings taken from the right-hand table
    s = df_2[key2].tolist()
    # For each left-hand value, collect the top `limit` (candidate, score) pairs
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    # Keep the best candidate scoring at least `threshold`, or '' if none qualifies
    m2 = m.apply(lambda x: next((i[0] for i in x if i[1] >= threshold), ''))
    df_1['matches'] = m2  # note: the column is added to df_1 in place
    return df_1

Parameters explained:

df_1 – left DataFrame (e.g., your data).

df_2 – right DataFrame (e.g., company reference table).

key1 / key2 – column names containing the strings to match.

threshold – minimum similarity score (default 90) to accept a match.

limit – number of top candidates to consider.

After running fuzzy_merge, the resulting DataFrame contains a new matches column with the best‑matched company name or an empty string if no match meets the threshold.

Practical Application 2: Province Name Matching – The same function is applied to match abbreviated province names to their full forms, demonstrating its versatility.

Finally, the article provides the complete source code for the fuzzy_merge function, including the necessary imports:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# fuzzy_merge function definition (as shown above)

# data and company are pandas DataFrames, each with a '公司名称' (company name) column
df = fuzzy_merge(data, company, '公司名称', '公司名称', threshold=90)
print(df)

The article concludes that encapsulating such utilities into reusable modules simplifies future data cleaning tasks.
