Using FuzzyWuzzy for Fuzzy String Matching in Python

This article introduces the FuzzyWuzzy Python library, explains its Levenshtein‑based matching functions (Ratio, Partial Ratio, Token Sort Ratio, Token Set Ratio) and the process module, and demonstrates practical applications for fuzzy matching of company and province names with complete code examples.

Python Programming Learning Circle

Feb 4, 2024

In data processing, fields often contain slight variations (e.g., "Guangxi" vs. "Guangxi Zhuang Autonomous Region"), requiring flexible matching. The article presents FuzzyWuzzy, a Python package that leverages the Levenshtein Distance algorithm to compute similarity scores between strings.
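Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a minimal sketch of the classic dynamic-programming computation, not FuzzyWuzzy's own implementation. The function names and the 0-100 normalization are illustrative assumptions; the library's exact scoring formula may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # prev[j] is the distance between the prefix of a handled so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0-100, where 100 means identical strings."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))
print(similarity("Guangxi", "Guangxi Zhuang Autonomous Region"))
```

A low whole-string score for the second pair is why plain edit distance alone is not enough for abbreviated names, which motivates the substring and token-based scorers.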

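The library provides four main ratio functions, each returning a score from 0 to 100:
fuzz.ratio – similarity of the two whole strings.
fuzz.partial_ratio – similarity of the shorter string to its best-matching substring of the longer one.
fuzz.token_sort_ratio – tokenizes, lower-cases, and sorts the words before comparing, so word order is ignored.
fuzz.token_set_ratio – tokenizes both strings and compares the shared tokens with each string's remaining tokens, so duplicated or extra words count for less.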
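The ideas behind the four scorers can be sketched with the standard library alone. This is a simplified approximation built on difflib.SequenceMatcher, not the library's code: the function bodies are assumptions written for illustration, and their scores may not match FuzzyWuzzy's exactly.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any equal-length window of the longer."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0
    windows = (long_[i:i + len(short)] for i in range(len(long_) - len(short) + 1))
    return max(ratio(short, w) for w in windows)


def token_sort_ratio(a: str, b: str) -> int:
    """Lower-case, split on whitespace, sort the tokens, then compare."""
    def norm(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return ratio(norm(a), norm(b))


def token_set_ratio(a: str, b: str) -> int:
    """Compare the shared tokens with each string's shared-plus-unique tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(ta & tb))
    full_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    full_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, full_a), ratio(common, full_b), ratio(full_a, full_b))


print(ratio("Guangxi", "Guangxi Zhuang Autonomous Region"))
print(partial_ratio("Guangxi", "Guangxi Zhuang Autonomous Region"))
print(token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
print(token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))
```

The substring scorer rescues the abbreviated province name, while the two token scorers neutralize word order and repeated words respectively.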

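Additionally, the process module matches one query string against a list of candidates:
process.extract – returns a list of the best matches as (match, score) tuples, highest score first; the limit argument caps the list length.
process.extractOne – returns only the single highest-scoring match as a (match, score) tuple.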
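The behavior of the two process helpers can be illustrated with a small stand-in written against the standard library. The names score, extract, and extract_one below are hypothetical and are not the library's functions; the scorer is a simple difflib-based substitute, so scores will differ from what FuzzyWuzzy reports.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Case-insensitive 0-100 similarity (a stand-in for the library's scorers)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract(query, choices, limit=5):
    """Return up to `limit` (choice, score) tuples, best score first."""
    scored = [(c, score(query, c)) for c in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


def extract_one(query, choices):
    """Return the single best (choice, score) tuple, or None if there are no choices."""
    best = extract(query, choices, limit=1)
    return best[0] if best else None


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
top_two = extract("new york jets", choices, limit=2)
best = extract_one("new york jets", choices)
print(top_two)
print(best)
```

Returning (choice, score) tuples in descending order is what lets a caller apply a threshold afterwards and keep the first candidate that passes.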

Practical examples show how to fuzzy‑match company names and province names. A reusable function fuzzy_merge is defined to join two DataFrames on fuzzy‑matched keys, with parameters for the left and right tables, key columns, similarity threshold, and result limit.

from fuzzywuzzy import process


# fuzzy matching function
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: left DataFrame (modified in place)
    :param df_2: right DataFrame
    :param key1: column in df_1 to match
    :param key2: column in df_2 to match
    :param threshold: minimum similarity score (0-100)
    :param limit: number of top matches to consider
    :return: df_1 with a new 'matches' column containing the best match,
             or '' when no candidate reaches the threshold
    """
    # candidate strings from the right table
    s = df_2[key2].tolist()
    # top `limit` (candidate, score) pairs for each left key, best first
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    # keep the first candidate whose score reaches the threshold, else ''
    df_1['matches'] = m.apply(lambda x: next((i[0] for i in x if i[1] >= threshold), ''))
    return df_1


# example usage: `data` and `company` are DataFrames that each have
# a '公司名称' (company name) column
result_df = fuzzy_merge(data, company, '公司名称', '公司名称', threshold=90)
result_df
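The threshold step inside fuzzy_merge is the part that decides whether a row gets a match at all, so it is worth isolating. The sketch below reproduces only that selection logic on plain lists; pick_match is a hypothetical helper, and the company names and scores are made-up sample values, not output from the library.

```python
def pick_match(candidates, threshold=90):
    """Return the first candidate name whose score meets the threshold, else ''.

    `candidates` is a list of (name, score) tuples sorted best-first,
    the shape that process.extract returns.
    """
    return next((name for name, score in candidates if score >= threshold), "")


# Hypothetical scored candidates for two left-table keys
rows = {
    "Acme Corp": [("Acme Corporation", 95), ("Acme Holdings", 86)],
    "Globex": [("Initech", 45), ("Umbrella", 30)],
}
matches = {key: pick_match(cands) for key, cands in rows.items()}
print(matches)
```

Rows with no candidate at or above the threshold receive an empty string, which makes unmatched records easy to filter for manual review.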

The article also includes installation instructions:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy

It notes that installing python-Levenshtein can speed up calculations.

Overall, the guide equips readers with the concepts, functions, and ready‑to‑use code to perform robust fuzzy string matching in Python for data cleaning and integration tasks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
