Master Fuzzy String Matching in Python with fuzzywuzzy: A Practical Guide

Learn how to efficiently clean and deduplicate textual data using Python's fuzzywuzzy library, covering Levenshtein distance fundamentals, installation, three core matching functions, advanced process extraction, and real-world code examples for handling messy Chinese strings and standardizing company names.

IT Services Circle
Core Principle: Levenshtein Distance

fuzzywuzzy scores similarity using the Levenshtein (edit) distance: the minimum number of single-character insertions, deletions, or substitutions needed to transform one string into another. It turns that comparison into a similarity score from 0 (nothing in common) to 100 (identical).
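
To make the definition concrete, here is a minimal pure-Python sketch of the edit distance itself. The function names `levenshtein` and `similarity` are hypothetical, and the 0 to 100 conversion shown is only one plausible normalization chosen for illustration; it is not claimed to be the formula fuzzywuzzy uses internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Illustrative 0-100 score (an assumption, not fuzzywuzzy's own formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("小米科技", "小木科技"))  # 1 (one substituted character)
print(similarity("小米科技", "小木科技"))   # 75
```

The row-by-row table keeps memory proportional to the shorter string, which matters little here but is the usual way to write this algorithm.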

Installation

Install the library with pip. Optionally install python-levenshtein as well: without it, fuzzywuzzy falls back to a slower pure-Python matcher and emits a warning on import.

pip install fuzzywuzzy
pip install python-levenshtein

Three Matching Strategies

fuzz.ratio()

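fuzz.ratio() compares the two full strings character by character. It suits strings of similar length that differ by minor spelling errors; extra text on either side lowers the score, as the first comparison below shows.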

from fuzzywuzzy import fuzz
s1 = "小米科技有限责任公司"
s2 = "小米科技有限责任公司 (北京)"  # redundant info
print(f"Simple ratio: {fuzz.ratio(s1, s2)}")  # 80
s3 = "小木科技有限责任公司"  # typo
print(f"Simple ratio: {fuzz.ratio(s1, s3)}")  # 90
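
The two scores above can be reproduced with the standard library. This is a sketch assuming the pure-Python fallback path, where the score is `2 * M / T` scaled to 0 to 100 (`M` is the number of matched characters, `T` the combined length of both strings); `simple_ratio` is a hypothetical helper name, not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """2 * M / T scaled to 0-100, rounded to the nearest integer."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


s1 = "小米科技有限责任公司"
s2 = "小米科技有限责任公司 (北京)"  # same name plus 5 extra characters
s3 = "小木科技有限责任公司"          # one substituted character

print(simple_ratio(s1, s2))  # 80: 2 * 10 matched / (10 + 15) characters
print(simple_ratio(s1, s3))  # 90: 2 * 9 matched / (10 + 10) characters
```

Note that a single typo costs less than a short suffix, because the suffix adds unmatched characters to the total length.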

fuzz.partial_ratio()

fuzz.partial_ratio() scores the shorter string against the best-matching substring of the longer one, which is useful when a short string is contained within a longer one.

s_long = "四川省 成都市 武侯区 天府大道中段 99号"
s_short = "天府大道中段"
print(f"Simple ratio: {fuzz.ratio(s_long, s_short)}")   # 43
print(f"Partial ratio: {fuzz.partial_ratio(s_long, s_short)}")  # 100
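
The idea behind the partial score can be sketched as a sliding window: compare the shorter string with every same-length slice of the longer one and keep the best result. This brute-force version is an illustration only; fuzzywuzzy selects candidate windows differently, so scores may differ in edge cases. The names `simple_ratio` and `sliding_partial_ratio` are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def sliding_partial_ratio(s1: str, s2: str) -> int:
    """Best score of the shorter string against any equal-length window of the longer."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0  # nothing to match against
    window = len(shorter)
    best = 0
    for start in range(len(longer) - window + 1):
        best = max(best, simple_ratio(shorter, longer[start:start + window]))
    return best


s_long = "四川省 成都市 武侯区 天府大道中段 99号"
s_short = "天府大道中段"

print(simple_ratio(s_long, s_short))           # 43
print(sliding_partial_ratio(s_long, s_short))  # 100
```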

fuzz.token_sort_ratio()

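fuzz.token_sort_ratio() splits each string into whitespace-separated tokens, sorts the tokens, and compares the rejoined strings, so differences in word order no longer affect the score. Because tokens are split on whitespace, Chinese text must already be segmented with spaces, as in the example below.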

s_a = "张三 李四 王五"  # original order
s_b = "王五 李四 张三"  # shuffled
print(f"Simple ratio: {fuzz.ratio(s_a, s_b)}")          # 50
print(f"Token sort ratio: {fuzz.token_sort_ratio(s_a, s_b)}")  # 100
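
A rough sketch of the sort-then-compare step, using only the standard library. It leaves out the preprocessing fuzzywuzzy applies first (such as lowercasing and stripping punctuation), and the helper names are hypothetical. The second call shows why unsegmented text gains nothing: without spaces, each string is a single token.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def sort_tokens(s: str) -> str:
    """Split on whitespace, sort the tokens, and rejoin with single spaces."""
    return " ".join(sorted(s.split()))


def sketch_token_sort_ratio(s1: str, s2: str) -> int:
    return simple_ratio(sort_tokens(s1), sort_tokens(s2))


# Space-separated names: sorting puts both strings into the same order.
print(sketch_token_sort_ratio("张三 李四 王五", "王五 李四 张三"))  # 100

# No spaces: each string is one token, so sorting changes nothing.
print(sketch_token_sort_ratio("张三李四王五", "王五李四张三"))      # 33
```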

Advanced Application: Standardization and Cleaning

fuzzywuzzy.process.extractOne() returns the best-scoring candidate from a list of clean reference values, together with its score. Comparing that score with a similarity threshold lets you replace dirty entries automatically or flag them for manual review.

from fuzzywuzzy import process

STANDARD_NAMES = [
    "华为技术有限公司",
    "阿里巴巴(中国)有限公司",
    "腾讯科技(深圳)有限公司",
    "字节跳动有限公司"
]

dirty_data = [
    "华微技术有限公斯",      # typo
    "阿里 巴巴(中国)",    # spaces
    "深圳腾迅科技",          # order + typo
    "字节跳動",              # traditional Chinese
    "网易公司"               # no match
]

def clean_company_name(name, choices, threshold=50):
    """Return the best match if its score >= threshold, otherwise flag the name for review."""
    # For a list of choices, extractOne returns a (match, score) tuple.
    best_match, score = process.extractOne(name, choices)
    if score >= threshold:
        return best_match
    # "【待审核】" means "pending review".
    return f"【待审核】{name}"

print("--- Batch cleaning results ---")
print(f"{'Original':<20} | {'Cleaned'}")
print("-" * 40)
for dirty_name in dirty_data:
    clean_name = clean_company_name(dirty_name, STANDARD_NAMES, threshold=80)
    print(f"{dirty_name:<20} | {clean_name}")

The script prints each original name next to its standardized form when the best match scores at least 80; otherwise the name is printed with the 【待审核】 (pending review) prefix so it can be checked manually.
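
Conceptually, best-match extraction scores the query against every candidate, keeps the highest score, and applies a cutoff. The sketch below shows that loop with a plain whole-string ratio from the standard library. `best_match` and `simple_ratio` are hypothetical names, and the scores will not match fuzzywuzzy's, whose extractOne() uses a weighted scorer by default and accepts `scorer` and `score_cutoff` arguments.

```python
from difflib import SequenceMatcher

STANDARD_NAMES = [
    "华为技术有限公司",
    "阿里巴巴(中国)有限公司",
    "腾讯科技(深圳)有限公司",
    "字节跳动有限公司",
]


def simple_ratio(s1: str, s2: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def best_match(query, choices, score_cutoff=0):
    """Return (choice, score) for the highest-scoring choice, or None below the cutoff."""
    scored = [(choice, simple_ratio(query, choice)) for choice in choices]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])  # first choice wins ties
    return best if best[1] >= score_cutoff else None


print(best_match("华微技术有限公斯", STANDARD_NAMES))               # ('华为技术有限公司', 75)
print(best_match("网易公司", STANDARD_NAMES, score_cutoff=80))  # None
```

Returning None below the cutoff, rather than a low-scoring guess, forces the caller to decide explicitly what to do with unmatched names.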

Conclusion

fuzzywuzzy provides a concise API to tackle fuzzy string matching challenges, enabling effective standardization and deduplication of messy textual data in Python data‑cleaning workflows.
