Master Fuzzy String Matching in Python with fuzzywuzzy: A Practical Guide
Learn how to efficiently clean and deduplicate textual data using Python's fuzzywuzzy library, covering Levenshtein distance fundamentals, installation, three core matching functions, advanced process extraction, and real-world code examples for handling messy Chinese strings and standardizing company names.
Core Principle: Levenshtein Distance
fuzzywuzzy is built on the Levenshtein (edit) distance, which counts the minimum single‑character insertions, deletions or substitutions needed to transform one string into another, and converts this distance into a similarity score from 0 to 100.
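To make the distance-to-score step concrete, here is a minimal pure-Python sketch of the edit-distance recurrence. The names `edit_distance` and `similarity` are hypothetical, and the normalization (substitutions weighted 2, divided by the combined length) is an assumption chosen because it reproduces the scores quoted in this article; the library's internals may differ between versions.

```python
def edit_distance(a: str, b: str, sub_cost: int = 1) -> int:
    """Minimum cost to turn a into b (insert/delete cost 1, substitute costs sub_cost)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                  # delete ca
                curr[j - 1] + 1,                              # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> int:
    """Hypothetical 0-100 score: weighted distance normalized by combined length."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * (total - edit_distance(a, b, sub_cost=2)) / total)


print(edit_distance("kitten", "sitting"))  # 3: two substitutions plus one insertion
print(similarity("小米科技有限责任公司", "小木科技有限责任公司"))  # 90
```

A single wrong character in a 10-character name costs 2 out of a combined length of 20, which is where the score of 90 comes from.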
Installation
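Install the library with pip. The python-levenshtein package is optional but speeds up scoring; without it, fuzzywuzzy falls back to the slower pure-Python difflib.SequenceMatcher and prints a warning.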
pip install fuzzywuzzy
pip install python-levenshtein
Three Matching Strategies
fuzz.ratio()
Simple ratio that compares two strings directly, suitable for minor spelling errors.
from fuzzywuzzy import fuzz
s1 = "小米科技有限责任公司"
s2 = "小米科技有限责任公司 (北京)" # redundant info
print(f"Simple ratio: {fuzz.ratio(s1, s2)}") # 80
s3 = "小木科技有限责任公司" # typo
print(f"Simple ratio: {fuzz.ratio(s1, s3)}") # 90
fuzz.partial_ratio()
Partial ratio focuses on the best matching substring, useful when a short string is contained within a longer one.
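One way to picture this is as a sliding window: take the shorter string, compare it against every same-length slice of the longer one, and keep the best score. The sketch below uses only the standard library's `difflib` as a stand-in scorer. `simple_ratio` and `partial_ratio_sketch` are hypothetical names, and fuzzywuzzy's own alignment logic is more refined than this brute-force loop.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity of the two full strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best score of the shorter string against any equal-length slice of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0  # nothing to align
    window = len(shorter)
    return max(
        simple_ratio(shorter, longer[i:i + window])
        for i in range(len(longer) - window + 1)
    )


address = "四川省 成都市 武侯区 天府大道中段 99号"
street = "天府大道中段"
print(simple_ratio(address, street))          # 43: the extra 16 characters drag the score down
print(partial_ratio_sketch(address, street))  # 100: one window matches exactly
```

The library call itself is a one-liner: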
s_long = "四川省 成都市 武侯区 天府大道中段 99号"
s_short = "天府大道中段"
print(f"Simple ratio: {fuzz.ratio(s_long, s_short)}") # 43
print(f"Partial ratio: {fuzz.partial_ratio(s_long, s_short)}") # 100
fuzz.token_sort_ratio()
Token sort ratio splits each string into whitespace-separated tokens, sorts them, and compares the rejoined strings, so differences in word order no longer lower the score.
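The idea fits in a few lines of standard-library Python. This is a simplified sketch under stated assumptions: `token_sort_ratio_sketch` is a hypothetical name, `difflib` stands in for the library's scorer, and the preprocessing fuzzywuzzy applies by default (such as lowercasing and stripping punctuation) is left out.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Sort whitespace-separated tokens, rejoin them, then score the two results."""
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


# Same three names in a different order: sorting makes the strings identical.
print(token_sort_ratio_sketch("张三 李四 王五", "王五 李四 张三"))  # 100
```

Because the tokens are compared only after sorting, this approach suits lists of names or keywords, but it would also hide a reordering that actually changes the meaning.

With fuzzywuzzy the comparison looks like this: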
s_a = "张三 李四 王五" # original order
s_b = "王五 李四 张三" # shuffled
print(f"Simple ratio: {fuzz.ratio(s_a, s_b)}") # 50
print(f"Token sort ratio: {fuzz.token_sort_ratio(s_a, s_b)}") # 100
Advanced Application: Standardization and Cleaning
Using fuzzywuzzy.process.extractOne(), you can match dirty data against a list of clean reference values and automatically replace or flag entries based on a similarity threshold.
from fuzzywuzzy import process
# Clean reference values to match against
STANDARD_NAMES = [
    "华为技术有限公司",
    "阿里巴巴(中国)有限公司",
    "腾讯科技(深圳)有限公司",
    "字节跳动有限公司",
]
# Noisy inputs to standardize
dirty_data = [
    "华微技术有限公斯",  # typo
    "阿里 巴巴(中国)",  # spaces
    "深圳腾迅科技",  # order + typo
    "字节跳動",  # traditional Chinese
    "网易公司",  # no match
]
def clean_company_name(name, choices, threshold=50):
    """Return the best match if score >= threshold, otherwise flag for review."""
    # For a list of choices, extractOne returns a (match, score) tuple
    best_match, score = process.extractOne(name, choices)
    if score >= threshold:
        return best_match
    # Below the threshold: prefix with 【待审核】 ("pending review")
    return f"【待审核】{name}"
print("--- Batch cleaning results ---")
print(f"{'Original':<20} | {'Cleaned'}")
print("-" * 40)
for dirty_name in dirty_data:
    clean_name = clean_company_name(dirty_name, STANDARD_NAMES, threshold=80)
    print(f"{dirty_name:<20} | {clean_name}")
The output shows how fuzzywuzzy maps noisy company names to their standardized forms and flags entries that score below the threshold for manual review.
Conclusion
fuzzywuzzy provides a concise API to tackle fuzzy string matching challenges, enabling effective standardization and deduplication of messy textual data in Python data‑cleaning workflows.
IT Services Circle
