How to Extract Complex Patterns with Python Regex and Pandas
This article walks through a Python regex problem: extracting strings that start with a specific Chinese prefix, followed by six digits in brackets, four more digits, and the character "号". It covers the original issue, the first code attempt, and the refined solution using pandas.
1. Introduction
Hello, I'm PiPi. Recently a member of a Python community asked how to extract data matching a specific pattern using Python, so I’m sharing the solution here.
2. Problem Description
The goal is to extract strings that begin with "闽移需【" followed by six consecutive digits, then "】", four consecutive digits, and the Chinese character "号" (e.g., "闽移需【202303】1311号"). The original attempt failed to capture the Chinese characters.
3. Initial Code
import pandas as pd
import re
# Create a sample DataFrame
df = pd.DataFrame({'A': ['闽移需【202303】1111号“中的”闽移需【202303】1111号“提取出来', 'bjyxszd-闽移需【202303】15', '闽移需【202303】1510号']})
pattern = r'闽移需\【(\d{6})\】(\d{4})号'
result = df['A'].str.extract(pattern)
Running this code returned only the digit groups, and the second row came back empty (NaN) because it has only two digits after the brackets and no "号", so the regex did not match.
4. Issue Analysis
str.extract returns only the contents of the capture groups, here the two digit runs inside the parentheses. Anything outside the groups, including the Chinese prefix and the "号" suffix, is matched but not returned. In addition, the pattern requires exactly four digits followed by "号", so rows with fewer digits do not match and come back as NaN.
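To make the difference between the captured groups and the full match concrete, here is a minimal sketch using only the standard re module. It reuses the pattern from the initial code; the variable names text and m are just for illustration.

```python
import re

# Same pattern as in the initial code: two capture groups, digits only
pattern = r'闽移需\【(\d{6})\】(\d{4})号'
text = '闽移需【202303】1111号'

m = re.search(pattern, text)
print(m.groups())  # only the two captured digit runs
print(m.group(0))  # the full matched text, prefix and suffix included

# Only two digits and no "号" after "】": the pattern does not match at all
print(re.search(pattern, 'bjyxszd-闽移需【202303】15'))
```

pandas' str.extract behaves like m.groups() here: it builds its output columns from the capture groups, which is why the prefix and suffix never appear in the result.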
5. Refined Solution
The regex was updated to capture the entire pattern, including the prefix and suffix, and to make the four‑digit part optional when necessary. The revised pattern (shown as an image in the original post, which is not reproduced here) extracts the full string, such as "闽移需【202303】1311号".
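Since the exact revised pattern is not available in text form, the following is only a sketch of what such a fix could look like, not the author's actual pattern. Both full_pattern and loose_pattern are assumptions: the first wraps the whole expression in one capture group, and the second additionally treats the "four digits + 号" tail as optional.

```python
import pandas as pd

df = pd.DataFrame({'A': [
    '闽移需【202303】1111号“中的”闽移需【202303】1111号“提取出来',
    'bjyxszd-闽移需【202303】15',
    '闽移需【202303】1510号',
]})

# Assumed pattern: one outer group wraps the prefix, both digit runs and "号",
# so str.extract returns the whole matched string instead of only the digits.
full_pattern = r'(闽移需【\d{6}】\d{4}号)'

# expand=False returns a Series because the pattern has a single capture group.
full = df['A'].str.extract(full_pattern, expand=False)

# Assumed variant: the trailing "four digits + 号" part is optional, so a row
# that stops early still yields the bracketed prefix.
loose_pattern = r'(闽移需【\d{6}】(?:\d{4}号)?)'
loose = df['A'].str.extract(loose_pattern, expand=False)

print(full.tolist())
print(loose.tolist())
```

With the strict pattern, the second row is still NaN because it never reaches four digits plus "号". With the loose variant, that row yields just "闽移需【202303】". Which behavior is right depends on whether partial entries should be kept.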
6. Conclusion
This article demonstrated how to troubleshoot and refine a Python regular expression for extracting complex, mixed‑language patterns using pandas. The final solution captures the full target string, addressing the initial shortcomings.
