How to Extract Complex Patterns with Python Regex and Pandas

This article walks through a Python regex problem: extracting strings that start with a specific Chinese prefix, followed by six digits in full-width brackets, four more digits, and the character "号". It covers the original issue, the initial code attempt, and the refined solution using pandas.

Python Crawling & Data Mining

1. Introduction

Hello, I'm PiPi. Recently a member of a Python community asked how to extract data matching a specific pattern using Python, so I’m sharing the solution here.

2. Problem Description

The goal is to extract strings that begin with "闽移需【" followed by six consecutive digits, then "】", four consecutive digits, and the Chinese character "号" (e.g., "闽移需【202303】1311号"). The original attempt returned only the digits and dropped the surrounding Chinese characters.
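To make the target format concrete, here is a minimal sketch using only the standard re module. The sample strings are taken from this article; the variable names are illustrative and not from the original post.

```python
import re

# Each piece of the description maps onto one piece of the pattern:
# prefix, six digits inside full-width brackets, four digits, then "号".
pattern = re.compile(r'闽移需【\d{6}】\d{4}号')

samples = [
    '闽移需【202303】1311号',      # fits the target format
    'bjyxszd-闽移需【202303】15',  # only two trailing digits and no "号"
]

for text in samples:
    match = pattern.search(text)
    # group(0) is the whole matched text, Chinese characters included
    print(text, '->', match.group(0) if match else None)
```

The full-width brackets "【" and "】" are ordinary characters to the regex engine, so they do not need to be escaped.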

3. Initial Code

import pandas as pd
import re

# Create a sample DataFrame
df = pd.DataFrame({'A': ['闽移需【202303】1111号“中的”闽移需【202303】1111号“提取出来', 'bjyxszd-闽移需【202303】15', '闽移需【202303】1510号']})

pattern = r'闽移需\【(\d{6})\】(\d{4})号'
result = df['A'].str.extract(pattern)

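Running this code returns a DataFrame with two columns, one per capture group, holding only the digits (for example "202303" and "1111" for the first row). The second row is NaN in both columns because its text has only two digits after the brackets and no "号", so the regex does not match.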

4. Issue Analysis

str.extract returns only what the capture groups (the parenthesized parts of the pattern) match. Anything outside the groups, including the Chinese prefix and the "号" suffix, is matched but not returned. Additionally, the pattern requires exactly four digits followed by "号", so rows with fewer digits yield NaN instead of a partial result.
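The behavior described above can be reproduced with a small sketch. The two-row DataFrame and the name groups_only are assumptions made for illustration; the pattern is the one from the initial code, without the unnecessary backslashes.

```python
import pandas as pd

# Two rows taken from the sample data in this article
df = pd.DataFrame({'A': [
    '闽移需【202303】1510号',
    'bjyxszd-闽移需【202303】15',
]})

# Two capture groups -> two output columns (0 and 1), digits only.
# The prefix and "号" are matched but sit outside the groups.
groups_only = df['A'].str.extract(r'闽移需【(\d{6})】(\d{4})号')
print(groups_only)

# The second row does not match, so both of its cells are NaN.
print(groups_only.isna().all(axis=1))
```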

5. Refined Solution

The regex was updated to capture the entire pattern, including the prefix and suffix, and to make the four-digit part optional when necessary. The revised pattern, which the original post showed as an image, extracts the full string, such as "闽移需【202303】1311号".
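Since the exact revised pattern is not reproduced in the text, the following is only a sketch of one way to get that result, not necessarily the author's pattern. It wraps the whole expression in a single capture group. The looser variant, which accepts one to four trailing digits and an optional "号", is an assumption about what "optional" might mean here.

```python
import pandas as pd

df = pd.DataFrame({'A': [
    '闽移需【202303】1111号“中的”闽移需【202303】1111号“提取出来',
    'bjyxszd-闽移需【202303】15',
    '闽移需【202303】1510号',
]})

# One outer group around the whole pattern, so extract() returns
# the full string instead of just the digits.
strict = r'(闽移需【\d{6}】\d{4}号)'
full = df['A'].str.extract(strict, expand=False)
print(full)

# Looser variant (an assumption, not from the original post):
# 1 to 4 trailing digits, and "号" may be absent.
loose = r'(闽移需【\d{6}】\d{1,4}号?)'
partial = df['A'].str.extract(loose, expand=False)
print(partial)
```

With the strict pattern the second row stays NaN; the looser one returns its partial text "闽移需【202303】15". Note that str.extract returns only the first match in each row, so the repeated occurrence in the first row is reported once. str.findall or str.extractall can collect every occurrence if that is needed.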

6. Conclusion

This article demonstrated how to troubleshoot and refine a Python regular expression for extracting complex, mixed‑language patterns using pandas. The final solution captures the full target string, addressing the initial shortcomings.

Written by

Python Crawling & Data Mining

