How to Split and Extract Numeric Data from Complex Excel Columns Using Pandas

This article walks through a real‑world Python data‑processing problem where a messy Excel column is parsed into separate numeric fields, presenting two Pandas solutions—one using regular expressions and another using string splitting—along with complete code examples and practical tips.

Python Crawling & Data Mining

Jul 16, 2024

1. Introduction

Hello, I am a Python enthusiast. Recently, a member of a Python community asked why a piece of code that extracts numbers from a column sometimes fails.

2. Solution Approach

A mentor suggested a compact method based on regular expressions, though the code is dense at first glance. Another participant offered an alternative that avoids regular expressions and splits the strings step by step instead.

Solution 1: Using Regular Expressions

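import re

import pandas as pd

test = pd.read_excel("测试数据.xlsx")

# Every column except the raw '费用明细' text is a fee label to fill in.
extract_cols = test.columns.drop('费用明细')
for c in extract_cols:
    # Match the label, lazily skip what follows, and capture the first number.
    # re.escape keeps labels containing regex metacharacters from breaking the pattern.
    pattern = fr'{re.escape(str(c))}.*?(\d+\.?\d*)'
    test[c] = test['费用明细'].str.extract(pattern, expand=False).astype('float64')

test.to_excel("测试数据-结果.xlsx", index=False)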

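The code above reads the Excel file and treats every column other than "费用明细" as a fee label to fill in. For each label, str.extract applies a regular expression built from the column name: it matches the label, lazily skips the characters that follow, and captures the first integer or decimal number. The captured text is converted to float64 and written into that label's column; rows where the label does not appear are left as NaN. Finally, the result is saved to a new workbook.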
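To make the regex approach easier to try without the workbook, here is a minimal sketch on an in-memory DataFrame. The sample rows, the labels 里程费 and 时长费, and the helper name extract_fees are all assumptions for illustration; the article does not show the actual contents of 测试数据.xlsx.

```python
import re

import pandas as pd

# Hypothetical sample data: the real layout of the workbook is not shown,
# so these rows and the labels 里程费 / 时长费 are invented for illustration.
sample = pd.DataFrame({
    "费用明细": [
        "里程费 12.5元,时长费 3元",
        "里程费 8元,平台加价 2.4元",
    ],
    "里程费": [None, None],
    "时长费": [None, None],
    "平台加价": [None, None],
})


def extract_fees(df, source_col="费用明细"):
    """Fill every non-source column with the first number found after its label."""
    out = df.copy()
    for label in out.columns.drop(source_col):
        # Label, then a lazy skip, then the first integer or decimal number.
        pattern = fr"{re.escape(label)}.*?(\d+\.?\d*)"
        out[label] = (
            out[source_col].str.extract(pattern, expand=False).astype("float64")
        )
    return out


result = extract_fees(sample)
print(result[["里程费", "时长费", "平台加价"]])
```

One thing to watch with this pattern: the lazy skip stops at the first number after the label, so if a label is followed by other digits before its amount, that earlier number is what gets captured.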

Solution 2: Splitting Without Regular Expressions

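import pandas as pd

test = pd.read_excel("测试数据.xlsx")
extract_cols = test.columns.drop('费用明细')

# One row per comma-separated "label amount" item.
test['费用明细-c'] = test['费用明细'].str.split(',')
test = test.explode('费用明细-c')

# Split each item on the space: the label stays in -c, the amount goes to -d.
test[['费用明细-c', '费用明细-d']] = test['费用明细-c'].str.split(' ', expand=True)
test['费用明细-d'] = test['费用明细-d'].str.strip('元').astype('float64')
# Collapse label variants containing '平台加价'; na=False keeps the mask boolean when items are missing.
test.loc[test['费用明细-c'].str.contains('平台加价', na=False), '费用明细-c'] = '平台加价'

# Keep only items that have an amount, then turn each group's labels back into columns.
test = test[test['费用明细-d'].notna()]
testc = test.groupby('费用明细', sort=False)[['费用明细-c', '费用明细-d']].apply(lambda x: x.set_index('费用明细-c').T).reset_index(level=-1, drop=True)

# Restore the original column order and write the result.
testc.reindex(columns=extract_cols).reset_index().to_excel("测试数据-结果.xlsx", index=False)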

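This method splits the original column on commas, explodes the resulting lists into one row per item, and then splits each item on a space to separate the label from the amount. It strips the trailing "元" from the amount and converts it to float64, normalizes any label containing "平台加价" to exactly "平台加价", and drops items that have no amount. Finally, it groups the rows by the original text, turns each group's labels back into columns, and reorders those columns to match the original column order.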
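The same split-and-reshape idea can be sketched on a small in-memory DataFrame. The sample rows and the labels list are hypothetical, and the reshape uses pivot as a swapped-in alternative to the groupby/apply/transpose step in the code above, so treat this as one possible variant rather than the participant's exact method.

```python
import pandas as pd

# Hypothetical sample rows: the ',' and ' ' separators and the '元' suffix
# follow what the article's code assumes; the values themselves are invented.
sample = pd.DataFrame({
    "费用明细": [
        "里程费 12.5元,时长费 3元",
        "里程费 8元,平台加价 2.4元",
    ]
})
labels = ["里程费", "时长费", "平台加价"]  # hypothetical target columns

# One row per "label amount" item, remembering which source row it came from.
items = (
    sample["费用明细"]
    .str.split(",")
    .explode()
    .rename("item")
    .reset_index()  # the original row index becomes a column named "index"
)

# Split each item once on the first space: label on the left, amount on the right.
parts = items["item"].str.split(" ", n=1, expand=True)
items["label"] = parts[0]
items["amount"] = parts[1].str.strip("元").astype("float64")

# Swapped-in reshape: pivot replaces the groupby/apply/transpose step.
wide = (
    items.pivot(index="index", columns="label", values="amount")
    .reindex(columns=labels)
)
result = sample.join(wide)
print(result)
```

Keying the pivot on the original row index rather than on the fee text means two rows with identical text stay separate. It does assume each label appears at most once per row, since pivot raises an error on duplicate index/column pairs.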

Both solutions successfully resolved the community member's issue.

If you encounter similar Python data‑processing questions, feel free to join the discussion group for help.

3. Summary

The post presented a Python data‑handling problem, offered two concrete Pandas implementations—one regex‑based and one split‑based—and demonstrated how to clean and restructure Excel data efficiently.
