Extract Numbers from Pandas Series Strings and Compute Averages with Regex

This tutorial demonstrates how to use regular expressions and pandas string methods to extract numeric values from Series columns, calculate average years of experience, and convert salary strings into numeric monthly figures for further analysis.

Python Crawling & Data Mining

Jun 7, 2022

Data Requirement

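Extract numeric values from strings that follow a semi‑uniform pattern, then compute with them: average the years of experience in '年区间' (experience range) and convert the salary ranges in '薪资' (salary) to numbers. The sample values mix single numbers, hyphenated ranges, stray whitespace, and entries with no number at all, such as '无要求' (no requirement).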

import pandas as pd

df = pd.DataFrame({
    '年区间': ['1年以内', '无要求', '1-3年', '  3-5年  ', '5年以上'],
    '薪资': ['1.3-1.5万/月', '6-8千/月', '1.3万/月', '20-30万/年', ' 30-50万/年  ']
})

Solution Using re

The regular expression r'(\d+)?-?(\d+)' captures an optional first number, an optional hyphen, and a required second number, so it matches both a range such as '1-3' and a single number such as '5'.

import re

pattern = re.compile(r'(\d+)?-?(\d+)')
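To see what the pattern captures, the following sketch runs it over the sample strings with the standard re module. The extra value '10年' is a made-up input, not part of the original data, included to show how the pattern backtracks on a multi-digit number without a hyphen.

```python
import re

pattern = re.compile(r'(\d+)?-?(\d+)')

# Sample strings from the tutorial, plus '10年' (hypothetical, added for illustration).
samples = ['1年以内', '无要求', '1-3年', '  3-5年  ', '5年以上', '10年']

captured = {}
for text in samples:
    match = pattern.search(text)
    # groups() returns one entry per capture group; unmatched groups are None.
    captured[text] = match.groups() if match else None
    print(repr(text), '->', captured[text])
```

A single-digit number ends up in the second group with the first group empty, while '10' is split into ('1', '0') because the second group must match at least one digit. The pattern is therefore only safe for data shaped like the sample.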

Define a helper that averages the captured numbers and rounds the result to a whole number, returning 0 when the string contains no digits (as in '无要求').

def year_average(data):
    match = pattern.search(data)
    def average(nums):
        nums = tuple(nums)
        return round(sum(nums) / len(nums), 0)
    if match:
        return average([int(x) for x in match.groups() if x])
    return 0

df['平均年数'] = df['年区间'].apply(year_average)

The resulting DataFrame shows the calculated average years for each row.
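As a possible variant, the sketch below swaps in a stricter pattern that requires the first number and makes the hyphenated second number optional, which keeps multi-digit values intact. The names strict_pattern and year_average_strict, and the extra input '10年', are assumptions made for illustration and do not come from the original article.

```python
import re
import pandas as pd

# Stricter pattern (an assumption, not the tutorial's): first number required,
# "-second" optional, so "10" is never split into "1" and "0".
strict_pattern = re.compile(r'(\d+)(?:-(\d+))?')

def year_average_strict(text):
    """Return the rounded mean of the numbers in text, or 0 if there are none."""
    match = strict_pattern.search(text)
    if not match:
        return 0
    nums = [int(x) for x in match.groups() if x is not None]
    return round(sum(nums) / len(nums))

years = pd.Series(['1年以内', '无要求', '1-3年', '  3-5年  ', '5年以上', '10年'])
averages = years.apply(year_average_strict)
print(averages.tolist())  # [1, 0, 2, 4, 5, 10]
```

Python's round sends exact halves to the nearest even integer, so a range such as '2-3' averages to 2 rather than 3.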

Solution Using .str.extract

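pandas also offers .str.extract, which applies a regular expression to every element and returns a DataFrame with one column per capture group; groups that do not match become NaN.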

df_dash = df['年区间'].str.extract(r'(\d+)?-?(\d+)')

Convert the extracted strings to floats, take the row‑wise mean (which skips NaN), fill rows that had no match with 0, and round.

df['平均年数2'] = df_dash.astype(float).mean(axis=1).fillna(0).round(0)
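The same steps can be sketched with named capture groups, which .str.extract turns into column names. The pattern here is a stricter variant and the group names low and high are illustrative choices, not part of the original article.

```python
import pandas as pd

years = pd.Series(['1年以内', '无要求', '1-3年', '  3-5年  ', '5年以上'])

# Named groups become column names; "low" and "high" are illustrative names.
bounds = years.str.extract(r'(?P<low>\d+)(?:-(?P<high>\d+))?').astype(float)

# mean(axis=1) skips NaN, so a single number is its own average;
# rows with no digits at all stay NaN until fillna(0).
average_years = bounds.mean(axis=1).fillna(0).round(0)
print(average_years.tolist())  # [1.0, 0.0, 2.0, 4.0, 5.0]
```

Inspecting bounds before averaging makes it easy to confirm which rows matched one number, two numbers, or none.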

Extracting Salary Numbers

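Salary strings contain Chinese magnitude units (万 = 10,000, 千 = 1,000) and pay periods (月 = month, 年 = year). Replace each magnitude unit with '@' followed by its multiplier and each pay period with its length in months (1 or 12), so that '20-30万/年' becomes '20-30@10000/12'.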

df_dash = df['薪资'].replace({
    '万': '@10000',
    '千': '@1000',
    '月': '1',
    '年': '12'
}, regex=True)
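A small self-contained sketch of this replacement step, printing the encoded strings so the intermediate form is visible. The variable names salary and encoded are assumptions chosen for the example.

```python
import pandas as pd

salary = pd.Series(['1.3-1.5万/月', '6-8千/月', '1.3万/月', '20-30万/年', ' 30-50万/年  '])

# With regex=True each key is treated as a pattern and replaced wherever it occurs.
# '@' is only a separator between the amount and its unit multiplier.
encoded = salary.replace({'万': '@10000', '千': '@1000', '月': '1', '年': '12'}, regex=True)

# strip() is applied here only to make the printed result easier to read.
print(encoded.str.strip().tolist())
```

The surrounding whitespace survives the replacement, which is harmless here because the later extraction searches within the string rather than matching it from the start.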

Use a longer regex with four capture groups: the lower bound, the optional upper bound (both may be decimals), the unit multiplier after '@', and the number of months after '/'. Keeping the pattern in one variable avoids writing it twice.

salary_pattern = r'(\d+\.?\d*)?-?(\d+\.?\d*)?@(\d+)/(\d+)'

df_salary = df_dash.str.extract(salary_pattern).astype(float)
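To check how the four groups line up, this sketch applies the same pattern to a few already-encoded strings using the re module directly. The list encoded is hand-written for the example and assumes the unit replacement has already been done.

```python
import re

salary_re = re.compile(r'(\d+\.?\d*)?-?(\d+\.?\d*)?@(\d+)/(\d+)')

# Strings as they look after the unit replacement step.
encoded = ['1.3-1.5@10000/1', '1.3@10000/1', '20-30@10000/12']

parts = [salary_re.search(text).groups() for text in encoded]
for text, groups in zip(encoded, parts):
    # groups: (low, high, unit multiplier, months per pay period)
    print(text, '->', groups)
```

A single-value salary fills the first group and leaves the second as None, which .str.extract reports as NaN.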

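Convert the extracted components to monthly salary by multiplying each bound by the unit multiplier and dividing by the number of months in the pay period. The result has one column for the lower bound and one for the upper bound: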

monthly = df_salary.apply(lambda x: (x[[0,1]] * x[2]) / x[3], axis=1)
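The row-wise apply can also be written with vectorized column arithmetic. The sketch below is one way to do that under the same encoding; the column names low, high, and average, and the final midpoint step, are additions for illustration rather than part of the original article.

```python
import pandas as pd

salary = pd.Series(['1.3-1.5万/月', '6-8千/月', '1.3万/月', '20-30万/年', ' 30-50万/年  '])
encoded = salary.replace({'万': '@10000', '千': '@1000', '月': '1', '年': '12'}, regex=True)
parts = encoded.str.extract(r'(\d+\.?\d*)?-?(\d+\.?\d*)?@(\d+)/(\d+)').astype(float)

# Scale both bounds by the unit multiplier, then divide by months per period.
# Column names "low" and "high" are illustrative, not from the tutorial.
monthly = parts[[0, 1]].mul(parts[2], axis=0).div(parts[3], axis=0)
monthly.columns = ['low', 'high']

# Midpoint of each range; mean(axis=1) skips the NaN of single-value rows.
monthly['average'] = monthly[['low', 'high']].mean(axis=1).round(0)
print(monthly)
```

Operating on whole columns avoids calling a Python function per row, which tends to matter once the data grows beyond a handful of rows.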

Summary

This article showed two ways to pull numbers out of string columns: applying a compiled re pattern row by row with Series.apply, and using pandas' built‑in .str.extract. Both were used to average years of experience, and .str.extract was then combined with unit replacement to turn salary ranges into numeric monthly values. In each case the result depends on how carefully the regular expression is designed.

Picking mulberries to offer to spring, moving clouds to shade the sun.

Written on April 18, 2022
