Extract Work Experience Years from CSV Using Python Regex: 4 Practical Methods

This article addresses a common data-cleaning task: extracting numeric work-experience values from a CSV column. It presents four Python solutions, covering custom functions, regex searches, and pandas string methods, each with a code snippet for easy replication.

Python Crawling & Data Mining

Apr 19, 2022

1. Introduction

In a recent Python community question, a user asked how to extract numeric work experience years from a CSV column for later regression analysis.

2. Solution Process

Four methods are presented, two contributed by the author and two by another contributor.
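
The article does not include the CSV itself. To experiment with the methods, a small stand-in DataFrame can be built by hand. The sample strings here are assumptions chosen to match the patterns the functions test for; only the column name 工作经验 comes from the original code.

```python
import pandas as pd

# Hypothetical sample rows; the real CSV is not shown in the article.
sample_values = [
    '无需经验',        # no experience required
    '在校生/应届生',   # current student / new graduate
    '1年经验',         # 1 year of experience
    '3-4年经验',       # 3 to 4 years of experience
    '10年以上经验',    # more than 10 years of experience
]
df = pd.DataFrame({'工作经验': sample_values})

# With a real file, pd.read_csv would produce the same kind of frame.
print(df['工作经验'].tolist())
```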

Method 1

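A custom function that combines conditional checks with regular expressions. Entries meaning no experience (无需经验, 在校生/应届生) map to 0, a hyphenated range is replaced by the rounded average of its two bounds, and a single value keeps its leading number.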

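import re

def work_year(y):
    y = y.strip()
    # '无需经验' means "no experience required"
    if y == '无需经验':
        return 0
    # '在校生/应届生' means "current student / new graduate"
    elif y == '在校生/应届生':
        return 0
    # Range such as '3-4年经验' ("3-4 years of experience"): average the two bounds
    elif '-' in y and '年经验' in y: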
        # Collect every number in the string; the first two are the range bounds
        numbers = re.findall(r'\d*\.?\d+', y)
        low_experience, high_experience = numbers[0], numbers[1]
        s = round((float(low_experience) + float(high_experience)) / 2, 0)
        return s
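    # Single value such as '1年经验' or '10年以上经验' ("more than 10 years"): take the leading number
    elif '年经验' in y or '年以上经验' in y:
        year = re.findall(re.compile(r'^(\d+)'), y)[0]
        # Convert to int so the column stays numeric for later analysis
        return int(year)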
    else:
        return y

df['new'] = df['工作经验'].apply(work_year)
df.head()

Result shown in the following image:
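
The range branch is the least obvious part of the function, so here is a minimal standalone sketch of that arithmetic. The helper name range_midpoint and the sample strings are hypothetical; the number pattern `\d*\.?\d+` is the one the article's functions use.

```python
import re

def range_midpoint(text):
    """Return the rounded midpoint of the first two numbers in text.

    Hypothetical helper for illustration; it mirrors the range branch only.
    """
    numbers = re.findall(r'\d*\.?\d+', text)
    low, high = float(numbers[0]), float(numbers[1])
    return round((low + high) / 2, 0)

print(range_midpoint('3-4年经验'))   # midpoint 3.5 -> 4.0
print(range_midpoint('5-7年经验'))   # midpoint 6.0 -> 6.0
print(range_midpoint('8-9年经验'))   # midpoint 8.5 -> 8.0
```

Python's built-in round sends exact halves to the nearest even number, which is why 3.5 becomes 4.0 while 8.5 becomes 8.0. If that asymmetry matters for the regression input, one option is to keep the midpoint unrounded.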

Method 2

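The logic is similar, with two differences: any entry containing a hyphen is treated as a range, and any entry that starts with a digit is treated as a single value. Whitespace is stripped with str.strip() before the function is applied.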

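def work_year(y):
    # The caller strips whitespace before applying this function
    if y == '无需经验':
        return 0
    elif y == '在校生/应届生':
        return 0
    elif '-' in y:
        # Any hyphenated entry is treated as a range; average the first two numbers
        numbers = re.findall(r'\d*\.?\d+', y)
        low_experience, high_experience = numbers[0], numbers[1]
        s = round((float(low_experience) + float(high_experience)) / 2, 0)
        return s
    elif y[:1].isnumeric():
        # Entry starts with a digit: keep the leading integer
        # (slicing avoids an IndexError on an empty string)
        year = re.findall(re.compile(r'^(\d+)'), y)[0]
        return int(year)
    else:
        # Anything else is returned unchanged
        return y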

df['col1'] = df['工作经验'].str.strip().apply(work_year)
df
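
Both functions return the original string when no branch matches, so the new column can end up holding a mix of numbers and text. One possible follow-up, sketched here with made-up values, is pandas.to_numeric with errors='coerce', which turns anything non-numeric into NaN. The Series below is a hypothetical stand-in for df['col1'], and the string '经验不限' is an invented example of an unparsed entry.

```python
import pandas as pd

# Hypothetical output of work_year: numbers plus one unparsed string.
col1 = pd.Series([0, 4.0, 1, '经验不限'], dtype=object)

# Non-numeric entries become NaN; everything else becomes float.
numeric = pd.to_numeric(col1, errors='coerce')

print(numeric.dtype)          # float64
print(numeric.isna().sum())   # 1
```

Whether NaN rows should then be dropped or filled depends on the analysis; the article does not say which the original author chose.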

Method 3

Uses a single regular expression to capture one number, optionally followed by a hyphen and a second number, and computes their average.

def work_year(y):
    # The first number is required and the second is optional, so a
    # multi-digit value such as '10年以上经验' is captured whole as 10
    search_year = re.search(r'(\d+)(?:-(\d+))?', y)

    def average(args):
        x = tuple(args)
        length = len(x)
        return round(sum(x) / length, 0)

    if search_year:
        # Drop the unmatched (None) group before averaging
        return average([int(i) for i in search_year.groups() if i])
    else:
        # No digits at all, e.g. '无需经验'
        return 0

df['new1'] = df['工作经验'].apply(work_year)

Method 4

Leverages pandas str.extract with the same regex and computes the row-wise mean directly.

df['new2'] = df['工作经验'].str.extract(r'(\d+)(?:-(\d+))?').astype(float).mean(axis=1).fillna(0).round(0)
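
To see what the one-liner does step by step, the chain can be split into its intermediate results. This is a sketch on hypothetical sample strings, and it assumes a pattern in which the first number is required and the second is optional.

```python
import pandas as pd

# Hypothetical sample values for the 工作经验 column.
s = pd.Series(['无需经验', '1年经验', '3-4年经验', '10年以上经验'])

# One column per capture group; unmatched groups and rows become NaN.
parts = s.str.extract(r'(\d+)(?:-(\d+))?').astype(float)
print(parts)

# mean(axis=1) skips NaN, so a single number is its own mean;
# rows with no number at all stay NaN until fillna(0).
years = parts.mean(axis=1).fillna(0).round(0)
print(years.tolist())   # [0.0, 1.0, 4.0, 10.0]
```

Splitting the chain this way makes it easy to inspect the two extracted columns before they are averaged, which helps when a value in the real data does not parse as expected.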

3. Summary

The four approaches demonstrate practical ways to parse work experience strings in a CSV file, allowing the extracted numeric values to be used for further analysis such as multiple regression.
