Extract Work Experience Years from CSV Using Python Regex: 4 Practical Methods
This article addresses a common data cleaning challenge by demonstrating four Python-based solutions—including custom functions, regex searches, and pandas string methods—to extract numeric work experience values from CSV entries, complete with code snippets and visual results for easy replication.
1. Introduction
In a recent Python community question, a user asked how to extract numeric work experience years from a CSV column for later regression analysis.
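For readers who want to follow along without the original file, a small stand-in DataFrame can be built by hand. The column name 工作经验 (work experience) is taken from the code in this article; the sample values are assumptions inferred from the strings that code checks for, not rows from the asker's CSV, and the file name in the comment is a placeholder.

```python
import pandas as pd

# Hypothetical sample values; the real CSV from the question is not available.
sample_values = [
    "无需经验",       # no experience required
    "在校生/应届生",  # current student / new graduate
    "1年经验",        # 1 year of experience
    "3-4年经验",      # 3 to 4 years of experience
    "5-7年经验",      # 5 to 7 years of experience
    "10年以上经验",   # more than 10 years of experience
]

# With a real file this would be something like pd.read_csv("jobs.csv"),
# where "jobs.csv" is a placeholder name.
df = pd.DataFrame({"工作经验": sample_values})
print(df)
```

Any of the methods in this article can then be tried against this df.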
2. Solution Process
Four methods are presented, two contributed by the author and two by another contributor.
Method 1
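A custom function combines conditional checks with regular expressions to handle each format: '无需经验' (no experience required) and '在校生/应届生' (current student or new graduate) map to 0, a hyphenated range ending in '年经验' (years of experience) maps to the rounded midpoint of its two numbers, and a single value ending in '年经验' or '年以上经验' (more than N years of experience) maps to its leading number.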
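import re

def work_year(y):
    y = y.strip()
    if y == '无需经验':  # "no experience required"
        return 0
    elif y == '在校生/应届生':  # "current student / new graduate"
        return 0
    elif '-' in y and '年经验' in y:  # a range of years: use the rounded midpoint
        low_experience = re.findall(r'(\d*\.?\d+)', y)[0]
        high_experience = re.findall(r'(\d?\.?\d+)', y)[1]
        s = round((float(low_experience) + float(high_experience)) / 2, 0)
        return s
    elif '年经验' in y or '年以上经验' in y:  # "N years" or "more than N years"
        year = re.findall(r'^(\d+)', y)[0]
        return int(year)  # convert so the column holds numbers, not strings
    else:
        return y  # unrecognized format: leave the value unchanged

df['new'] = df['工作经验'].apply(work_year)  # '工作经验' is the work experience column
df.head()
The original post shows the resulting DataFrame as an image here.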
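The range branch depends on re.findall returning both numbers in order. A minimal check of that step in isolation, using made-up input strings and a hypothetical helper named midpoint:

```python
import re

# Number pattern from Method 1: optional integer part, optional dot, then digits.
NUMBER = re.compile(r"\d*\.?\d+")

def midpoint(text):
    """Return the rounded midpoint of the first two numbers found in text."""
    low, high = NUMBER.findall(text)[:2]
    return round((float(low) + float(high)) / 2, 0)

print(NUMBER.findall("5-7年经验"))  # ['5', '7']
print(midpoint("5-7年经验"))        # 6.0
print(midpoint("2-3年经验"))        # 2.0
```

Python's round uses round-half-to-even, so a midpoint of 2.5 becomes 2.0 while 3.5 becomes 4.0. If conventional half-up rounding matters for the analysis, that step would need a different rounding rule.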
Method 2
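The logic is similar, with two variations: the range branch checks only for a hyphen, and the single-value branch checks whether the string starts with a digit. Whitespace is stripped with .str.strip() before the function is applied.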
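import re

def work_year(y):
    if y == '无需经验':  # "no experience required"
        return 0
    elif y == '在校生/应届生':  # "current student / new graduate"
        return 0
    elif '-' in y:  # a range of years: use the rounded midpoint
        low_experience = re.findall(r'(\d*\.?\d+)', y)[0]
        high_experience = re.findall(r'(\d?\.?\d+)', y)[1]
        s = round((float(low_experience) + float(high_experience)) / 2, 0)
        return s
    elif y[0].isnumeric():  # starts with a digit: take the leading number
        year = re.findall(r'^(\d+)', y)[0]
        return int(year)  # convert so the column holds numbers, not strings
    else:
        return y  # unrecognized format: leave the value unchanged

# Strip surrounding whitespace before applying the function
df['col1'] = df['工作经验'].str.strip().apply(work_year)
df
Method 3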
Uses a single regular expression to capture one or two numbers and computes their average.
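One detail worth checking with a pattern of this kind is which of the two groups is optional. The sketch below compares two candidate patterns on made-up inputs; the helper name captured and the sample strings are assumptions for illustration only.

```python
import re

# Two candidate patterns for "one number, or two numbers joined by a hyphen".
FIRST_OPTIONAL = re.compile(r"(\d+)?-?(\d+)")
SECOND_OPTIONAL = re.compile(r"(\d+)(?:-(\d+))?")

def captured(pattern, text):
    """Return the captured numbers as ints, skipping groups that did not match."""
    match = pattern.search(text)
    if match is None:
        return []
    return [int(g) for g in match.groups() if g]

print(captured(FIRST_OPTIONAL, "3-4年经验"))      # [3, 4]
print(captured(SECOND_OPTIONAL, "3-4年经验"))     # [3, 4]
print(captured(FIRST_OPTIONAL, "10年以上经验"))   # [1, 0]
print(captured(SECOND_OPTIONAL, "10年以上经验"))  # [10]
```

When the first group is optional, backtracking splits a multi-digit single value such as 10 into 1 and 0 so that the mandatory second group still has a digit to match, and the average comes out wrong. Making the second group optional avoids this.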
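import re

def work_year(y):
    # One number, optionally followed by a hyphen and a second number.
    # Making the second group optional keeps a value such as '10' in one piece.
    search_year = re.search(r'(\d+)(?:-(\d+))?', y)

    def average(args):
        x = tuple(args)
        length = len(x)
        return round(sum(x) / length, 0)

    if search_year:
        return average([int(i) for i in search_year.groups() if i])
    else:
        return 0  # no digits found, e.g. '无需经验' ("no experience required")

df['new1'] = df['工作经验'].apply(work_year)
Method 4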
Leverages pandas str.extract with the same kind of one-or-two-number regex and computes the row-wise mean of the captured groups directly.
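To make the intermediate result visible, this sketch runs str.extract on a few made-up values and prints the captured columns before averaging. The sample strings and the pattern with an optional second number are assumptions of this sketch.

```python
import pandas as pd

# Hypothetical sample values standing in for the CSV column.
s = pd.Series(["无需经验", "1年经验", "5-7年经验", "10年以上经验"])

# One capture group per number; the second group is optional.
parts = s.str.extract(r"(\d+)(?:-(\d+))?").astype(float)
print(parts)  # two columns (0 and 1); NaN where a group did not match

# The row-wise mean skips NaN; rows with no number at all become 0.
years = parts.mean(axis=1).fillna(0).round(0)
print(years.tolist())  # [0.0, 1.0, 6.0, 10.0]
```

Because str.extract returns one column per capture group, no Python-level function call per row is needed, which is the main appeal of this approach over apply.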
# Capture one or two numbers per row, average them, and use 0 where no number was found
df['new2'] = df['工作经验'].str.extract(r'(\d+)(?:-(\d+))?').astype(float).mean(axis=1).fillna(0).round(0)
3. Summary
The four approaches demonstrate practical ways to parse work experience strings in a CSV file, allowing the extracted numeric values to be used for further analysis such as multiple regression.
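Before passing the new column to a model, it may be worth confirming that every value actually parsed, since a fallback branch that returns the input unchanged can leave strings behind. A minimal sketch, assuming an output column named new; the sample values, including the unhandled string, are made up.

```python
import pandas as pd

# Hypothetical output column: numbers plus one string that no branch handled.
# "经验不限" means "any level of experience".
df = pd.DataFrame({"new": [0, 2.0, 6.0, "经验不限"]})

# Coerce to numeric; anything that is not a number becomes NaN.
numeric = pd.to_numeric(df["new"], errors="coerce")

# Rows that failed to parse, collected for manual inspection.
unparsed = df.loc[numeric.isna(), "new"]
print(unparsed.tolist())  # ['经验不限']
print(numeric.dtype)      # float64
```

How to treat the unparsed rows (drop them, map them to 0, or add a new branch) depends on the data and is left open here.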