Detailed Job Description Extraction and Data Cleaning with Python and MongoDB
This article shows how to scrape job descriptions and work addresses from the detail pages of online job portals using the Python libraries requests, BeautifulSoup4, and pymongo. It then cleans and normalizes the records stored in MongoDB: publish dates, salaries, and work-experience levels.
1. Extract Detailed Job Description Information
Detail Page Analysis
In the detail page, the most important fields are 职位描述 (job description) and 工作地址 (work address). However, the HTML places 岗位职责 (responsibilities) and 任职要求 (requirements) in the same div, making it difficult to separate them during scraping.
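When the two headings appear as plain text inside the shared div, one way to separate them after scraping is to split the extracted text on the heading labels. The following is a minimal sketch under that assumption; `split_detail` and the `HEADINGS` table are hypothetical names, and a posting that uses other heading wording ends up entirely under the `other` key.

```python
import re

# Assumed heading labels; real postings may word them differently or omit them.
HEADINGS = {"岗位职责": "responsibilities", "任职要求": "requirements"}
_PATTERN = re.compile("(" + "|".join(map(re.escape, HEADINGS)) + ")\\s*[:：]?")


def split_detail(text):
    """Split a combined job description into sections keyed by heading.

    Text before the first recognized heading is returned under "other".
    """
    # re.split with one capturing group yields:
    # [before, heading1, body1, heading2, body2, ...]
    parts = _PATTERN.split(text)
    sections = {"other": parts[0].strip()}
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections[HEADINGS[heading]] = body.strip()
    return sections
```

Applied to the `detail` string saved by the crawler, this returns a dictionary such as `{"other": "", "responsibilities": "...", "requirements": "..."}`.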
Libraries Used for Crawling
Used libraries:
requests
BeautifulSoup4
pymongo
Python Code
# -*- coding: utf-8 -*-
"""
@author: jtahstu
@contact: [email protected]
@site: http://www.jtahstu.com
"""
import time

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

# Request headers copied from a browser session; the cookie value is elided.
headers = {
    'x-devtools-emulate-network-conditions-client-id': "5f2fc4da-c727-43c0-aad4-37fce8e3ff39",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    'dnt': "1",
    'accept-encoding': "gzip, deflate",
    'accept-language': "zh-CN,zh;q=0.8,en;q=0.6",
    'cookie': "__c=1501326829; ...",
    'cache-control': "no-cache",
    'postman-token': "76554687-c4df-0c17-7cc0-5bf3845c9831"
}

conn = MongoClient('127.0.0.1', 27017)
db = conn.iApp  # Use the iApp database; MongoDB creates it on first write


def init():
    # Load all documents up front: with a 40-second pause per item, a live
    # server-side cursor could time out partway through the crawl.
    items = list(db.jobs_php.find().sort('pid'))
    for item in items:
        if 'detail' in item:
            continue  # already crawled
        detail_url = "https://www.zhipin.com/job_detail/%s.html?ka=search_list_1" % item['pid']
        print(detail_url)
        html = requests.get(detail_url, headers=headers)
        if html.status_code != 200:
            # Stop at the first non-200 response (the crawler may be blocked)
            print('status_code is %d' % html.status_code)
            break
        soup = BeautifulSoup(html.text, "html.parser")
        job = soup.select('.job-sec .text')
        if len(job) < 1:
            continue
        item['detail'] = job[0].text.strip()
        location = soup.select('.job-sec .job-location')
        if location:  # some pages may lack a location block
            item['location'] = location[0].text.strip()
        item['updated_at'] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        res = save(item)
        print(res)
        time.sleep(40)  # pause between requests


def save(item):
    return db.jobs_php.update_one({"_id": item['_id']}, {"$set": item})


if __name__ == "__main__":
    init()
The code is straightforward and easy for beginners to understand.
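The `save` function writes the whole document back with `$set`. One possible refinement is to send only the fields the crawler actually changed. The helper below is a sketch, not part of the original script; `build_update` is a hypothetical name, and it only builds the two arguments that `update_one` expects.

```python
def build_update(item, changed_fields):
    """Return (filter, update) arguments for update_one, limited to changed fields."""
    query = {"_id": item["_id"]}
    # Only include fields that are actually present on the item
    update = {"$set": {f: item[f] for f in changed_fields if f in item}}
    return query, update
```

With the crawler above it could be called as `db.jobs_php.update_one(*build_update(item, ["detail", "location", "updated_at"]))`.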
2. Data Cleaning
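2.1 Correct Publish Date
The scraped "time" field comes in three forms: a month and day ("published on March 31"), "published yesterday", or a time of day for jobs posted today ("published at 11:31"):
"time" : "发布于03月31日",
"time" : "发布于昨天",
"time" : "发布于11:31"
These strings are normalized to a standard date format: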
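import datetime

from pymongo import MongoClient

# Same connection and database as in section 1
conn = MongoClient('127.0.0.1', 27017)
db = conn.iApp


def update(data):
    return db.jobs_php.update_one({"_id": data['_id']}, {"$set": data})


def clear_time():
    items = db.jobs_php.find({})
    for item in items:
        # Skip values without the "发布于" prefix (already normalized)
        if '发布于' not in item['time']:
            continue
        # "发布于03月31日" -> "2017-03-31"; the crawl year 2017 is hard-coded
        item['time'] = item['time'].replace("发布于", "2017-")
        item['time'] = item['time'].replace("月", "-")
        item['time'] = item['time'].replace("日", "")
        if "昨天" in item['time']:
            # "yesterday", resolved relative to the day this script runs
            item['time'] = str(datetime.date.today() - datetime.timedelta(days=1))
        elif ":" in item['time']:
            # A bare time of day means the job was posted today
            item['time'] = str(datetime.date.today())
        update(item)
    print('ok')
2.2 Normalize Salary to Numeric Values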
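The salary conversion can also be written as a pure function, which is easier to test than a loop over the collection. The sketch below assumes salaries always look like `5K-12K`, optionally with a decimal such as `5.5K`; `parse_salary` is a hypothetical name, not part of the original script, and anything that does not match returns `None` so it can be reviewed by hand.

```python
import re

# Matches ranges such as "5K-12K" or "5.5k-12k"
_RANGE = re.compile(r"^\s*(\d+(?:\.\d+)?)k\s*-\s*(\d+(?:\.\d+)?)k\s*$", re.IGNORECASE)


def parse_salary(raw):
    """Parse a salary range into low/high/avg; return None if it does not match."""
    match = _RANGE.match(raw)
    if match is None:
        return None
    # "K" means thousands; round() turns 5500.0 into the integer 5500
    low, high = (round(float(g) * 1000) for g in match.groups())
    return {"low": low, "high": high, "avg": (low + high) / 2}
```

Parsing the number before multiplying by 1000 lets decimal values such as `5.5K` through, which a plain text replacement of `k` with `000` would reject.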
"salary" : "5K-12K"
# Convert to:
"salary" : {
    "low" : 5000,
    "high" : 12000,
    "avg" : 8500.0
}
def clear_salary():
    items = db.jobs_lagou_php.find({})
    for item in items:
        if isinstance(item['salary'], dict):
            continue  # already converted
        salary_list = item['salary'].lower().replace("k", "000").split("-")
        if len(salary_list) != 2:
            print(salary_list)  # not a "low-high" range; leave for manual review
            continue
        try:
            salary_list = [int(x) for x in salary_list]
        except ValueError:
            print(salary_list)  # e.g. "5.5k" becomes "5.5000", which int() rejects
            continue
        item['salary'] = {
            'low': salary_list[0],
            'high': salary_list[1],
            'avg': (salary_list[0] + salary_list[1]) / 2
        }
        # Write back to the collection that was read (update() targets jobs_php)
        db.jobs_lagou_php.update_one({"_id": item['_id']}, {"$set": item})
    print('ok')
2.3 Classify Job Levels by Work Experience
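# Standardize workYear field
# Assumed helper: the original calls update_lagou() without showing it
def update_lagou(data):
    return db.jobs_lagou_php.update_one({"_id": data['_id']}, {"$set": data})


# Hypothetical wrapper: the original shows only the loop body
def clear_work_year():
    for item in db.jobs_lagou_php.find({}):
        if item['workYear'] == '应届毕业生':  # fresh graduate
            item['workYear'] = '应届生'
        elif item['workYear'] == '1年以下':  # under 1 year
            item['workYear'] = '1年以内'
        elif item['workYear'] == '不限':  # no requirement
            item['workYear'] = '经验不限'
        update_lagou(item)
After standardization, assign a numeric level: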
# workYear label -> numeric level
LEVELS = {
    '应届生': 1,     # fresh graduate
    '1年以内': 2,    # under 1 year
    '1-3年': 3,      # 1 to 3 years
    '3-5年': 4,      # 3 to 5 years
    '5-10年': 5,     # 5 to 10 years
    '10年以上': 6,   # over 10 years
    '经验不限': 10,  # no experience requirement
}


def set_level():
    items = db.jobs_zhipin_php.find({})
    for item in items:
        if item['workYear'] in LEVELS:
            item['level'] = LEVELS[item['workYear']]
        # Write back to the collection that was read
        db.jobs_zhipin_php.update_one({"_id": item['_id']}, {"$set": item})
    print('ok')
Note: Postings marked "经验不限" (no experience requirement) usually state their actual requirements in the job description, so their level data may be discarded later to keep the statistics accurate.
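As an illustration of discarding those records, the sketch below averages the normalized salary per level and skips level 10. It is an assumption about how the later statistics might be computed, not something taken from the original; `average_salary_by_level` and `UNSPECIFIED_LEVEL` are hypothetical names, and the input is any iterable of job documents shaped like the ones produced by the cleaning steps.

```python
from collections import defaultdict
from statistics import mean

UNSPECIFIED_LEVEL = 10  # the level assigned to "经验不限" by set_level()


def average_salary_by_level(jobs):
    """Average the 'avg' salary per level, skipping the unspecified level."""
    buckets = defaultdict(list)
    for job in jobs:
        level = job.get("level")
        salary = job.get("salary")
        if level is None or level == UNSPECIFIED_LEVEL:
            continue
        if not isinstance(salary, dict):
            continue  # salary was never normalized
        buckets[level].append(salary["avg"])
    return {level: mean(values) for level, values in sorted(buckets.items())}
```

It could be called with a cursor, for example `average_salary_by_level(db.jobs_zhipin_php.find({}))`.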
Source: Compiled from the web; all rights belong to the original author.
