Detailed Job Description Extraction and Data Cleaning with Python and MongoDB

This article shows how to scrape job descriptions and work addresses from job-portal detail pages with the Python libraries requests, BeautifulSoup4, and pymongo, and then how to clean and normalize the data stored in MongoDB, including publish dates, salaries, and work‑experience levels.

Python Programming Learning Circle

1. Extract Detailed Job Description Information

Detail Page Analysis

On the detail page, the most important fields are 职位描述 (job description) and 工作地址 (work address). However, the HTML places 岗位职责 (responsibilities) and 任职要求 (requirements) in the same div, which makes them difficult to separate during scraping.
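
Because both parts share one div, one option is to split the extracted text afterwards on the section headings. The following is a minimal sketch, assuming the description text contains the literal headings 岗位职责 and 任职要求 in that order. split_description is a hypothetical helper that is not part of the original script, and real postings are likely to use other heading variants.

```python
def split_description(text):
    """Split a job description into (responsibilities, requirements).

    Assumes the text contains the headings 岗位职责 and 任职要求 in that
    order. If either heading is missing or they are out of order, the whole
    text is returned as the first element so nothing is lost.
    """
    start = text.find("岗位职责")
    middle = text.find("任职要求")
    if start == -1 or middle == -1 or middle < start:
        return text.strip(), ""
    responsibilities = text[start + len("岗位职责"):middle]
    requirements = text[middle + len("任职要求"):]
    # Drop the colon (ASCII or full-width) and whitespace around each part.
    strip_chars = ":： \n\t"
    return responsibilities.strip(strip_chars), requirements.strip(strip_chars)


sample = "岗位职责：\n1. 开发后台接口\n任职要求：\n1. 熟悉PHP"
print(split_description(sample))
```

Returning the unsplit text when the headings are absent keeps the helper safe to run over every record.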

Libraries Used for Crawling

The crawler relies on three third-party libraries:

requests
BeautifulSoup4
pymongo

Python Code

"""
@author: jtahstu
@contact: [email protected]
@site: http://www.jtahstu.com
"""
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import time
from pymongo import MongoClient
headers = {
    'x-devtools-emulate-network-conditions-client-id': "5f2fc4da-c727-43c0-aad4-37fce8e3ff39",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    'dnt': "1",
    'accept-encoding': "gzip, deflate",
    'accept-language': "zh-CN,zh;q=0.8,en;q=0.6",
    'cookie': "__c=1501326829; ...",
    'cache-control': "no-cache",
    'postman-token': "76554687-c4df-0c17-7cc0-5bf3845c9831"
}
conn = MongoClient('127.0.0.1', 27017)
db = conn.iApp  # Use the iApp database; MongoDB creates it on first write if it does not exist

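def init():
    # Walk the stored postings in pid order and fill in missing details.
    items = db.jobs_php.find().sort('pid')
    for item in items:
        if 'detail' in item:
            continue  # this posting was already scraped
        detail_url = "https://www.zhipin.com/job_detail/%s.html?ka=search_list_1" % item['pid']
        print(detail_url)
        html = requests.get(detail_url, headers=headers)
        if html.status_code != 200:
            # Stop on any non-200 response, for example if the site starts blocking requests.
            print('status_code is %d' % html.status_code)
            break
        soup = BeautifulSoup(html.text, "html.parser")
        job = soup.select('.job-sec .text')
        if len(job) < 1:
            continue  # no description block on this page
        item['detail'] = job[0].text.strip()
        location = soup.select('.job-sec .job-location')
        # Guard against pages without an address block to avoid an IndexError.
        if location:
            item['location'] = location[0].text.strip()
        item['updated_at'] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        res = save(item)
        print(res)
        time.sleep(40)  # pause between requests to keep the crawl rate low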

def save(item):
    return db.jobs_php.update_one({"_id": item['_id']}, {"$set": item})

if __name__ == "__main__":
    init()

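The script loops over the postings already stored in MongoDB, skips those that have a detail field, fetches each remaining detail page, extracts the description and address, and writes them back, pausing 40 seconds between requests.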
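
One possible refinement to save() is to send only the newly scraped fields instead of the whole document. The following is a minimal sketch, assuming the three field names that init() sets. build_update is a hypothetical helper; its result would be passed as the second argument to update_one.

```python
def build_update(item, fields=("detail", "location", "updated_at")):
    """Build a $set update containing only the listed fields present in item."""
    changes = {key: item[key] for key in fields if key in item}
    return {"$set": changes}


item = {
    "_id": 1,
    "pid": "abc",
    "detail": "job description text",
    "updated_at": "2017-08-01 10:00:00",
}
print(build_update(item))
```

If none of the listed fields is present, the $set document is empty, and the caller should skip the write.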

2. Data Cleaning

2.1 Correct Publish Date

"time" : "发布于03月31日",
"time" : "发布于昨天",
"time" : "发布于11:31"

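The raw time field comes in the three forms shown above: a month and day, 昨天 (yesterday), or a clock time for postings published today. The following code normalizes all three to a YYYY-MM-DD date string: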

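import datetime
from pymongo import MongoClient

conn = MongoClient('127.0.0.1', 27017)
db = conn.iApp  # same database as the crawler

def update(data):
    return db.jobs_php.update_one({"_id": data['_id']}, {"$set": data})

def clear_time():
    items = db.jobs_php.find({})
    for item in items:
        # Skip records without the "发布于" prefix (already normalized).
        # str.find() returns -1 when nothing is found, which is truthy,
        # so a membership test is used instead.
        if '发布于' not in item['time']:
            continue
        # "发布于03月31日" -> "2017-03-31" (the year 2017 is hard-coded)
        item['time'] = item['time'].replace("发布于", "2017-")
        item['time'] = item['time'].replace("月", "-")
        item['time'] = item['time'].replace("日", "")
        if "昨天" in item['time']:
            # "发布于昨天": yesterday, relative to the day this script runs
            item['time'] = str(datetime.date.today() - datetime.timedelta(days=1))
        elif ":" in item['time']:
            # "发布于11:31": published today, relative to the day this script runs
            item['time'] = str(datetime.date.today())
        update(item)
    print('ok')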
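
The same rules can also be written as a pure function that is easy to test without a database. This is a minimal sketch, not the author's code: normalize_publish_time is a hypothetical name, the scrape date is passed in explicitly, and the year is an assumption because the source strings carry none.

```python
import datetime
import re


def normalize_publish_time(raw, today, year=2017):
    """Convert a scraped publish string to YYYY-MM-DD.

    today is the date the record was scraped; year is assumed because the
    source strings do not include one. Unknown formats are returned as-is.
    """
    if not raw.startswith("发布于"):
        return raw  # already normalized or unknown format
    rest = raw[len("发布于"):]
    if rest == "昨天":
        return str(today - datetime.timedelta(days=1))
    if re.fullmatch(r"\d{1,2}:\d{2}", rest):
        return str(today)
    match = re.fullmatch(r"(\d{1,2})月(\d{1,2})日", rest)
    if match:
        month, day = (int(part) for part in match.groups())
        return str(datetime.date(year, month, day))
    return raw


scraped_on = datetime.date(2017, 8, 1)
for raw in ("发布于03月31日", "发布于昨天", "发布于11:31"):
    print(raw, "->", normalize_publish_time(raw, scraped_on))
```

Passing the scrape date as an argument avoids tying "yesterday" and "today" to whenever the cleaning script happens to run.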

2.2 Normalize Salary to Numeric Values

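# Before:
#   "salary" : "5K-12K"
# After:
#   "salary" : {
#       "low" : 5000,
#       "high" : 12000,
#       "avg" : 8500.0
#   }
def clear_salary():
    items = db.jobs_lagou_php.find({})
    for item in items:
        # Skip records whose salary was already converted to a dict.
        if isinstance(item['salary'], dict):
            continue
        # "5K-12K" -> ["5000", "12000"]
        salary_list = item['salary'].lower().replace("k", "000").split("-")
        if len(salary_list) != 2:
            print(salary_list)  # unexpected format: log it and move on
            continue
        try:
            salary_list = [int(x) for x in salary_list]
        except ValueError:
            print(salary_list)  # non-numeric parts: log it and move on
            continue
        item['salary'] = {
            'low': salary_list[0],
            'high': salary_list[1],
            'avg': (salary_list[0] + salary_list[1]) / 2
        }
        # Write back to the collection that was read; update() above targets jobs_php.
        db.jobs_lagou_php.update_one({"_id": item['_id']}, {"$set": item})
    print('ok')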
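
The parsing step can be isolated from the database loop so that it can be checked on its own. This is a minimal sketch under the assumption that every valid salary has the form "<number>K-<number>K"; parse_salary is a hypothetical helper, and unlike the loop above it rejects ranges written without the K suffix.

```python
import re


def parse_salary(raw):
    """Parse a range such as "5K-12K" into low, high, and avg values.

    Returns None when the string does not match the expected pattern.
    """
    match = re.fullmatch(r"(\d+)k-(\d+)k", raw.strip().lower())
    if match is None:
        return None
    low, high = (int(part) * 1000 for part in match.groups())
    return {"low": low, "high": high, "avg": (low + high) / 2}


print(parse_salary("5K-12K"))
print(parse_salary("面议"))
```

Returning None for unmatched strings lets the caller decide whether to log, skip, or handle the record separately.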

2.3 Classify Job Levels by Work Experience

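# Standardize the workYear field so that both sites use the same labels.
# This fragment runs inside a loop over the Lagou records; update_lagou()
# is the author's helper and is not shown in the source.
if item['workYear'] == '应届毕业生':    # fresh graduate
    item['workYear'] = '应届生'
elif item['workYear'] == '1年以下':     # less than 1 year
    item['workYear'] = '1年以内'
elif item['workYear'] == '不限':        # no requirement
    item['workYear'] = '经验不限'
update_lagou(item)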

After standardization, assign a numeric level:

def set_level():
    items = db.jobs_zhipin_php.find({})
    for item in items:
        if item['workYear'] == '应届生':        # fresh graduate
            item['level'] = 1
        elif item['workYear'] == '1年以内':     # less than 1 year
            item['level'] = 2
        elif item['workYear'] == '1-3年':
            item['level'] = 3
        elif item['workYear'] == '3-5年':
            item['level'] = 4
        elif item['workYear'] == '5-10年':
            item['level'] = 5
        elif item['workYear'] == '10年以上':    # more than 10 years
            item['level'] = 6
        elif item['workYear'] == '经验不限':    # no experience requirement
            item['level'] = 10
        # Write back to the collection that was read; update() above targets jobs_php.
        db.jobs_zhipin_php.update_one({"_id": item['_id']}, {"$set": item})
    print('ok')

Note: Postings marked 经验不限 (no experience requirement) often state their actual requirements in the job description instead, so their level value of 10 does not reflect real seniority and may be discarded later to keep the statistics accurate.
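
The label standardization and the level assignment above can also be expressed as two lookup tables, which keeps the mapping in one place. This is a minimal sketch using the labels and level numbers from this section; WORK_YEAR_ALIASES, LEVELS, and level_for are hypothetical names.

```python
# Site-specific labels mapped to the canonical label used for levels.
WORK_YEAR_ALIASES = {
    "应届毕业生": "应届生",
    "1年以下": "1年以内",
    "不限": "经验不限",
}

# Canonical workYear label mapped to its numeric level.
LEVELS = {
    "应届生": 1,
    "1年以内": 2,
    "1-3年": 3,
    "3-5年": 4,
    "5-10年": 5,
    "10年以上": 6,
    "经验不限": 10,
}


def level_for(work_year):
    """Return the numeric level for a workYear value, or None if unknown."""
    canonical = WORK_YEAR_ALIASES.get(work_year, work_year)
    return LEVELS.get(canonical)


print(level_for("1年以下"))
print(level_for("3-5年"))
print(level_for("兼职"))
```

Unknown labels yield None instead of silently keeping a stale level, so they are easy to find and review.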

Source: Compiled from the web; all rights belong to the original author.
