Scrape Zhaopin Job Listings with Python & Visualize Salary and Skill Trends

This tutorial walks through using Python's requests and BeautifulSoup to crawl Zhaopin job postings, extract detailed information, compute average salaries, perform keyword frequency analysis with jieba, and visualize results with histograms and word clouds for better career insights.

MaGe Linux Operations

Sep 17, 2018

0. Introduction

This article extends an earlier, basic Zhaopin web-scraping tutorial. The environment used is Windows, Python 3.6, Sublime Text, and Chrome.

1. Find Job Links

Modify the regular expression to capture job detail URLs and company names from the search results page.

url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?' + urlencode(paras)
try:
    response = requests.get(url, headers=headers)

The query string does not need to be built by hand with urlencode. Pass the dictionary to requests.get through its params argument and let requests do the encoding:

url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?'
try:
    response = requests.get(url, params=paras, headers=headers)
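As a quick check that the two forms build the same URL, the request can be prepared without being sent. This is only a sketch: the keys in `paras` below (`jl`, `kw`, `p`) and their values are illustrative assumptions, not taken from this article.

```python
from urllib.parse import urlencode

import requests

# Hypothetical query parameters, for illustration only.
paras = {'jl': '全国', 'kw': 'python', 'p': 1}
base = 'https://sou.zhaopin.com/jobs/searchresult.ashx'

# Manual approach: encode the query string and append it yourself.
manual_url = base + '?' + urlencode(paras)

# params= approach: let requests encode the query string.
# prepare() builds the final URL without sending anything over the network.
prepared = requests.Request('GET', base, params=paras).prepare()

print(manual_url)
print(prepared.url)
```

Both lines should print the same URL, which is why the `params` form is usually the tidier choice.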

2. Calculate Average Salary

Salary strings take one of two forms: a range written as xxxx-yyyy, or 面议 ("negotiable"). The script takes the midpoint of the range as the average and records 0 for 面议 postings.

for item in items:
    salary_average = 0
    temp = item[3]
    if temp != '面议':
        idx = temp.find('-')
        salary_average = (int(temp[0:idx]) + int(temp[idx+1:])) // 2
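The slicing above assumes every value other than 面议 is a clean `xxxx-yyyy` string. A slightly more defensive variant, sketched here with a hypothetical helper named `average_salary`, returns 0 for anything that does not match that pattern:

```python
import re

# Matches strings such as '8000-10000'; anything else is treated as "no figure".
SALARY_RANGE = re.compile(r'^\s*(\d+)\s*-\s*(\d+)\s*$')

def average_salary(text):
    """Return the integer midpoint of an 'xxxx-yyyy' range, or 0 if it cannot be parsed."""
    match = SALARY_RANGE.match(text)
    if match is None:
        return 0  # covers '面议' and any unexpected format
    low, high = int(match.group(1)), int(match.group(2))
    return (low + high) // 2

print(average_salary('8000-10000'))  # 9000
print(average_salary('面议'))         # 0
```

Returning 0 keeps the convention used in the loop above, so later steps can still filter on that value.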

3. Parse Job Details

3.1 Web Page Parsing

After opening a job detail page, locate fields such as work experience, education, and company size using the HTML structure shown below.

# HTML structure example
<body>
    <div class="terminalpage clearfix">
        <div class="terminalpage-left">
            <ul class="terminal-ul clearfix">
                <li><span>工作经验:</span><strong>3-5年</strong></li>
                <li><span>最低学历:</span><strong>本科</strong></li>
            </ul>
        </div>
        <div class="terminalpage-right">
            <div class="company-box">
                <ul class="terminal-ul clearfix terminal-company mt20">
                    <li><span>公司规模:</span><strong>100-499人</strong></li>
                </ul>
            </div>
        </div>
    </div>
</body>

3.2 Code Implementation

Use BeautifulSoup instead of regular expressions to extract the required fields.

from bs4 import BeautifulSoup

def get_job_detail(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Extract experience and education
    lis = soup.find_all('strong')
    years = lis[4].get_text()
    education = lis[5].get_text()
    # Extract job responsibilities
    requirement = ''
    for terminalpage in soup.find_all('div', class_='terminalpage-main clearfix'):
        for box in terminalpage.find_all('div', class_='tab-cont-box'):
            cont = box.find_all('div', class_='tab-inner-cont')[0]
            ps = cont.find_all('p')
            for i in range(len(ps) - 1):
                requirement += ps[i].get_text().replace('\n', '').strip()
    # Extract company scale
    scale = soup.find(class_='terminal-ul clearfix terminal-company mt20').find_all('li')[0].strong.get_text()
    return {'years': years, 'education': education, 'requirement': requirement, 'scale': scale}
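Picking `<strong>` tags by position (`lis[4]`, `lis[5]`) breaks as soon as the page gains or loses a field. One possible alternative, sketched below, is to key each value by the label in its `<span>`. The helper name `fields_by_label` is hypothetical, and the sample markup is a trimmed copy of the structure from section 3.1 rather than a live page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the structure shown in section 3.1.
SAMPLE_HTML = """
<ul class="terminal-ul clearfix">
    <li><span>工作经验:</span><strong>3-5年</strong></li>
    <li><span>最低学历:</span><strong>本科</strong></li>
</ul>
<ul class="terminal-ul clearfix terminal-company mt20">
    <li><span>公司规模:</span><strong>100-499人</strong></li>
</ul>
"""

def fields_by_label(html):
    """Map each <span> label (trailing colon stripped) to its <strong> value."""
    soup = BeautifulSoup(html, 'html.parser')
    fields = {}
    for li in soup.find_all('li'):
        if li.span is None or li.strong is None:
            continue  # skip list items that are not label/value pairs
        label = li.span.get_text().strip().rstrip(':：')
        fields[label] = li.strong.get_text().strip()
    return fields

fields = fields_by_label(SAMPLE_HTML)
print(fields.get('工作经验'), fields.get('最低学历'), fields.get('公司规模'))
```

Using `fields.get(...)` means a missing field yields `None` instead of raising an `IndexError`.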

Append job descriptions to a .txt file and the other fields to a CSV file as each posting is processed, so the full result set never has to sit in memory.

import csv

# Append one row (a dict) or several rows (a list of dicts) to the CSV file
def write_csv_rows(path, headers, rows):
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        if isinstance(rows, dict):
            f_csv.writerow(rows)
        else:
            f_csv.writerows(rows)

def write_txt_file(path, txt):
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f.write(txt)
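Because the file is opened in append mode on every call, the function above never writes a header row itself. If a header is wanted, one option is to write it only when the file is new or empty. The sketch below assumes a hypothetical helper name (`append_rows_with_header`) and made-up column names and rows.

```python
import csv
import os
import tempfile

def append_rows_with_header(path, headers, rows):
    """Append dict rows to a CSV, writing the header only when the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='gb18030', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)

# Usage with made-up rows in a temporary directory.
headers = ['title', 'company', 'salary']
path = os.path.join(tempfile.mkdtemp(), 'jobs.csv')
append_rows_with_header(path, headers, [{'title': 'Python开发', 'company': 'A', 'salary': 9000}])
append_rows_with_header(path, headers, [{'title': '数据分析', 'company': 'B', 'salary': 0}])

with open(path, encoding='gb18030', newline='') as f:
    lines = f.read().splitlines()
print(lines)
```

The second call finds a non-empty file and therefore appends its row without repeating the header.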

4. Data Analysis

4.1 Salary Statistics

Read the salary column from the CSV, drop the entries stored as 0 (postings marked 面议), convert the remaining values to integers, and plot a histogram.

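import matplotlib.pyplot as plt

salaries = []
sal = read_csv_column(csv_filename, 3)
# sal[0] is assumed to be the header cell, so start at index 1
for value in sal[1:]:
    if value != '0':  # '0' marks a 面议 posting
        salaries.append(int(value))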
plt.hist(salaries, bins=10)
plt.show()
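The snippet above relies on a `read_csv_column` helper that is not shown in this article. One way it might look is sketched below. It assumes the file was written with the gb18030 encoding used earlier and that the first row is a header, and the sample rows are made up.

```python
import csv
import os
import tempfile

def read_csv_column(path, index):
    """Return every cell of column `index` (header included) as a list of strings."""
    with open(path, encoding='gb18030', newline='') as f:
        return [row[index] for row in csv.reader(f) if len(row) > index]

# Made-up data: column 3 holds the average salary, with 0 meaning 面议.
path = os.path.join(tempfile.mkdtemp(), 'jobs.csv')
with open(path, 'w', encoding='gb18030', newline='') as f:
    csv.writer(f).writerows([
        ['title', 'company', 'city', 'salary'],
        ['a', 'x', '北京', '9000'],
        ['b', 'y', '上海', '0'],
        ['c', 'z', '深圳', '15000'],
    ])

column = read_csv_column(path, 3)
salaries = [int(v) for v in column[1:] if v != '0']  # skip header, drop 面议 rows
print(salaries)  # [9000, 15000]
```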

4.2 Job Description Word Frequency

Read the saved job description text, segment it with jieba, remove stop words, and count word frequencies with pandas.

import jieba
import pandas as pd

content = read_txt_file(txt_filename)
segment = jieba.lcut(content)
words_df = pd.DataFrame({'segment': segment})
# One stop word per line; quoting=3 (csv.QUOTE_NONE) keeps quote characters literal
stopwords = pd.read_csv('stopwords.txt', header=None, names=['stopword'], sep='\t', quoting=3, encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
# Count occurrences of each remaining word, most frequent first
words_stat = words_df.groupby('segment').size().reset_index(name='count').sort_values('count', ascending=False)

After refining the stop‑word list, the top terms include design, system, project, framework, algorithm, etc.
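For a quick look at the top terms without building a DataFrame, the counting step can also be sketched with the standard library's `collections.Counter`. The token list and stop words below are hand-written stand-ins for `jieba.lcut(content)` and `stopwords.txt`, not real output.

```python
from collections import Counter

# Stand-in for jieba.lcut(content): a hand-written token list.
segment = ['负责', '系统', '设计', '的', '系统', '开发', '和', '项目', '系统', '设计']
stopwords = {'的', '和', '负责'}  # stand-in for the contents of stopwords.txt

# Keep tokens that are neither stop words nor pure whitespace, then count them.
counts = Counter(word for word in segment if word not in stopwords and word.strip())
top = counts.most_common(2)
print(top)
```

A `Counter` can also be passed straight to a word cloud's frequency-based generator, since it behaves like a dictionary of word to count.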

4.3 Word Cloud Visualization

Generate a word cloud from the frequency dictionary, using a mask image for shape.

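import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
# scipy.misc.imread was removed in SciPy 1.2, so load the mask with Pillow instead
color_mask = np.array(Image.open('background.jfif'))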
wordcloud = WordCloud(font_path='simhei.ttf', background_color='white', max_words=100, mask=color_mask, max_font_size=100, random_state=42, width=1000, height=860, margin=2)
word_frequence = {row['segment']: row['count'] for _, row in words_stat.head(100).iterrows()}
wordcloud.generate_from_frequencies(word_frequence)
image_colors = ImageColorGenerator(color_mask)
wordcloud.recolor(color_func=image_colors)
wordcloud.to_file('output.png')
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

5. Further Ideas

Analyze the relationship between years of experience and salary.

Compare salary differences across different job titles.

Use multithreading or multiprocessing to speed up crawling.
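As a starting point for the last idea, a thread pool from the standard library is often enough, because crawling spends most of its time waiting on the network. In the sketch below, `fetch_detail` is a placeholder for the real download-and-parse step and the URLs are hypothetical. A real crawler should also rate-limit its requests.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_detail(url):
    """Placeholder for the real download-and-parse step (e.g. requests.get + get_job_detail)."""
    return {'url': url, 'length': len(url)}

# Hypothetical detail-page URLs; a real run would use the links collected in step 1.
urls = ['https://example.com/job/%d' % i for i in range(5)]

# Crawling is I/O-bound, so threads can overlap the waiting time.
# map() returns results in the same order as the input URLs.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_detail, urls))

print(len(results), results[0]['url'])
```

Because `map()` preserves input order, the results can be written to the CSV in the same sequence as a single-threaded run.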
