Scrape Zhaopin Job Listings with Python & Visualize Salary and Skill Trends
This tutorial walks through using Python's requests and BeautifulSoup to crawl Zhaopin job postings, extract detailed information, compute average salaries, perform keyword frequency analysis with jieba, and visualize results with histograms and word clouds for better career insights.
0. Introduction
This article builds on a basic Zhaopin web-scraping tutorial. The environment used is Windows, Python 3.6, Sublime Text, and Chrome.
1. Find Job Links
Modify the regular expression to capture job detail URLs and company names from the search results page.
url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?' + urlencode(paras)
try:
    response = requests.get(url, headers=headers)
Replace the above with a simpler requests.get call that passes the query through the params argument, so requests builds the query string itself.
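To make the manual step concrete, the sketch below builds the query string by hand with `urlencode`, which is the work the `params` argument takes over. The keys in `paras` (`jl`, `kw`, `p`) and their values are placeholders assumed for illustration; the article does not list the real ones.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://sou.zhaopin.com/jobs/searchresult.ashx?'

# Placeholder parameters, assumed for illustration only.
paras = {'jl': '北京', 'kw': 'python', 'p': 1}

# Manual approach: percent-encode the dict and append it to the base URL.
manual_url = BASE_URL + urlencode(paras)
print(manual_url)

# Decoding the query again recovers the original values (as strings).
decoded = parse_qs(urlsplit(manual_url).query)
print(decoded)
```

Non-ASCII values such as the city name are percent-encoded as UTF-8, which is why the printed URL looks different from the dictionary even though it carries the same data.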
url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?'
try:
    response = requests.get(url, params=paras, headers=headers)
2. Calculate Average Salary
Salary strings are either a range of the form xxxx-yyyy or 面议 (negotiable). The script extracts the two bounds of the range and takes their average; negotiable postings are recorded as 0.
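As a minimal sketch of this rule, the helper below wraps the parsing in a function that can be tested on its own. The name `average_salary` and the choice to return 0 for malformed input are assumptions made here, not part of the original script.

```python
def average_salary(text):
    """Return the midpoint of an 'xxxx-yyyy' salary string, or 0 if unparseable."""
    low, sep, high = text.strip().partition('-')
    if not sep:
        return 0  # covers '面议' and anything else without a range
    try:
        return (int(low) + int(high)) // 2
    except ValueError:
        return 0  # bounds that are not plain integers


for raw in ['8000-10000', '15001-20000', '面议']:
    print(raw, average_salary(raw))
```

Returning 0 instead of raising keeps one odd posting from stopping a long crawl, at the cost of having to filter out the zeros before computing statistics.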
for item in items:
    salary_average = 0  # default used for 面议 postings
    temp = item[3]  # salary string
    if temp != '面议':
        idx = temp.find('-')
        # integer midpoint of the lower and upper bound
        salary_average = (int(temp[0:idx]) + int(temp[idx+1:])) // 2
3. Parse Job Details
3.1 Web Page Parsing
After opening a job detail page, locate fields such as work experience (工作经验), minimum education (最低学历), and company size (公司规模) in the HTML structure shown below.
# HTML structure example
<body>
  <div class="terminalpage clearfix">
    <div class="terminalpage-left">
      <ul class="terminal-ul clearfix">
        <li><span>工作经验:</span><strong>3-5年</strong></li>
        <li><span>最低学历:</span><strong>本科</strong></li>
      </ul>
    </div>
    <div class="terminalpage-right">
      <div class="company-box">
        <ul class="terminal-ul clearfix terminal-company mt20">
          <li><span>公司规模:</span><strong>100-499人</strong></li>
        </ul>
      </div>
    </div>
  </div>
</body>
3.2 Code Implementation
Use BeautifulSoup instead of regular expressions to extract the required fields.
from bs4 import BeautifulSoup

def get_job_detail(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Extract experience and education; the indices depend on the page layout
    lis = soup.find_all('strong')
    years = lis[4].get_text()
    education = lis[5].get_text()
    # Extract job responsibilities
    requirement = ''
    for terminalpage in soup.find_all('div', class_='terminalpage-main clearfix'):
        for box in terminalpage.find_all('div', class_='tab-cont-box'):
            cont = box.find_all('div', class_='tab-inner-cont')[0]
            ps = cont.find_all('p')
            # The last <p> is skipped
            for i in range(len(ps) - 1):
                requirement += ps[i].get_text().replace('\n', '').strip()
    # Extract company scale
    scale = soup.find(class_='terminal-ul clearfix terminal-company mt20').find_all('li')[0].strong.get_text()
    return {'years': years, 'education': education, 'requirement': requirement, 'scale': scale}
Write the job descriptions to a .txt file and the other fields to a CSV file, appending incrementally so the full result set never has to be held in memory.
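Appending row by row raises one practical question: when to write the header line. A minimal sketch, assuming the header should appear only when the file is new or empty; the helper name `append_rows`, the field names, and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile


def append_rows(path, fieldnames, rows, encoding='gb18030'):
    """Append dict rows to a CSV, writing the header only for a new or empty file."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding=encoding, newline='') as f:
        writer = csv.DictWriter(f, fieldnames)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)


# Usage with made-up rows in a temporary directory.
fields = ['title', 'company', 'salary']
path = os.path.join(tempfile.mkdtemp(), 'jobs.csv')
append_rows(path, fields, [{'title': 'Python工程师', 'company': 'A', 'salary': 9000}])
append_rows(path, fields, [{'title': '数据分析师', 'company': 'B', 'salary': 0}])

with open(path, encoding='gb18030', newline='') as f:
    lines = list(csv.reader(f))
print(lines)
```

Checking the file size rather than keeping a flag in memory means the header is still written exactly once if the crawl is stopped and resumed later.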
import csv

def write_csv_rows(path, headers, rows):
    # Append a single row (dict) or several rows (list of dicts)
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        if isinstance(rows, dict):
            f_csv.writerow(rows)
        else:
            f_csv.writerows(rows)

def write_txt_file(path, txt):
    # Append text to the job description file
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f.write(txt)
4. Data Analysis
4.1 Salary Statistics
Read the salary column from the CSV, skip entries stored as 0 (postings listed as 面议), convert the remaining values to integers, and plot a histogram.
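The code for this step calls `read_csv_column`, whose definition is not shown in this article. One possible implementation is sketched here together with the filtering, assuming the helper returns every cell of one column as a string, header row included. The sample file and its column layout are made up.

```python
import csv
import os
import tempfile


def read_csv_column(path, index, encoding='gb18030'):
    """Return all cells of column `index` as strings, header row included."""
    with open(path, encoding=encoding, newline='') as f:
        return [row[index] for row in csv.reader(f) if len(row) > index]


# Made-up sample file: column 3 holds the average salary, 0 means 面议.
path = os.path.join(tempfile.mkdtemp(), 'jobs.csv')
with open(path, 'w', encoding='gb18030', newline='') as f:
    csv.writer(f).writerows([
        ['title', 'company', 'city', 'salary'],
        ['a', 'b', 'c', '9000'],
        ['d', 'e', 'f', '0'],
        ['g', 'h', 'i', '17500'],
    ])

column = read_csv_column(path, 3)
# Skip the header row and the 0 placeholders before converting to int.
salaries = [int(v) for v in column[1:] if v != '0']
print(salaries)
```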
import matplotlib.pyplot as plt

salaries = []
sal = read_csv_column(csv_filename, 3)
# sal[0] is assumed to be the header row; '0' marks a 面议 posting
for value in sal[1:]:
    if value != '0':
        salaries.append(int(value))
plt.hist(salaries, bins=10)
plt.show()
4.2 Job Description Word Frequency
Read the saved job description text, segment it with jieba, remove stop words, and count frequencies with pandas.
import jieba
import pandas as pd

content = read_txt_file(txt_filename)
segment = jieba.lcut(content)
words_df = pd.DataFrame({'segment': segment})
stopwords = pd.read_csv('stopwords.txt', header=None, names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
# Count occurrences of each word and sort by frequency, highest first
words_stat = words_df.groupby('segment').size().reset_index(name='count').sort_values('count', ascending=False)
After refining the stop-word list, the top terms include design, system, project, framework, algorithm, etc.
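For a quick cross-check of the pandas result, the same filter-and-count step can be written with `collections.Counter`. This sketch assumes the text has already been segmented (for example by `jieba.lcut`); the token list and the stop words are made up.

```python
from collections import Counter

# Made-up tokens standing in for the output of jieba.lcut(content).
segment = ['负责', '系统', '设计', '和', '系统', '开发', ',',
           '熟悉', '算法', '和', '框架', '设计']
stopwords = {'和', ',', '负责', '熟悉'}

# Drop stop words and whitespace-only tokens, then count what is left.
counts = Counter(w for w in segment if w.strip() and w not in stopwords)

for word, n in counts.most_common(3):
    print(word, n)
```

If the totals disagree with `words_stat`, the usual cause is a stop-word file that was parsed differently from what was intended, for example a line containing a comma or a quote character.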
4.3 Word Cloud Visualization
Generate a word cloud from the frequency dictionary, using a mask image for shape.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

# scipy.misc.imread was removed in SciPy 1.2, so the mask is loaded with Pillow instead
color_mask = np.array(Image.open('background.jfif'))
wordcloud = WordCloud(font_path='simhei.ttf', background_color='white', max_words=100, mask=color_mask, max_font_size=100, random_state=42, width=1000, height=860, margin=2)
# Map each of the 100 most frequent words to its count
word_frequence = {row['segment']: row['count'] for _, row in words_stat.head(100).iterrows()}
wordcloud.generate_from_frequencies(word_frequence)
# Recolor the words using the colors of the mask image
image_colors = ImageColorGenerator(color_mask)
wordcloud.recolor(color_func=image_colors)
wordcloud.to_file('output.png')
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
5. Further Ideas
Analyze the relationship between years of experience and salary.
Compare salary differences across different job titles.
Use multithreading or multiprocessing to speed up crawling.
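For the last idea, a thread pool is often enough because crawling is I/O-bound. A minimal sketch using `concurrent.futures`: `fetch_detail` is a stand-in that only simulates a download, and the URLs are placeholders. A real crawler would call `requests.get` inside it and should limit its request rate.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_detail(url):
    """Stand-in for downloading one job detail page (no network access)."""
    time.sleep(0.05)  # simulate network latency
    return url, len(url)


# Placeholder URLs, not real job pages.
urls = ['https://example.com/job/%d' % i for i in range(8)]

# executor.map returns results in the same order as the input URLs.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch_detail, urls))

print(results[0])
```

Because the results keep the input order, they can be written to the CSV in the same sequence as a single-threaded crawl would produce.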
MaGe Linux Operations