Analyzing Python Job Trends from Zhaopin: Salary Distribution and Skill Word Clouds
This tutorial starts from Python job postings that were crawled from Zhaopin and stored in MongoDB. It cleans the data with pandas, visualizes national and city-level salary distributions with matplotlib, and generates word clouds of required skills with jieba and wordcloud, giving a complete end-to-end data analysis pipeline.
Introduction
The Python-related job postings were crawled from the Zhaopin website beforehand and saved into a MongoDB collection. This article picks up from there and demonstrates a full data-analysis workflow using Python.
Main Analysis Steps
Read data from MongoDB
Clean and transform the dataset
Analyze salary distribution across China
Identify top cities by job count
Plot salary trends and histograms
Generate word clouds for required skills
Data Reading
import pymongo
import pandas as pd
client = pymongo.MongoClient('localhost')
db = client['zhilian']
table = db['python']
df = pd.DataFrame([record for record in table.find()], columns=['zwmc','gsmc','zwyx','gbsj','gzdd','fkl','brief','zw_link','_id','save_date'])
print('Total rows: {}'.format(df.shape[0]))
Data Cleaning
Convert the saved date string to a datetime type, filter salary strings that match the pattern \d+-\d+, split the salary range into minimum and maximum values, and convert them to numeric types.
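To make the filtering rule concrete, here is a minimal sketch using only the standard `re` module. The sample strings are made up for illustration, and `parse_salary_range` is a hypothetical helper, not something from the original article, which does the same job with pandas string methods.

```python
import re

# Hypothetical helper for illustration only
SALARY_RANGE = re.compile(r'(\d+)-(\d+)')

def parse_salary_range(text):
    """Return (min, max) as ints if text is exactly 'NNNN-NNNN', else None."""
    match = SALARY_RANGE.fullmatch(text.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Made-up sample values, not taken from the crawled data
samples = ['8000-15000', '面议', '1000元以下', '10000-20000']
parsed = [parse_salary_range(s) for s in samples]
print(parsed)
```

Strings that are not a plain numeric range come back as `None`, which corresponds to the rows the cleaning step drops.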
# Convert save_date to datetime
df['save_date'] = pd.to_datetime(df['save_date'])
# Keep rows where the whole salary string has the form "XXXX-XXXX"
is_range = df['zwyx'].str.contains(r'^\d+-\d+$', regex=True, na=False)
df_clean = df[is_range].copy()
# Split the salary range into minimum and maximum columns
salary_parts = df_clean['zwyx'].str.split('-', n=1, expand=True)
df_clean['zwyx_min'] = pd.to_numeric(salary_parts[0])
df_clean['zwyx_max'] = pd.to_numeric(salary_parts[1])
# Remove duplicate job links
df_clean = df_clean.drop_duplicates(subset='zw_link')
print('Rows after cleaning: {}'.format(df_clean.shape[0]))
National Salary Distribution
Calculate the number of jobs per city, sort them, and plot the top 10 cities as a pie chart. Then create histograms for the minimum monthly salary, both with and without extreme values.
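Before reading the histogram, it helps to check how fixed bin edges assign values. The sketch below uses `numpy.histogram` with the bin edges from this section and a handful of made-up salaries (not crawled data). It shows that values below the first edge or above the last edge are silently left out of the counts.

```python
import numpy as np

# Same bin edges as the histogram in this section
bins = [3000, 6000, 9000, 12000, 15000, 18000, 21000, 24000, 100000]

# Made-up minimum salaries for illustration, not crawled data
sample_min = np.array([2500, 3000, 5999, 6000, 25000, 100000, 120000])

# Each bin is half-open [left, right), except the last one, which includes its right edge
counts, edges = np.histogram(sample_min, bins=bins)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print('{}-{}: {}'.format(left, right, count))

# Values below the first edge or above the last edge are not counted at all
outside = int(((sample_min < bins[0]) | (sample_min > bins[-1])).sum())
print('Outside all bins: {}'.format(outside))
```

If the real data contains postings with a minimum salary under 3000 RMB, extending the first edge down to 0 would keep them in the plot.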
# City job count
city_counts = df_clean['gzdd'].value_counts().reset_index()
city_counts.columns = ['city','count']
city_counts['percentage'] = (city_counts['count']/city_counts['count'].sum()*100).round(2).astype(str) + '%'
# Plot pie chart of top 10 cities
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
ax.pie(city_counts['count'].head(10), labels=city_counts['city'].head(10), autopct='%1.1f%%')
ax.set_title('Top 10 Cities by Python Job Count')
plt.show()
# Histogram of minimum salary
import numpy as np
bins = [3000,6000,9000,12000,15000,18000,21000,24000,100000]
plt.hist(df_clean['zwyx_min'], bins=bins, edgecolor='black')
plt.title('Distribution of Minimum Monthly Salary')
plt.xlabel('Salary (RMB)')
plt.ylabel('Number of Jobs')
plt.show()
Adjusted Salary Distribution
Exclude extreme salary values (e.g., >20,000 RMB) and re‑plot the histogram.
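To see why trimming matters, a quick comparison of mean and median before and after the cut is useful. This is a minimal sketch with made-up salaries; the 20,000 RMB threshold is the one used in this section.

```python
from statistics import mean, median

# Made-up minimum salaries for illustration, with one extreme value
salaries = [6000, 8000, 10000, 12000, 15000, 50000]

# Same rule as the adjusted histogram: keep values up to 20,000 RMB
trimmed = [s for s in salaries if s <= 20000]

print('All rows: mean={:.0f}, median={:.0f}'.format(mean(salaries), median(salaries)))
print('Trimmed:  mean={:.0f}, median={:.0f}'.format(mean(trimmed), median(trimmed)))
```

A single extreme value pulls the mean up far more than the median, which is why the trimmed histogram gives a clearer picture of typical postings.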
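df_adjust = df_clean[df_clean['zwyx_min'] <= 20000]
# Stop the bins at 21000; the higher bins are empty after filtering
bins_adjust = [3000,6000,9000,12000,15000,18000,21000]
plt.hist(df_adjust['zwyx_min'], bins=bins_adjust, edgecolor='black')
plt.title('Adjusted Minimum Salary Distribution (≤20,000 RMB)')
plt.xlabel('Salary (RMB)')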
plt.show()
City‑Specific Analysis (Beijing, Changsha)
Filter the dataset for specific cities, export the results to Excel, and plot city‑level salary distributions.
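The plotting step for a single city could look like the sketch below. `plot_city_min_salary` is a hypothetical helper, and the small DataFrame is made-up data standing in for `df_clean`; only the column names `gzdd` and `zwyx_min` come from the article.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_city_min_salary(df, city, label, bins):
    """Hypothetical helper: histogram of zwyx_min for rows whose gzdd contains `city`."""
    # Plain substring match; na=False treats missing locations as non-matching
    city_df = df[df['gzdd'].str.contains(city, regex=False, na=False)]
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(city_df['zwyx_min'], bins=bins, edgecolor='black')
    ax.set_title('Minimum Monthly Salary in {}'.format(label))
    ax.set_xlabel('Salary (RMB)')
    ax.set_ylabel('Number of Jobs')
    return city_df, ax

# Made-up rows standing in for df_clean
toy = pd.DataFrame({
    'gzdd': ['北京', '北京-朝阳区', '长沙', '上海'],
    'zwyx_min': [15000, 20000, 8000, 18000],
})
bins = [3000, 6000, 9000, 12000, 15000, 18000, 21000, 24000, 100000]
beijing_rows, ax = plot_city_min_salary(toy, '北京', 'Beijing', bins)
print('Beijing rows: {}'.format(beijing_rows.shape[0]))
```

Calling `plt.show()` afterwards displays the figure. The English `label` argument keeps Chinese characters out of the title, since matplotlib's default font may not render them.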
# Beijing analysis
beijing_df = df_clean[df_clean['gzdd'].str.contains('北京', regex=False, na=False)]
beijing_df.to_excel('zhilian_kw_python_beijing.xlsx')
print('Beijing rows: {}'.format(beijing_df.shape[0]))
# Changsha analysis
changsha_df = df_clean[df_clean['gzdd'].str.contains('长沙', regex=False, na=False)]
changsha_df.to_excel('zhilian_kw_python_changsha.xlsx')
print('Changsha rows: {}'.format(changsha_df.shape[0]))
Skill Word Clouds
Combine all job brief descriptions, segment Chinese text with jieba, and generate a word cloud shaped by a mask image.
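Word size in the cloud is driven by how often each token appears, so it can help to inspect the counts directly before drawing anything. The sketch below uses made-up, already segmented tokens in place of real `jieba.cut` output, and a hypothetical stop-word set; neither comes from the original article.

```python
from collections import Counter

# Made-up, already segmented tokens standing in for jieba.cut output
tokens = ['Python', '熟悉', 'Django', 'Python', '的', 'MySQL', '熟悉', 'Python', '和', 'Linux']

# Hypothetical stop-word set; a real one would be much longer
stop_words = {'的', '和', '熟悉'}

# Drop stop words and single-character tokens, then count what is left
kept = [t for t in tokens if t not in stop_words and len(t) > 1]
freq = Counter(kept)
print(freq.most_common(3))
```

A mapping like `freq` can also be passed to `WordCloud.generate_from_frequencies` instead of calling `generate` on a space-joined string, which gives direct control over which tokens reach the image.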
import jieba
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import os
from PIL import Image
import numpy as np
# Load all brief texts
brief_text = ' '.join(df_clean['brief'].astype(str).tolist())
# Segment text
words = ' '.join(jieba.cut(brief_text, cut_all=False))
# Load mask image
mask_path = os.path.join(os.path.dirname(__file__), 'colors.png')
mask = np.array(Image.open(mask_path))
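# Generate word cloud
# For Chinese text, also pass font_path pointing to a font file with CJK glyphs;
# the library's default font cannot render Chinese characters
wc = WordCloud(background_color='#F0F8FF', max_words=100, mask=mask, max_font_size=300, random_state=42)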
wc.generate(words)
# Recolor using mask image colors
image_colors = ImageColorGenerator(mask)
plt.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis('off')
plt.show()
# Save word cloud image
wc.to_file(os.path.join(os.path.dirname(__file__), 'brief_wordcloud.png'))
MaGe Linux Operations
