Analyzing Python Job Trends from Zhaopin: Salary Distribution and Skill Word Clouds

This tutorial walks through an end‑to‑end analysis of Python job postings crawled from Zhaopin: loading the postings from MongoDB, cleaning them with pandas, plotting national and city‑level salary distributions with matplotlib, and generating word clouds of required skills with wordcloud.

MaGe Linux Operations

Dec 14, 2018

Introduction

This article assumes that Python‑related job postings have already been crawled from the Zhaopin website and saved in a MongoDB collection. Starting from that collection, it demonstrates a full data‑analysis workflow in Python.

Main Analysis Steps

Read data from MongoDB

Clean and transform the dataset

Analyze salary distribution across China

Identify top cities by job count

Plot salary trends and histograms

Generate word clouds for required skills

Data Reading

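import pymongo
import pandas as pd

# Connect to the MongoDB instance on localhost (default port)
client = pymongo.MongoClient('localhost')
db = client['zhilian']
table = db['python']

# Load every document; the columns argument selects the fields and fixes their order
columns = ['zwmc', 'gsmc', 'zwyx', 'gbsj', 'gzdd', 'fkl', 'brief', 'zw_link', '_id', 'save_date']
df = pd.DataFrame(list(table.find()), columns=columns)
print('Total rows: {}'.format(df.shape[0]))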
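
The column names look like pinyin abbreviations of the field labels on Zhaopin. The sketch below maps them to English names on two made‑up records so that the later steps are easier to follow. The English names are inferred from the abbreviations (for example, zwyx read as 职位月薪, monthly salary) and the sample rows are invented; neither comes from the original crawl.

```python
import pandas as pd

# Assumed English names for the pinyin-abbreviated columns (inferred, not from the source)
COLUMN_NAMES = {
    'zwmc': 'job_title',
    'gsmc': 'company',
    'zwyx': 'monthly_salary',
    'gbsj': 'publish_date',
    'gzdd': 'location',
    'fkl': 'feedback_rate',
    'brief': 'brief',
    'zw_link': 'job_link',
}

# Made-up records standing in for the documents returned by table.find()
sample_records = [
    {'zwmc': 'Python开发工程师', 'gsmc': 'Example Co.', 'zwyx': '8000-10000', 'gzdd': '北京'},
    {'zwmc': '数据分析师', 'gsmc': 'Sample Ltd.', 'zwyx': '面议', 'gzdd': '长沙'},
]

# Fields missing from a record become NaN, exactly as with the real collection
sample_df = pd.DataFrame(sample_records, columns=list(COLUMN_NAMES))
readable_df = sample_df.rename(columns=COLUMN_NAMES)
print(readable_df[['job_title', 'monthly_salary', 'location']])
```

The rest of the article keeps the original abbreviated names, so the mapping is only a reading aid.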

Data Cleaning

Convert the saved date string to a datetime type, keep only the rows whose salary string has the form \d+-\d+, split each salary range into minimum and maximum values, convert them to numeric types, and drop postings that share the same job link.

# Convert save_date to datetime
df['save_date'] = pd.to_datetime(df['save_date'])
# Keep rows where salary has the exact form "XXXX-XXXX";
# na=False treats missing salaries as non-matches instead of raising an error
df_clean = df[df['zwyx'].str.contains(r'^\d+-\d+$', regex=True, na=False)].copy()
# Split salary range into two columns (minimum, maximum)
salary = df_clean['zwyx'].str.split('-', n=1, expand=True)
df_clean['zwyx_min'] = pd.to_numeric(salary[0])
df_clean['zwyx_max'] = pd.to_numeric(salary[1])
# Remove duplicate job links
df_clean = df_clean.drop_duplicates(subset='zw_link')
print('Rows after cleaning: {}'.format(df_clean.shape[0]))
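
To make the filtering and splitting concrete, here is a minimal sketch on made‑up salary strings. It assumes an anchored pattern, so values such as "面议" (negotiable) or "1000元以下" are dropped. The midpoint column is an extra added for illustration and is not used elsewhere in the article.

```python
import pandas as pd

# Made-up salary strings: only the first two have the "min-max" form
sample = pd.DataFrame({'zwyx': ['8000-10000', '15000-25000', '面议', '1000元以下', None]})

# Anchored pattern keeps exact "digits-digits" values;
# na=False treats missing values as non-matches
is_range = sample['zwyx'].str.contains(r'^\d+-\d+$', regex=True, na=False)
ranges = sample[is_range].copy()

# expand=True returns one column per piece of the split
parts = ranges['zwyx'].str.split('-', n=1, expand=True)
ranges['zwyx_min'] = pd.to_numeric(parts[0])
ranges['zwyx_max'] = pd.to_numeric(parts[1])

# Midpoint of the range (illustrative extra column)
ranges['zwyx_mid'] = (ranges['zwyx_min'] + ranges['zwyx_max']) / 2
print(ranges)
```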

National Salary Distribution

Calculate the number of jobs per city, sort them, and plot the top 10 cities as a pie chart. Then create histograms for the minimum monthly salary, both with and without extreme values.

import matplotlib.pyplot as plt

# City job count (value_counts already sorts in descending order)
city_counts = df_clean['gzdd'].value_counts().reset_index()
city_counts.columns = ['city', 'count']
city_counts['percentage'] = (city_counts['count'] / city_counts['count'].sum() * 100).round(2).astype(str) + '%'
# Plot pie chart of top 10 cities.
# City names are Chinese: if labels render as boxes, point
# plt.rcParams['font.sans-serif'] at a CJK font installed on your system.
top10 = city_counts.head(10)
fig, ax = plt.subplots(figsize=(10, 6))
# autopct shows each city's share of the top-10 total, not of all postings
ax.pie(top10['count'], labels=top10['city'], autopct='%1.1f%%')
ax.set_title('Top 10 Cities by Python Job Count')
plt.show()
# Histogram of minimum salary (values below 3000 fall outside the bins and are not drawn)
bins = [3000, 6000, 9000, 12000, 15000, 18000, 21000, 24000, 100000]
plt.hist(df_clean['zwyx_min'], bins=bins, edgecolor='black')
plt.title('Distribution of Minimum Monthly Salary')
plt.xlabel('Salary (RMB)')
plt.ylabel('Number of Jobs')
plt.show()
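
A pie chart of only the top 10 cities normalizes over the slices that are drawn, so the percentages on the chart differ from the percentage column computed over all cities. One possible way to reconcile the two is to fold the remaining cities into a single "Other" slice. The helper name top_n_with_other and the sample counts below are hypothetical.

```python
import pandas as pd


def top_n_with_other(counts, n=10, other_label='Other'):
    """Keep the n largest counts and sum the remainder into one extra entry."""
    ordered = counts.sort_values(ascending=False)
    top = ordered.head(n)
    rest = ordered.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({other_label: rest})])
    return top


# Made-up job counts per city
counts = pd.Series({'北京': 50, '上海': 30, '深圳': 10, '长沙': 6, '武汉': 4})
shares = top_n_with_other(counts, n=3)

# Percentages now refer to all postings, including the "Other" group
percent = (shares / shares.sum() * 100).round(1)
print(percent)
```

Passing shares (values and index) to ax.pie would then show slices that add up to the full dataset.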

Adjusted Salary Distribution

Exclude extreme salary values (e.g., >20,000 RMB) and re‑plot the histogram.

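df_adjust = df_clean[df_clean['zwyx_min'] <= 20000]
# Use bins that stop just above the cutoff so the x-axis is not stretched to 100,000
adjust_bins = list(range(3000, 21001, 3000))
plt.hist(df_adjust['zwyx_min'], bins=adjust_bins, edgecolor='black')
plt.title('Adjusted Minimum Salary Distribution (≤20,000 RMB)')
plt.xlabel('Salary (RMB)')
plt.ylabel('Number of Jobs')
plt.show()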
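
To check what the histograms show without reading values off a chart, the same bins can be tabulated numerically. The sketch below uses np.histogram, which follows the same binning rule as plt.hist, on made‑up salaries; the numbers are illustrative, not results from the crawl.

```python
import numpy as np

bins = [3000, 6000, 9000, 12000, 15000, 18000, 21000, 24000, 100000]
# Made-up minimum salaries
salaries = np.array([2500, 3000, 5999, 6000, 8000, 15000, 20000, 24000, 100000])

# Each bin is [left, right), except the last, which also includes its right edge;
# values outside the overall range (here 2500) are ignored
counts, edges = np.histogram(salaries, bins=bins)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print('{:>6}-{:<6} {}'.format(int(left), int(right), int(count)))

# Apply the <= 20000 filter and recount with the same bins
adjusted = salaries[salaries <= 20000]
adjusted_counts, _ = np.histogram(adjusted, bins=bins)
removed = len(salaries) - len(adjusted)
print('Rows removed by the filter: {}'.format(removed))
```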

City‑Specific Analysis (Beijing, Changsha)

Filter the dataset for specific cities, export the results to Excel, and plot city‑level salary distributions.

# Beijing analysis: match on the city name so that values with a suffix are included;
# na=False treats missing locations as non-matches
beijing_df = df_clean[df_clean['gzdd'].str.contains('北京', na=False)]
# Writing .xlsx requires an Excel writer engine such as openpyxl
beijing_df.to_excel('zhilian_kw_python_beijing.xlsx')
print('Beijing rows: {}'.format(beijing_df.shape[0]))
# Changsha analysis
changsha_df = df_clean[df_clean['gzdd'].str.contains('长沙', na=False)]
changsha_df.to_excel('zhilian_kw_python_changsha.xlsx')
print('Changsha rows: {}'.format(changsha_df.shape[0]))
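
The per‑city filtering can be wrapped in a small helper so that more cities are easy to add. The sketch below is one possible shape, run on made‑up rows; the function names city_subset and salary_summary are hypothetical, and the district suffix in the sample data is an assumption about how locations may be stored.

```python
import pandas as pd


def city_subset(df, city):
    """Return rows whose gzdd (location) contains the city name; missing values never match."""
    return df[df['gzdd'].str.contains(city, regex=False, na=False)]


def salary_summary(df, cities):
    """One row per city: number of postings plus mean and median minimum salary."""
    rows = []
    for city in cities:
        subset = city_subset(df, city)
        rows.append({
            'city': city,
            'jobs': len(subset),
            'mean_min': subset['zwyx_min'].mean(),
            'median_min': subset['zwyx_min'].median(),
        })
    return pd.DataFrame(rows).set_index('city')


# Made-up cleaned rows
sample = pd.DataFrame({
    'gzdd': ['北京', '北京-朝阳区', '长沙', '上海', None],
    'zwyx_min': [10000, 14000, 6000, 12000, 8000],
})
summary = salary_summary(sample, ['北京', '长沙'])
print(summary)
```

Each subset returned by city_subset can be passed to plt.hist in the same way as the national data to draw a city‑level salary distribution.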

Skill Word Clouds

Combine all job brief descriptions, segment Chinese text with jieba, and generate a word cloud shaped by a mask image.

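import os

import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

# Directory of this script; __file__ is undefined in notebooks, so fall back to the working directory
base_dir = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd()
# Load all brief texts, skipping missing values
brief_text = ' '.join(df_clean['brief'].dropna().astype(str))
# Segment text (precise mode)
words = ' '.join(jieba.cut(brief_text, cut_all=False))
# Load mask image
mask = np.array(Image.open(os.path.join(base_dir, 'colors.png')))
# WordCloud's default font has no Chinese glyphs, so Chinese words render as boxes.
# Assumption: replace this example path with a CJK font file available on your system.
font_path = 'simhei.ttf'
# Generate word cloud
wc = WordCloud(font_path=font_path, background_color='#F0F8FF', max_words=100, mask=mask, max_font_size=300, random_state=42)
wc.generate(words)
# Recolor using mask image colors (recolor modifies wc in place)
image_colors = ImageColorGenerator(mask)
plt.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis('off')
plt.show()
# Save the recolored word cloud image
wc.to_file(os.path.join(base_dir, 'brief_wordcloud.png'))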
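
A word cloud gives a visual impression but no numbers. As a complementary check, the sketch below counts how many postings mention each English technology term. This is a swapped‑in technique, not the article's method: it uses a regular expression on ASCII tokens instead of jieba segmentation, so Chinese skill words are ignored. The function name count_ascii_terms and the sample briefs are made up.

```python
import re
from collections import Counter


def count_ascii_terms(texts, top=5):
    """Count ASCII terms (e.g. Python, MySQL, C++) case-insensitively, once per posting."""
    pattern = re.compile(r'[A-Za-z][A-Za-z0-9+#.]*')
    counter = Counter()
    for text in texts:
        # A set ensures each posting contributes at most once per term;
        # rstrip('.') removes a trailing sentence period
        terms = {match.lower().rstrip('.') for match in pattern.findall(text)}
        counter.update(terms)
    return counter.most_common(top)


# Made-up job briefs
briefs = [
    '熟悉Python和Django，了解MySQL数据库',
    '精通Python，熟悉Linux、MySQL',
    '有Python爬虫经验，熟悉Scrapy.',
]
print(count_ascii_terms(briefs, top=3))
```

On the real data, the input would be df_clean['brief'].dropna().astype(str).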

