What Do Python Job Listings Reveal? Salary and Skill Insights Across Chinese Cities
This tutorial walks through analyzing Python job postings scraped from Zhaopin, covering data extraction from MongoDB, cleaning, salary distribution analysis, top city rankings, and generating word clouds of required skills, with complete Python code and visualizations.
Overview
This article demonstrates a step‑by‑step analysis of Python job postings collected from the Zhaopin website. It explores salary distribution, the top hiring cities in China, and the most frequent skill requirements using Python libraries such as pymongo, pandas, matplotlib, and wordcloud.
Data Retrieval and Preparation
The raw data is stored in a MongoDB database. It is loaded into a pandas DataFrame for further processing.
import pymongo
import pandas as pd
client = pymongo.MongoClient('localhost')
db = client['zhilian']
table = db['python']
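# Load all records into a DataFrame
# The column names appear to be pinyin abbreviations (an interpretation, not stated in the source):
# zwmc = job title, gsmc = company, zwyx = monthly salary, gbsj = posting date, gzdd = work location, fkl = feedback rate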
df = pd.DataFrame(list(table.find()), columns=['zwmc','gsmc','zwyx','gbsj','gzdd','fkl','brief','zw_link','_id','save_date'])
print('Total rows: {} rows'.format(df.shape[0]))
Data Cleaning
Key cleaning steps include converting the saved date to datetime, filtering salary strings that match the pattern XXXX-XXXX, splitting the salary range into minimum and maximum values, and converting these values to numeric types.
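As a small-scale illustration of these steps, the sketch below runs the filter-and-split on a handful of made-up salary strings (not real Zhaopin data). It uses `str.extract` as one possible alternative that filters and splits in a single pass; the `sample` DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up salary strings in the style of the zwyx column (illustrative only)
sample = pd.DataFrame({'zwyx': ['8000-15000', '面议', '10000-20000', None, '1000元以下']})

# Capture both numbers only when the whole string is "number-number";
# rows that do not match (or are missing) come back as NaN
parts = sample['zwyx'].str.extract(r'^(\d+)-(\d+)$')
parts.columns = ['zwyx_min', 'zwyx_max']

# Drop the non-matching rows, then convert the captured strings to numbers
cleaned = sample.join(parts).dropna(subset=['zwyx_min', 'zwyx_max']).copy()
cleaned['zwyx_min'] = pd.to_numeric(cleaned['zwyx_min'])
cleaned['zwyx_max'] = pd.to_numeric(cleaned['zwyx_max'])

print(cleaned)
```

Only the two well-formed ranges survive; values such as "面议" (negotiable) and missing entries are discarded rather than raising an error during numeric conversion.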
# Convert save_date to datetime
df['save_date'] = pd.to_datetime(df['save_date'])
# Keep only rows whose salary is exactly "number-number"; na=False excludes missing values
df_clean = df[df['zwyx'].str.match(r'^\d+-\d+$', na=False)].copy()
# Split salary into min and max
df_clean[['zwyx_min','zwyx_max']] = df_clean['zwyx'].str.split('-', expand=True)
df_clean['zwyx_min'] = pd.to_numeric(df_clean['zwyx_min'])
df_clean['zwyx_max'] = pd.to_numeric(df_clean['zwyx_max'])
Removing Duplicates
Duplicate job postings are identified by the URL field and removed.
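A minimal sketch of URL-based deduplication on made-up rows is shown here. The example URLs and job titles are hypothetical; `drop_duplicates` is shown as an equivalent shortcut to building a boolean mask with `duplicated`.

```python
import pandas as pd

# Made-up postings; the first two rows share the same URL
sample = pd.DataFrame({
    'zwmc': ['Python Developer', 'Python Developer', 'Data Analyst'],
    'zw_link': ['http://example.com/job/1',
                'http://example.com/job/1',
                'http://example.com/job/2'],
})

# True for every repeat of a URL after its first occurrence
dup_mask = sample.duplicated(subset='zw_link')

# Same result as sample[~dup_mask]: the first row per URL is kept
deduped = sample.drop_duplicates(subset='zw_link')

print('Duplicates found:', dup_mask.sum())
print(deduped)
```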
# Check for duplicate URLs
duplicates = df_clean.duplicated(subset='zw_link')
print('Duplicates found:', duplicates.sum())
# Keep only unique entries
df_clean = df_clean[~duplicates]
Salary Distribution Across China
The cleaned data is used to compute the number of postings per city, rank the top 10 cities, and visualise the distribution with a pie chart and a histogram of minimum salaries.
# List of major cities to consider
ADDRESS = ['北京','上海','广州','深圳','天津','武汉','西安','成都','大连','长春','沈阳','南京','济南','青岛','杭州','苏州','无锡','宁波','重庆','郑州','长沙','福州','厦门','哈尔滨','石家庄','合肥','惠州','太原','昆明','烟台','佛山','南昌','贵阳','南宁']
# Extract city name from location field
df_city = df_clean.copy()
for city in ADDRESS:
    # Collapse values such as "北京-海淀区" to the bare city name
    df_city['gzdd'] = df_city['gzdd'].replace(city + '.*', city, regex=True)
# Count postings per city
df_city_counts = df_city.groupby('gzdd')[['zwmc','gsmc']].count()
df_city_counts['percentage'] = (df_city_counts['zwmc'] / df_city_counts['zwmc'].sum() * 100).round(2)
df_city_counts = df_city_counts.rename(columns={'zwmc':'number'}).reset_index()
df_city_counts['label'] = df_city_counts['gzdd'] + ' ' + df_city_counts['percentage'].astype(str) + '%'
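The counts above are not yet sorted. One possible way to rank the top cities is sketched below with `value_counts`; the helper name `top_cities` and the sample data are hypothetical and not part of the original script.

```python
import pandas as pd

def top_cities(city_series, n=10):
    """Return the n most frequent cities with their share of all postings (%)."""
    counts = city_series.value_counts()          # sorted in descending order
    top = counts.head(n).rename('number').to_frame()
    top['percentage'] = (top['number'] / counts.sum() * 100).round(2)
    return top

# Made-up city column for illustration
sample = pd.Series(['北京', '北京', '上海', '深圳', '北京', '上海'])
ranking = top_cities(sample, n=2)
print(ranking)
```

With the article's data, the same helper could presumably be called as `top_cities(df_city['gzdd'])` once the location field has been normalised.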
# Plot pie chart of city distribution
import matplotlib.pyplot as plt
import numpy as np  # needed for np.arange in the colour map below
sizes = df_city_counts['number']
labels = None # omit labels for clarity
plt.figure(figsize=(10,6))
plt.pie(sizes, colors=plt.cm.PiYG(np.arange(len(sizes))/len(sizes)), startangle=0, shadow=False)
plt.title('Distribution of job postings by city', loc='center')
plt.savefig('job_distribute.jpg')
plt.show()
A histogram of the minimum monthly salary is also generated.
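Before plotting, it can help to tabulate the bin counts as plain numbers. The sketch below uses `numpy.histogram` with the same bin edges as the plotting code and a few made-up salaries; the `sample_min` values are assumptions for illustration.

```python
import numpy as np

# Same bin edges as the histogram code in this section
bins = [3000, 6000, 9000, 12000, 15000, 18000, 21000, 24000, 100000]

# Made-up minimum salaries; 2000 lies below the first edge
sample_min = [4000, 8000, 8500, 15000, 30000, 2000]

# counts[i] is the number of values in [edges[i], edges[i+1]);
# the last bin also includes its right edge
counts, edges = np.histogram(sample_min, bins=bins)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print('{:>6}-{:<6} {}'.format(left, right, count))
```

Two things are worth noting: values outside the bin range (here 2000) are silently dropped, and because the last bin is much wider than the others, a density-normalised plot draws it as a very low bar even when it holds many postings.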
# Histogram of minimum salary
bins = [3000,6000,9000,12000,15000,18000,21000,24000,100000]
plt.figure(figsize=(10,8))
plt.hist(df_clean['zwyx_min'], bins=bins, density=True, histtype='bar', facecolor='g', rwidth=0.8)
plt.title('Hist of min monthly salary in China', size=14)
plt.xlabel('min monthly salary (RMB)')
plt.ylabel('Density')  # density=True plots a normalised density, not raw counts
plt.savefig('salary_quanguo_min.jpg')
plt.show()
City‑Specific Analysis (Beijing and Changsha)
Separate DataFrames are created for Beijing and Changsha to examine local salary trends and skill requirements.
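One way to put the two cities side by side is a grouped summary, sketched here on made-up rows. It assumes that the `gzdd` value starts with the city name and that `zwyx_min` and `zwyx_max` are already numeric; the `city` column and the sample figures are hypothetical.

```python
import pandas as pd

# Made-up postings for illustration only
sample = pd.DataFrame({
    'gzdd': ['北京-海淀区', '北京', '长沙', '长沙-岳麓区', '上海'],
    'zwyx_min': [10000, 15000, 6000, 8000, 12000],
    'zwyx_max': [20000, 25000, 10000, 12000, 18000],
})

# Tag each row with Beijing or Changsha; other cities become NaN
sample['city'] = sample['gzdd'].str.extract(r'^(北京|长沙)', expand=False)

# Mean minimum and maximum salary per city, ignoring untagged rows
summary = (sample.dropna(subset=['city'])
                 .groupby('city')[['zwyx_min', 'zwyx_max']]
                 .mean())
print(summary)
```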
# Beijing specific data
df_beijing = df_clean[df_clean['gzdd'].str.contains('北京.*', regex=True)]
df_beijing.to_excel('zhilian_kw_python_bj.xlsx')
print('Beijing rows:', df_beijing.shape[0])
# Changsha specific data
df_changsha = df_clean[df_clean['gzdd'].str.contains('长沙.*', regex=True)]
df_changsha.to_excel('zhilian_kw_python_cs.xlsx')
print('Changsha rows:', df_changsha.shape[0])
Skill Word Cloud Generation
The job description field brief is concatenated, tokenised with jieba, and visualised as a word cloud.
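The script below reads the concatenated text from a file, but the article does not show how that file was produced. A minimal sketch of the concatenation step, assuming a DataFrame with a `brief` column, might look like this; the sample rows and the output file name `brief_sample.txt` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up job descriptions; one row is missing
sample = pd.DataFrame({'brief': ['熟悉Python和Django', None, '负责爬虫开发']})

# Join all non-missing descriptions into one block of text
text = '\n'.join(sample['brief'].dropna().astype(str))

# Write as UTF-8 so the Chinese text can be read back reliably
out_path = os.path.join(tempfile.gettempdir(), 'brief_sample.txt')
with open(out_path, 'w', encoding='utf-8') as f:
    f.write(text)

print('Wrote {} characters to {}'.format(len(text), out_path))
```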
import jieba
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import os
from PIL import Image
import numpy as np
# Load concatenated brief text
with open('brief_quanguo.txt','rb') as f:
    text = f.read().decode('utf-8')
# Tokenise Chinese text
wordlist = jieba.cut(text, cut_all=False)
wordlist_space_split = ' '.join(wordlist)
# Load mask image for word cloud shape
d = os.path.dirname(__file__)
mask = np.array(Image.open(os.path.join(d,'colors.png')))
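# Note: WordCloud's default font has no Chinese glyphs; pass font_path= pointing to a CJK-capable font file to render Chinese words
wc = WordCloud(background_color='#F0F8FF', max_words=100, mask=mask, max_font_size=300, random_state=42)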
wc.generate(wordlist_space_split)
image_colors = ImageColorGenerator(mask)
plt.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis('off')
plt.show()
wc.to_file(os.path.join(d,'brief_quanguo_colors_cloud.png'))
Conclusion
The analysis reveals that Beijing, Shanghai, and Shenzhen dominate the Python job market in China, with average minimum salaries ranging from 3,000 to 12,000 RMB. The word‑cloud visualisation highlights frequently demanded skills such as "爬虫" (web crawling), "Django", and "数据分析" (data analysis). The provided Python scripts can be adapted to other datasets for similar market‑trend investigations.
MaGe Linux Operations
