What Do Python Job Listings Reveal? Salary and Skill Insights Across Chinese Cities

This tutorial walks through analyzing Python job postings scraped from Zhaopin, covering data extraction from MongoDB, cleaning, salary distribution analysis, top city rankings, and generating word clouds of required skills, with complete Python code and visualizations.

MaGe Linux Operations

Nov 27, 2017

Overview

This article walks step by step through an analysis of Python job postings collected from the Zhaopin website. It explores salary distribution, the top hiring cities in China, and the most frequent skill requirements using Python libraries such as pymongo, pandas, matplotlib, jieba, and wordcloud.

Data Retrieval and Preparation

The raw data is stored in a MongoDB database. It is loaded into a pandas DataFrame for further processing.

import pymongo
import pandas as pd

client = pymongo.MongoClient('localhost')
db = client['zhilian']
table = db['python']

# Load all records into a DataFrame.
# Fields used later: zwyx (salary), gzdd (location), brief (description), zw_link (posting URL)
df = pd.DataFrame(list(table.find()), columns=['zwmc','gsmc','zwyx','gbsj','gzdd','fkl','brief','zw_link','_id','save_date'])
print('Total rows: {}'.format(df.shape[0]))

Data Cleaning

Key cleaning steps include converting the saved date to datetime, filtering salary strings that match the pattern XXXX-XXXX, splitting the salary range into minimum and maximum values, and converting these values to numeric types.

# Convert save_date to datetime
df['save_date'] = pd.to_datetime(df['save_date'])

# Keep only rows where the whole salary string has the form "number-number".
# The ^...$ anchors reject strings with extra text around the range, and
# na=False treats a missing salary as a non-match instead of producing NaN in the mask.
df_clean = df[df['zwyx'].str.contains(r'^\d+-\d+$', regex=True, na=False)].copy()

# Split salary into min and max
df_clean[['zwyx_min','zwyx_max']] = df_clean['zwyx'].str.split('-', expand=True)
df_clean['zwyx_min'] = pd.to_numeric(df_clean['zwyx_min'])
df_clean['zwyx_max'] = pd.to_numeric(df_clean['zwyx_max'])
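
To make the filtering rule concrete, here is a minimal standard-library sketch of the same idea applied to single strings. The helper name parse_salary and the sample values are hypothetical and are not taken from the scraped data; the sketch only illustrates which kinds of strings survive a strict "number-number" match.

```python
import re

# Matches strings that consist solely of "<digits>-<digits>", e.g. "8000-15000".
SALARY_RANGE = re.compile(r'^(\d+)-(\d+)$')

def parse_salary(text):
    """Return (min, max) as integers, or None if text is not a plain range."""
    if not isinstance(text, str):
        return None  # e.g. None or NaN from a missing field
    match = SALARY_RANGE.match(text.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Hypothetical sample values for illustration only
samples = ['8000-15000', '面议', '1000元以下', None]
parsed = [parse_salary(s) for s in samples]
print(parsed)  # [(8000, 15000), None, None, None]
```

Rows that come back as None here correspond to the rows the pandas filter drops, so it is worth checking how many postings are lost before drawing conclusions from the remainder.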

Removing Duplicates

Duplicate job postings are identified by the URL field and removed.

# Check for duplicate URLs
duplicates = df_clean.duplicated(subset='zw_link')
print('Duplicates found:', duplicates.sum())
# Keep only unique entries
df_clean = df_clean[~duplicates]
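
As a small illustration of how duplicated behaves, the sketch below runs it on a toy frame. The rows and URLs are made up for the example; only the column names follow the article.

```python
import pandas as pd

# Hypothetical toy data: the first two rows share the same posting URL.
toy = pd.DataFrame({
    'zwmc': ['Python engineer', 'Python engineer', 'Data analyst'],
    'zw_link': ['http://example.com/1', 'http://example.com/1', 'http://example.com/2'],
})

# keep='first' (the default) flags only the later copies as duplicates.
mask = toy.duplicated(subset='zw_link', keep='first')

# drop_duplicates is an equivalent one-step way to keep the first copy of each URL.
deduped = toy.drop_duplicates(subset='zw_link', keep='first').reset_index(drop=True)

print(mask.tolist())  # [False, True, False]
print(len(deduped))   # 2
```

Keying on the URL assumes that the same posting always carries the same link; reposted jobs with a fresh URL would not be caught this way.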

Salary Distribution Across China

The cleaned data is used to compute the number of postings per city, rank the top 10 cities, and visualise the distribution with a pie chart and a histogram of minimum salaries.

# List of major cities to consider
ADDRESS = ['北京','上海','广州','深圳','天津','武汉','西安','成都','大连','长春','沈阳','南京','济南','青岛','杭州','苏州','无锡','宁波','重庆','郑州','长沙','福州','厦门','哈尔滨','石家庄','合肥','惠州','太原','昆明','烟台','佛山','南昌','贵阳','南宁']

# Extract city name from location field: anything after the city name is dropped
df_city = df_clean.copy()
for city in ADDRESS:
    df_city['gzdd'] = df_city['gzdd'].replace(city + '.*', city, regex=True)

# Count postings per city
df_city_counts = df_city.groupby('gzdd')[['zwmc','gsmc']].count()
df_city_counts['percentage'] = (df_city_counts['zwmc'] / df_city_counts['zwmc'].sum() * 100).round(2)
df_city_counts = df_city_counts.rename(columns={'zwmc':'number'}).reset_index()
df_city_counts['label'] = df_city_counts['gzdd'] + ' ' + df_city_counts['percentage'].astype(str) + '%'

# Plot pie chart of city distribution
import matplotlib.pyplot as plt
import numpy as np  # needed for the colour ramp below

sizes = df_city_counts['number']
# Slice labels are omitted from the pie for clarity
plt.figure(figsize=(10,6))
plt.pie(sizes, colors=plt.cm.PiYG(np.arange(len(sizes))/len(sizes)), startangle=0, shadow=False)
plt.title('Distribution of job postings by city', loc='center')
plt.savefig('job_distribute.jpg')
plt.show()
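
The listing counts postings per city but does not show the ranking step itself. One possible way to pick the top 10 is sketched below, assuming a frame shaped like df_city_counts with gzdd and number columns. The helper name top_cities and the toy counts are hypothetical.

```python
import pandas as pd

def top_cities(counts, n=10):
    """Return the n rows with the most postings, largest first."""
    return counts.nlargest(n, 'number').reset_index(drop=True)

# Hypothetical counts shaped like df_city_counts (not real results)
toy_counts = pd.DataFrame({
    'gzdd': ['北京', '上海', '长沙', '深圳'],
    'number': [120, 90, 15, 60],
})

ranked = top_cities(toy_counts, n=3)
print(ranked['gzdd'].tolist())  # ['北京', '上海', '深圳']
```

With the article's data the call would presumably be top_cities(df_city_counts), and the ranked frame could feed the pie chart so that only the largest slices are drawn.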

A histogram of the minimum monthly salary is also generated.

# Histogram of minimum salary
bins = [3000,6000,9000,12000,15000,18000,21000,24000,100000]
plt.figure(figsize=(10,8))
# density=True normalises the bars so their total area is 1
plt.hist(df_clean['zwyx_min'], bins=bins, density=True, histtype='bar', facecolor='g', rwidth=0.8)
plt.title('Hist of min monthly salary in China', size=14)
plt.xlabel('min monthly salary (RMB)')
plt.ylabel('Density')
plt.savefig('salary_quanguo_min.jpg')
plt.show()
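
Because the histogram is drawn with density=True and the last bin is much wider than the others, the bar heights are not simple counts. The sketch below uses numpy.histogram with the same bin edges to show the arithmetic; the salary values are invented for illustration.

```python
import numpy as np

bins = [3000, 6000, 9000, 12000, 15000, 18000, 21000, 24000, 100000]
# Hypothetical minimum salaries, not taken from the scraped data
salaries = np.array([4000, 5000, 8000, 8000, 10000, 13000, 20000, 30000])

counts, edges = np.histogram(salaries, bins=bins)
density, _ = np.histogram(salaries, bins=bins, density=True)
widths = np.diff(edges)

# density = count / (total count * bin width), so the bar areas sum to 1
assert np.allclose(density, counts / (counts.sum() * widths))

print(counts.tolist())  # [2, 2, 1, 1, 0, 1, 0, 1]
```

Two consequences follow: the wide 24000-100000 bin appears very short even when it holds as many postings as a narrow bin, and any value outside the 3000-100000 range is left out of the histogram entirely.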

City‑Specific Analysis (Beijing and Changsha)

Separate DataFrames are created for Beijing and Changsha to examine local salary trends and skill requirements.

# Beijing specific data
df_beijing = df_clean[df_clean['gzdd'].str.contains('北京.*', regex=True)]
df_beijing.to_excel('zhilian_kw_python_bj.xlsx')
print('Beijing rows:', df_beijing.shape[0])

# Changsha specific data
df_changsha = df_clean[df_clean['gzdd'].str.contains('长沙.*', regex=True)]
df_changsha.to_excel('zhilian_kw_python_cs.xlsx')
print('Changsha rows:', df_changsha.shape[0])
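
Once a city subset exists, a quick numeric summary helps compare local salary levels. The following is a sketch only: the helper salary_summary is hypothetical, it assumes the location string begins with the city name, and the toy rows are invented.

```python
import pandas as pd

def salary_summary(df, city):
    """Count, mean and median of the salary bounds for rows located in city."""
    # Assumption: the location string starts with the city name
    subset = df[df['gzdd'].str.startswith(city, na=False)]
    return subset[['zwyx_min', 'zwyx_max']].agg(['count', 'mean', 'median'])

# Hypothetical rows shaped like df_clean (not real results)
toy = pd.DataFrame({
    'gzdd': ['北京-朝阳区', '北京', '长沙', '长沙-岳麓区'],
    'zwyx_min': [10000, 14000, 6000, 8000],
    'zwyx_max': [20000, 22000, 9000, 12000],
})

summary = salary_summary(toy, '北京')
print(summary)
```

Calling the same helper for 长沙 gives a table of the same shape, which makes a side-by-side comparison of the two cities straightforward.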

Skill Word Cloud Generation

The job description field (brief) is concatenated into a text file (brief_quanguo.txt in the listing), tokenised with jieba, and visualised as a word cloud.
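
The article does not show how the concatenated text file is produced. A minimal sketch of one way to do it follows, assuming the descriptions are available as a list of strings that may contain missing values. The helper names join_briefs and write_briefs and the sample strings are hypothetical.

```python
def join_briefs(briefs):
    """Concatenate non-empty description strings, one per line."""
    cleaned = (b.strip() for b in briefs if isinstance(b, str))
    return '\n'.join(b for b in cleaned if b)

def write_briefs(briefs, path='brief_quanguo.txt'):
    """Write the concatenated descriptions to a UTF-8 text file."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(join_briefs(briefs))

# Hypothetical sample; missing and blank entries are skipped
sample = ['熟悉 Python', None, '   ', '了解 Django']
print(join_briefs(sample))
```

With the article's data this might be called as write_briefs(df_clean['brief'].tolist()), and the same helper could be pointed at the Beijing or Changsha subset to build a per-city cloud.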

import jieba
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import os
from PIL import Image
import numpy as np

# Load concatenated brief text
with open('brief_quanguo.txt','rb') as f:
    text = f.read().decode('utf-8')

# Tokenise Chinese text
wordlist = jieba.cut(text, cut_all=False)
wordlist_space_split = ' '.join(wordlist)

# Load mask image for word cloud shape
d = os.path.dirname(__file__)
mask = np.array(Image.open(os.path.join(d,'colors.png')))

wc = WordCloud(background_color='#F0F8FF', max_words=100, mask=mask, max_font_size=300, random_state=42)
wc.generate(wordlist_space_split)
image_colors = ImageColorGenerator(mask)

plt.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis('off')
plt.show()
wc.to_file(os.path.join(d,'brief_quanguo_colors_cloud.png'))
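
A word cloud sizes words roughly by how often they occur, so a plain frequency table is a useful cross-check on the picture. The sketch below counts tokens in a space-separated string like wordlist_space_split using only the standard library. It is an approximation, since WordCloud applies its own tokenisation and stopword handling; the helper top_tokens and the sample string are hypothetical.

```python
from collections import Counter

def top_tokens(space_joined, n=10, min_len=2):
    """Return the n most frequent tokens that are at least min_len characters long."""
    tokens = [t for t in space_joined.split() if len(t) >= min_len]
    return Counter(tokens).most_common(n)

# Hypothetical tokenised text for illustration only
sample = 'python 爬虫 和 django python 数据分析 爬虫 python'
print(top_tokens(sample, n=2))  # [('python', 3), ('爬虫', 2)]
```

Dropping single-character tokens is a crude stand-in for a stopword list; for real descriptions a curated stopword set would likely give a cleaner ranking.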

Conclusion

The analysis reveals that Beijing, Shanghai, and Shenzhen dominate the Python job market in China, with average minimum salaries ranging from 3,000 to 12,000 RMB. The word‑cloud visualisation highlights frequently demanded skills such as "爬虫" (web crawling), "Django", and "数据分析" (data analysis). The provided Python scripts can be adapted to other datasets for similar market‑trend investigations.
