How to Scrape Lagou Python Job Data and Visualize Trends with Python
This tutorial demonstrates how to collect Python job postings from Lagou using Python's requests library, process the JSON response with pandas, and create insightful visualizations—including bar charts, word clouds, and geographic heatmaps—while handling anti‑scraping measures and data cleaning steps.
Full Introduction
This article shows how to gather Python job data from Lagou, then visualize it using Python. It covers web scraping with requests, data extraction with re, and analysis with pandas, matplotlib, wordcloud, and pyecharts.
Web Scraping Section
Lagou's job list is loaded via a POST request. The real URL is
https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0. The search keyword kd and the page number pn are sent as form fields in the request body, not in the URL.
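As a rough illustration of what that POST carries, the sketch below builds the form body for one page using only the standard library. The helper name build_form is hypothetical; the field names first, kd, and pn are the ones used in the script that follows.

```python
from urllib.parse import urlencode


def build_form(keyword, page):
    """Return the form fields for one results page (hypothetical helper)."""
    return {'first': 'false', 'kd': keyword, 'pn': str(page)}


# The URL-encoded body an HTTP client would send for page 2 of the "Python" search.
body = urlencode(build_form('Python', 2))
print(body)       # first=false&kd=Python&pn=2
print(len(body))  # 26
```

The body grows by one byte as soon as pn has two digits, which is one reason to let the HTTP library compute Content-Length instead of hard-coding it.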
import requests
import re
import time
import random
import pandas as pd  # used below to build the DataFrame
# POST URL
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0'
# Headers to bypass anti‑scraping
header = {
    'Host': 'www.lagou.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'zh-CN,en-US;q=0.7,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With': 'XMLHttpRequest',
    'X-Anit-Forge-Token': 'None',
    'X-Anit-Forge-Code': '0',
    # Content-Length is omitted: requests computes it from the form body
    'Cookie': 'user_trace_token=...; JSESSIONID=...; ...',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache'
}
for n in range(1, 31):
    # Form data to submit for each page; pn is the page number, starting at 1
    form = {'first': 'false', 'kd': 'Python', 'pn': str(n)}
    # Pause a random 2 to 5 seconds between requests
    time.sleep(random.randint(2, 5))
    html = requests.post(url, data=form, headers=header)
    # Extract required fields using regex
    data = re.findall('{"companyId":.*?"positionName":"(.*?)","workYear":"(.*?)","education":"(.*?)","jobNature":"(.*?)","financeStage":"(.*?)","companyLogo":".*?","industryField":".*?","city":"(.*?)","salary":"(.*?)","positionId":.*?"positionAdvantage":"(.*?)","companyShortName":"(.*?)","district"', html.text)
    # Convert to DataFrame and append to the CSV
    df = pd.DataFrame(data)
    df.to_csv(r'D:\Windows 7 Documents\Desktop\My\LaGouDataPython.csv', header=False, index=False, mode='a+')
Note: Limit the request rate with time.sleep to avoid being blocked. No login is required.
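Because the endpoint returns JSON, the fields can also be read with a JSON parser instead of a regular expression, which does not depend on the order of keys in the response. A minimal sketch, assuming the job records sit under content, then positionResult, then result: that nesting and the sample payload are assumptions for illustration and should be checked against a real response. The field names are the ones captured by the regex above, and extract_rows is a hypothetical helper.

```python
import json

# Field names taken from the regular expression in the scraping script.
FIELDS = ['positionName', 'workYear', 'education', 'jobNature', 'financeStage',
          'city', 'salary', 'positionAdvantage', 'companyShortName']


def extract_rows(text):
    """Parse a response body and return one tuple per job posting."""
    payload = json.loads(text)
    # Assumed nesting; missing keys yield an empty list instead of raising.
    results = payload.get('content', {}).get('positionResult', {}).get('result', [])
    return [tuple(job.get(name, '') for name in FIELDS) for job in results]


# Made-up sample payload standing in for html.text.
sample = json.dumps({'content': {'positionResult': {'result': [
    {'positionName': 'Python开发', 'workYear': '1-3年', 'education': '本科',
     'jobNature': '全职', 'financeStage': 'A轮', 'city': '北京',
     'salary': '10k-20k', 'positionAdvantage': '弹性工作',
     'companyShortName': 'ExampleCo'}]}}})

rows = extract_rows(sample)
print(rows[0][0], rows[0][6])
```

The resulting list of tuples can be passed to pd.DataFrame in the same way as the regex matches.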
Data Visualization
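After downloading, column names were added to the CSV by hand. The code below relies on the columns 岗位职称 (position title), 工作经验 (work experience), 学历要求 (education requirement), 工作地点 (city), and 工资 (salary).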
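The header row can also be attached in code rather than by hand. A minimal sketch using the standard library: the five column names that the plotting code relies on come from this article, while the remaining four (col_nature, col_stage, col_advantage, col_company) are placeholder names, and the sample row is made up.

```python
import csv
import io

# Order follows the fields captured by the scraping regex.
COLUMNS = ['岗位职称', '工作经验', '学历要求', 'col_nature', 'col_stage',
           '工作地点', '工资', 'col_advantage', 'col_company']

# A made-up, header-less line in the shape the scraper writes.
raw = io.StringIO('Python开发,1-3年,本科,全职,A轮,北京,10k-20k,弹性工作,ExampleCo\n')

# DictReader attaches the names to each row without editing the file.
records = list(csv.DictReader(raw, fieldnames=COLUMNS))
print(records[0]['工作地点'], records[0]['工资'])
```

With pandas, the same list can be passed as names=COLUMNS to pd.read_csv.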
Import required libraries and set plot styles:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud
from pyecharts import Geo  # pyecharts 0.x import path
import matplotlib as mpl
mpl.rcParams["font.sans-serif"] = ["Microsoft YaHei"]
plt.rcParams["axes.labelsize"] = 16.
plt.rcParams["xtick.labelsize"] = 14.
plt.rcParams["ytick.labelsize"] = 14.
plt.rcParams["legend.fontsize"] = 12.
plt.rcParams["figure.figsize"] = [15., 15.]
# Load the scraped CSV; assumes the header row has already been added by hand
data = pd.read_csv(r'D:\Windows 7 Documents\Desktop\My\LaGouDataPython.csv')
Examples of visualizations:
Education requirement bar chart:
data['学历要求'].value_counts().plot(kind='barh', rot=0)
plt.show()
Work experience bar chart:
data['工作经验'].value_counts().plot(kind='bar', rot=0, color='b')
plt.show()
Word cloud of popular Python positions:
final = ''
stopwords = ['PYTHON', 'python', 'Python', '工程师', '(', ')', '/']
for n in range(data.shape[0]):
    # Segment each position title into words with jieba
    seg_list = list(jieba.cut(data['岗位职称'][n]))
    for seg in seg_list:
        if seg not in stopwords:
            final = final + seg + ' '
# final now contains the words for the word cloud
Geographic heatmap of average salaries by city:
# Extract the city and the lower bound of each salary range (in RMB)
data2 = [(city, float(re.split('k|K', salary)[0]) * 1000) for city, salary in zip(data['工作地点'], data['工资'])]
# Convert to DataFrame
data3 = pd.DataFrame(data2)
# Group by city and compute the mean salary once
city_mean = data3.groupby(0)[1].mean()
data4 = list(zip(city_mean.index, city_mean.values))
# Create heatmap with pyecharts (0.x API); the title reads "National Python salary distribution"
geo = Geo("全国Python工资布局", "制作人:挖掘机小王子", title_color="#fff", title_pos="left", width=1200, height=600, background_color="#404a59")
attr, value = geo.cast(data4)
geo.add("", attr, value, type="heatmap", is_visualmap=True, visual_range=[0, 300], visual_text_color="#fff")
geo.render()
Important notes: avoid Chinese characters in the file path when reading the CSV; if pip fails to build wordcloud because Microsoft Visual C++ 14 is missing, install wordcloud manually; and respect the site's anti-scraping limits.
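The heatmap code above keeps only the lower bound of each salary range and averages it per city. A sketch of that parse-then-average step using only the standard library, with a regular expression that skips strings it cannot read. The helper names and the sample rows are made up for illustration, and real salary strings may take forms this pattern does not cover.

```python
import re
from collections import defaultdict


def lower_bound_rmb(salary):
    """Return the lower bound of a string like '10k-20k' in RMB, or None."""
    match = re.match(r'\s*(\d+(?:\.\d+)?)\s*[kK]', salary)
    return float(match.group(1)) * 1000 if match else None


def mean_by_city(rows):
    """Average the parsed salaries per city, skipping unparseable rows."""
    totals = defaultdict(lambda: [0.0, 0])  # city -> [sum, count]
    for city, salary in rows:
        value = lower_bound_rmb(salary)
        if value is not None:
            totals[city][0] += value
            totals[city][1] += 1
    return {city: total / count for city, (total, count) in totals.items()}


# Made-up rows: (city, salary string).
sample = [('北京', '10k-20k'), ('北京', '20K-30K'), ('上海', '15k-25k'), ('上海', '面议')]
print(mean_by_city(sample))
```

Skipping unparseable rows is a design choice of this sketch: a single unexpected salary string does not abort the whole aggregation.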
Author: 挖掘机小王子 (source: zhihu.com/people/WaJueJiPrince)
