How to Scrape Lagou Python Job Data and Visualize Trends with Python

This tutorial shows how to collect Python job postings from Lagou with Python's requests library, extract fields from the JSON response, load them into pandas, and visualize them as bar charts, a word cloud, and a geographic heatmap, while handling anti‑scraping measures and basic data cleaning.

MaGe Linux Operations

May 8, 2018

Full Introduction

This article shows how to gather Python job data from Lagou, then visualize it using Python. It covers web scraping with requests, data extraction with re, and analysis with pandas, matplotlib, wordcloud, and pyecharts.

Web Scraping Section

Lagou's job list is loaded via a POST request. The real URL is

https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0

The form data submitted with the request includes kd (the search keyword) and pn (the page number).

import random
import re
import time

import pandas as pd
import requests

# POST URL
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0'

# Headers copied from a browser session to get past anti‑scraping checks.
# Content-Length is left out: requests computes it from the form body.
header = {
    'Host': 'www.lagou.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'zh-CN,en-US;q=0.7,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With': 'XMLHttpRequest',
    'X-Anit-Forge-Token': 'None',
    'X-Anit-Forge-Code': '0',
    'Cookie': 'user_trace_token=...; JSESSIONID=...; ...',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache'
}

# Pages 1 to 30 (page numbers are assumed to start at 1)
for n in range(1, 31):
    # Form data for each page: kd is the keyword, pn is the page number
    form = {'first': 'false', 'kd': 'Python', 'pn': str(n)}
    # Pause 2 to 5 seconds between requests to avoid being blocked
    time.sleep(random.randint(2, 5))
    html = requests.post(url, data=form, headers=header)
    # Extract required fields using regex
    data = re.findall(r'{"companyId":.*?"positionName":"(.*?)","workYear":"(.*?)","education":"(.*?)","jobNature":"(.*?)","financeStage":"(.*?)","companyLogo":".*?","industryField":".*?","city":"(.*?)","salary":"(.*?)","positionId":.*?"positionAdvantage":"(.*?)","companyShortName":"(.*?)","district"', html.text)
    # Convert to DataFrame and append this page's rows to the CSV (no header row)
    df = pd.DataFrame(data)
    df.to_csv(r'D:\Windows 7 Documents\Desktop\My\LaGouDataPython.csv', header=False, index=False, mode='a+')

Note: Limit request speed with time.sleep to avoid being blocked. No login is required.
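
Because the endpoint returns JSON, the fields can also be read from the parsed response instead of being matched with a regular expression. The sketch below is an illustration, not the author's method. It assumes the postings sit under content, then positionResult, then result (an assumption about the response layout; only the field names come from the regex above). The helper extract_rows and the sample payload are hypothetical.

```python
import json

# Field names taken from the capture groups of the scraper's regex
FIELDS = ['positionName', 'workYear', 'education', 'jobNature',
          'financeStage', 'city', 'salary', 'positionAdvantage',
          'companyShortName']


def extract_rows(payload, fields=FIELDS):
    """Return one tuple per posting, in the same column order as the regex.

    Assumes postings live under content -> positionResult -> result;
    adjust the keys if the real response is laid out differently.
    """
    postings = (payload.get('content', {})
                       .get('positionResult', {})
                       .get('result', []))
    # Missing or null values become empty strings so every row has the same width
    return [tuple(str(p.get(f) or '') for f in fields) for p in postings]


# Made-up sample payload, shaped like the assumed response
sample = json.loads('''
{"content": {"positionResult": {"result": [
  {"positionName": "Python Developer", "workYear": "1-3年",
   "education": "本科", "jobNature": "全职", "financeStage": "A轮",
   "city": "北京", "salary": "10k-20k",
   "positionAdvantage": "flexible hours", "companyShortName": "ExampleCo"}
]}}}
''')

rows = extract_rows(sample)
print(rows[0][0], rows[0][6])
```

In the scraping loop this could be called as extract_rows(html.json()), with the result passed to pd.DataFrame as before.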

Data Visualization

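After the download finishes, column headers are added to the CSV by hand, because the scraper writes rows without a header line. The code in this section uses the columns 岗位职称 (position title), 工作经验 (work experience), 学历要求 (education requirement), 工作地点 (city), and 工资 (salary).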

Import required libraries and set plot styles:

import re

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud
from pyecharts import Geo  # pyecharts 0.x import path
import matplotlib as mpl

mpl.rcParams["font.sans-serif"] = ["Microsoft YaHei"]
plt.rcParams["axes.labelsize"] = 16.
plt.rcParams["xtick.labelsize"] = 14.
plt.rcParams["ytick.labelsize"] = 14.
plt.rcParams["legend.fontsize"] = 12.
plt.rcParams["figure.figsize"] = [15., 15.]
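
The plotting snippets in this section index a DataFrame named data by column name, but the article does not show how it is loaded. Here is a minimal sketch, assuming the scraper's header-less CSV. The five Chinese column names are the ones the article's code uses. The four English names are hypothetical placeholders for the remaining fields, and the inline sample rows are made up so the example runs without the real file.

```python
import io

import pandas as pd

# Column order follows the capture groups of the scraper's regex.
# The Chinese names appear in the article's code; the English ones
# are hypothetical placeholders for columns the article never names.
COLUMNS = ['岗位职称', '工作经验', '学历要求', 'job_nature', 'finance_stage',
           '工作地点', '工资', 'advantage', 'company']

# Made-up sample standing in for the scraped CSV file
sample_csv = io.StringIO(
    'Python开发工程师,1-3年,本科,全职,A轮,北京,10k-20k,弹性工作,ExampleCo\n'
    'Python爬虫工程师,3-5年,大专,全职,B轮,上海,15k-25k,五险一金,SampleInc\n'
)

# header=None because the scraper wrote rows without a header line
data = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(data[['岗位职称', '工作地点', '工资']])
```

With the real file, its path would be passed to pd.read_csv in place of sample_csv.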

Examples of visualizations:

Education requirement bar chart:

data['学历要求'].value_counts().plot(kind='barh', rot=0)
plt.show()

Work experience bar chart:

data['工作经验'].value_counts().plot(kind='bar', rot=0, color='b')
plt.show()

Word cloud of popular Python positions:

# Words to drop from the position titles before building the word cloud
stopwords = {'PYTHON', 'python', 'Python', '工程师', '（', '）', '/'}
words = []
for title in data['岗位职称']:
    # jieba.cut segments a Chinese string into words
    for seg in jieba.cut(title):
        if seg not in stopwords:
            words.append(seg)
# final is the space-separated text for the word cloud
final = ' '.join(words)
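
A word cloud sizes each word by how often it occurs, so it can help to inspect those counts directly. The sketch below is an illustration under assumptions: word_frequencies is a hypothetical helper, and the sample string stands in for the final text built above. It uses only the standard library.

```python
from collections import Counter


def word_frequencies(text):
    """Count whitespace-separated words in text."""
    return Counter(text.split())


# Made-up sample standing in for the `final` string built above
sample_final = '开发 后端 开发 数据 分析 开发 数据 '
freqs = word_frequencies(sample_final)

# Most frequent words first
for word, count in freqs.most_common(2):
    print(word, count)
```

WordCloud.generate(final) does its own counting. A frequency mapping like this one can instead be passed to WordCloud.generate_from_frequencies. For Chinese text, WordCloud generally needs a font_path that points to a font with Chinese glyphs.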

Geographic heatmap of average salaries by city:

# Extract (city, lower bound of the salary range in RMB) for each posting;
# a salary such as '10k-20k' becomes 10000. Requires `import re`.
data2 = [(city, int(re.split('k|K', salary)[0]) * 1000)
         for city, salary in zip(data['工作地点'], data['工资'])]
# Convert to DataFrame (column 0 is the city, column 1 the salary)
data3 = pd.DataFrame(data2)
# Group by city once and compute the mean salary per city
mean_salary = data3.groupby(0)[1].mean()
data4 = list(mean_salary.items())
# Create heatmap with pyecharts (Geo as used here is the pyecharts 0.x API)
# Title: "Nationwide Python salary distribution"; subtitle: "Made by: 挖掘机小王子"
geo = Geo("全国Python工资布局", "制作人:挖掘机小王子", title_color="#fff", title_pos="left", width=1200, height=600, background_color="#404a59")
attr, value = geo.cast(data4)
geo.add("", attr, value, type="heatmap", is_visualmap=True, visual_range=[0, 300], visual_text_color="#fff")
geo.render()
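
The heatmap code keeps only the lower bound of each salary range. As a possible variation (not the author's method), the sketch below parses both ends of a string such as '10k-20k' so that a midpoint can be used instead. salary_bounds and salary_midpoint are hypothetical helpers, and they assume salaries are written as whole thousands followed by k or K.

```python
import re


def salary_bounds(salary):
    """Return (low, high) in RMB for a string like '10k-20k'.

    If only one figure is present, low and high are equal.
    Returns None when no figure followed by k/K is found.
    """
    figures = [int(n) * 1000 for n in re.findall(r'(\d+)\s*[kK]', salary)]
    if not figures:
        return None
    return figures[0], figures[-1]


def salary_midpoint(salary):
    """Midpoint of the range in RMB, or None if it cannot be parsed."""
    bounds = salary_bounds(salary)
    if bounds is None:
        return None
    low, high = bounds
    return (low + high) / 2


print(salary_bounds('10k-20k'))
print(salary_midpoint('10K-20K'))
```

Using the midpoint raises every city's average compared with the lower-bound approach, so the two versions of the heatmap are not directly comparable.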

Important notes: avoid Chinese characters in the file path when reading the CSV; if pip fails to install wordcloud because Microsoft Visual C++ 14.0 is missing, install wordcloud manually; and respect the site's anti‑scraping limits.

Author: 挖掘机小王子 (source: zhihu.com/people/WaJueJiPrince)
