Python Web Scraping Tutorial: Extracting Job Listings from Lagou.com and Saving to Excel
This tutorial demonstrates how to use Python, requests, and openpyxl to analyze a job‑listing website, construct the appropriate Ajax URLs, write a scraper that parses JSON responses, and export the collected position data into an Excel workbook.
In many work scenarios we need to gather information from the web, and manually searching and organizing it can be time‑consuming; Python web‑scraping can automate this process.
The example focuses on extracting job data from the Chinese recruitment site Lagou.com. The environment used is Windows 10, Python 3, and Jupyter Notebook.
Step 1: Analyze the web page – Modern sites often load content via Ajax, so the initial HTML may not contain the job listings at all. In Chrome's developer tools, open the Network panel, filter by XHR, and locate the request whose response is JSON containing the job listings.
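To make the target of this step concrete, the sketch below walks a small payload shaped like the XHR response. The nesting (content, positionResult, result) and the field names are the ones the scraper script in Step 3 reads; the sample values are invented for illustration, and list_positions is a hypothetical helper name.

```python
import json

# Hypothetical sample: the structure follows the keys read by the scraper,
# but the values are invented for illustration.
sample_response = '''
{
  "content": {
    "positionResult": {
      "result": [
        {"positionId": 1, "positionName": "Python Engineer", "salary": "15k-25k"},
        {"positionId": 2, "positionName": "Data Analyst", "salary": "10k-18k"}
      ]
    }
  }
}
'''

def list_positions(raw_json):
    """Return the list of job records nested inside the JSON payload."""
    payload = json.loads(raw_json)
    return payload['content']['positionResult']['result']

positions = list_positions(sample_response)
for job in positions:
    print(job['positionName'], job['salary'])
```

Each element of the returned list is a dictionary describing one job, which is why the scraper can read fields by key.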
Step 2: Construct the request URL – By examining the request headers and parameters, identify the fixed part of the URL (e.g., http://www.lagou.com/jobs/positionAjax.json) and the variable request parameters: city, pn (page number), and kd (search keyword). Paging through the results only requires changing pn.
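As a rough sketch of this step, the snippet below separates the fixed endpoint from the variable parameters and shows how they are encoded for transmission. It uses only the standard library and sends nothing over the network. The helper name build_form is hypothetical; the endpoint and the parameter names (first, pn, kd, city) are the ones used by the scraper script in Step 3.

```python
from urllib.parse import urlencode

# Fixed part of the request, as identified in the network inspector.
BASE_URL = 'http://www.lagou.com/jobs/positionAjax.json'

def build_form(keyword, page, city):
    """Hypothetical helper: collect the variable parameters for one page."""
    return {'first': 'true', 'pn': page, 'kd': keyword, 'city': city}

form = build_form('Python', 1, '北京')
# urlencode percent-encodes non-ASCII values as UTF-8 bytes.
encoded = urlencode(form)
print(BASE_URL)
print(encoded)
```

Because the encoded length changes with the keyword and city, a Content-Length value copied from one browser request will not generally match other requests.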
Step 3: Write the scraper script – The script sends POST requests with appropriate headers, parses the returned JSON, extracts fields like company ID, name, size, industry, position title, salary, etc., and stores them in a list.
import requests
from openpyxl import Workbook

# HTTP request headers, copied from the browser's XHR request.
# Content-Length is left out: requests computes it from the form body.
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': '... (omitted for brevity) ...',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_Python?',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Anit-Forge-Code': '0',
    'X-Anit-Forge-Token': 'None',
    'X-Requested-With': 'XMLHttpRequest'
}

def get_json(url, page, lang_name):
    # Form data for the POST request; the city "北京" is Beijing
    data = {'first': "true", 'pn': page, 'kd': lang_name, 'city': "北京"}
    # POST request
    json_resp = requests.post(url, data, headers=headers).json()
    # The job records are nested three levels deep in the response
    list_con = json_resp['content']['positionResult']['result']
    info_list = []
    for i in list_con:
        info = []
        info.append(i['companyId'])
        info.append(i['companyFullName'])
        info.append(i['companyShortName'])
        info.append(i['companySize'])
        info.append(str(i['companyLabelList']))
        info.append(i['industryField'])
        info.append(i['financeStage'])
        info.append(i['positionId'])
        info.append(i['positionName'])
        info.append(i['positionAdvantage'])
        info.append(i['city'])
        info.append(i['district'])
        info.append(i['salary'])
        info.append(i['education'])
        info.append(i['workYear'])
        info_list.append(info)
    return info_list

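def main():
    lang_name = input('职位名:')  # prompt: "Job title:"
    page = 1
    url = 'http://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
    info_result = []
    # Column headers: company ID, full company name, short company name,
    # company size, company labels, industry, financing stage, position ID,
    # position title, position advantages, city, district, salary,
    # education, work experience
    title = ['公司ID','公司全名','公司简称','公司规模','公司标签','行业领域','融资情况','职位编号','职位名称','职位优势','城市','区域','薪资水平','教育程度','工作经验']
    info_result.append(title)
    while page < 31:  # pages 1 through 30
        info = get_json(url, page, lang_name)
        info_result = info_result + info
        page += 1
    # Write the header row and all collected records to a workbook
    wb = Workbook()
    ws1 = wb.active
    ws1.title = lang_name
    for row in info_result:
        ws1.append(row)
    wb.save('职位信息3.xlsx')  # file name means "job information 3"

main()

After the loop finishes (30 pages, 15 entries per page, so up to 450 records), the script writes all collected records into an Excel file, which can be opened to verify that the job data has been saved.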
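The repeated info.append(...) calls in get_json can also be driven by a list of field names. The sketch below is one possible variant, not the original author's code: FIELDS and record_to_row are hypothetical names, the sample record is invented, and using dict.get is an assumption about how missing keys should be handled (an empty cell instead of a KeyError). Unlike the original, it converts only list or dict values to strings.

```python
# Field names in column order, matching the keys read in get_json.
FIELDS = [
    'companyId', 'companyFullName', 'companyShortName', 'companySize',
    'companyLabelList', 'industryField', 'financeStage', 'positionId',
    'positionName', 'positionAdvantage', 'city', 'district',
    'salary', 'education', 'workYear',
]

def record_to_row(record):
    """Hypothetical helper: turn one job record (a dict) into a flat row."""
    row = []
    for field in FIELDS:
        value = record.get(field, '')  # empty cell if the key is absent
        if isinstance(value, (list, dict)):
            value = str(value)  # store nested values as plain text
        row.append(value)
    return row

# Invented sample record with most fields missing.
sample = {'companyId': 42, 'companyLabelList': ['a', 'b'], 'salary': '15k-25k'}
row = record_to_row(sample)
print(row)
```

Each row produced this way has one value per column header and can be passed to ws1.append(row) in the same way as the lists built in get_json.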
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.