Backend Development 6 min read

Scrape Shandong Government Data with Python: Full Code Walkthrough

This article walks through a real‑world Python web‑scraping case, showing how to retrieve disease‑pest data from the Shandong government site using custom headers, cookies, and request parameters, and provides two complete code examples that successfully return the desired JSON results.

Python Crawling & Data Mining

May 1, 2023

Scrape Shandong Government Data with Python: Full Code Walkthrough

大家好，我是皮皮。

一、前言

前几天在Python铂金群问了一个关于 Python 网络爬虫处理的问题，这里分享解决过程。

二、实现过程

从返回的响应状态码来看是200，说明数据已获取，但在处理时可能出现问题。

粉丝截图了网页的数据，如下图所示：

以下是瑜亮老师提供的代码：

import requests

headers = {
    "Connection": "keep-alive",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Referer": "http://nync.shandong.gov.cn/site/nynct/search.html?searchWord=%E7%97%85%E8%99%AB%E5%AE%B3%E5%8F%91%E7%97%85%E9%9D%A2%E7%A7%AF&siteId=4&pageSize=10",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Cookie": "这里放你的cookie"
}
url = "http://nync.shandong.gov.cn/igs/front/search.jhtml"
params = {
    "code": "1e80abdd60da4e7d8afeacc2e6c5a79d",
    "pageSize": "10",
    "queryAll": "true",
    "searchWord": "病虫害发病面积",
    "siteId": "4"
}
response = requests.get(url, headers=headers, cookies=cookies, params=params, verify=False)
resp = json.loads(response.text.replace('<em>', '').replace('</em>', '').replace(' ', '').replace('\
', '').replace('\u3000', ''))
results = resp["page"]["content"]
hrefs = [i['url'] for i in results]
print(hrefs)

该代码成功获取了所需数据。

另一个实现方案来自 eric：

import requests

headers = {
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
    "Connection": "keep-alive",
    "Referer": "http://nync.shandong.gov.cn/site/nynct/search.html?searchWord=%E7%97%85%E8%99%AB%E5%AE%B3%E5%8F%91%E7%97%85%E9%9D%A2%E7%A7%AF&siteId=4&pageSize=10",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48"
}
cookies = {
    "JSESSIONID": "494022BEAC070E72F5F986E7E888ACAE",
    "token": "3c8f42b3-2e8b-4b5d-8f41-34238d7f9d22",
    "uuid": "3c8f42b3-2e8b-4b5d-8f41-34238d7f9d22"
}
url = "http://nync.shandong.gov.cn/igs/front/statistics/find_search_click_ranking.jhtml"
params = {"siteId": "4"}
response = requests.get(url, headers=headers, cookies=cookies, params=params, verify=False)
print(response.json())

该方案同样能够获取对应数据。

三、总结

本文展示了针对 Python 网络爬虫的具体问题，提供了两段完整代码实现，帮助读者成功获取山东省病虫害发病面积的统计数据。

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Backend Development data extraction Web Scraping requests

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.