
How to Scrape Baidu for Grain Policy Articles with Python: A Step‑by‑Step Guide

This article walks you through building a Python web‑scraper that queries Baidu for the keyword “grain,” extracts article titles, abstracts and links using Requests and lxml, saves the results to a text file, and shows the execution output with screenshots.

Python Crawling & Data Mining

Feb 17, 2021


Introduction

Hello, I am Cui Yanfei. A direct Baidu search often returns a flood of results mixed with ads, and filtering them by hand is time-consuming. A colleague needed the titles and links of articles about grain policy from Baidu, so I took the opportunity to write a small web scraper in Python.

Project Goal

Scrape Baidu search results for the keyword “grain,” store the titles, abstracts and URLs, and deliver the data for further analysis of China’s grain policies.

Project Preparation

Software: PyCharm

Required libraries: json (part of the Python standard library), requests, lxml (etree)

Project Analysis

1) How to perform the keyword search?

Use the requests library to send a GET request to the Baidu search URL.

https://www.baidu.com/s?wd=粮食
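As a rough illustration of what happens to the keyword in that URL: non-ASCII characters are percent-encoded as UTF-8 before the request is sent. The sketch below builds the same query string with the standard library's urlencode; requests performs equivalent encoding when the keyword is passed through its params argument. The helper name build_search_url is made up for this example.

```python
from urllib.parse import urlencode

BAIDU_SEARCH = "https://www.baidu.com/s"  # search endpoint used in the article


def build_search_url(keyword):
    """Return the search URL with the keyword percent-encoded as UTF-8."""
    # build_search_url is a hypothetical helper, not part of any library.
    return BAIDU_SEARCH + "?" + urlencode({"wd": keyword})


print(build_search_url("粮食"))
```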

2) How to obtain titles and links?

Parse the returned HTML with lxml.etree and locate the desired elements via XPath expressions.
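To make the XPath step concrete, here is a minimal sketch that runs the same kind of queries against a tiny hand-written page instead of a live Baidu response. The sample markup and the example.com links are invented for illustration; real result pages are more complex and their structure may change.

```python
from lxml import etree

# Hypothetical stand-in for a downloaded results page.
SAMPLE = """
<html><body>
  <h3 class="t"><a href="https://example.com/a">Grain <em>policy</em> update</a></h3>
  <h3 class="t"><a href="https://example.com/b">Reserve rules</a></h3>
</body></html>
"""

html = etree.HTML(SAMPLE)

# string(.) concatenates all descendant text, so the nested <em> is flattened.
titles = [h3.xpath("string(.)").strip() for h3 in html.xpath("//h3")]
# Selecting an attribute returns plain strings rather than elements.
links = html.xpath('//h3[@class="t"]/a/@href')

print(titles)
print(links)
```

Using string(.) rather than text() matters here because search engines often wrap the matched keyword in an inline tag, which would otherwise split the title into several text nodes.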

3) How to save the results?

Create a .txt file and write the extracted data in a loop.
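The saving step can be sketched on its own with made-up records. Each field is written as one JSON-encoded line, which keeps Chinese text readable (ensure_ascii=False) and lets every line be decoded again later with json.loads. The temporary directory and the sample records are assumptions for the demonstration only.

```python
import json
import os
import tempfile

# Made-up records standing in for scraped (title, abstract, link) results.
records = [
    ("Title one", "Abstract one", "https://example.com/1"),
    ("标题二", "摘要二", "https://example.com/2"),
]

path = os.path.join(tempfile.mkdtemp(), "ok.txt")

# Open the file once and write one JSON-encoded field per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        for field in record:
            f.write(json.dumps(field, ensure_ascii=False) + "\n")

# Reading back: every line decodes to the original string.
with open(path, encoding="utf-8") as f:
    restored = [json.loads(line) for line in f]

print(restored)
```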

Implementation

Step 1: Import required libraries

import json
import requests
from lxml import etree

Step 2: Send the search request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
}
response = requests.get('https://www.baidu.com/s?wd=粮食&lm=1', headers=headers)
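The request above assumes the server always answers successfully. A slightly more defensive variant, offered as a sketch rather than the author's method, adds a timeout and an explicit status check. The function name fetch_html, the injectable get argument, and the shortened User-Agent value are all assumptions introduced for this example.

```python
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened placeholder value


def fetch_html(keyword, get=requests.get, timeout=10):
    """Fetch one results page and return its HTML text.

    `get` is injectable so the function can be exercised without network access.
    """
    response = get(
        "https://www.baidu.com/s",
        params={"wd": keyword, "lm": 1},  # same parameters as the URL above
        headers=HEADERS,
        timeout=timeout,  # avoid hanging forever on a stalled connection
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text
```

Calling fetch_html("粮食") would then return the page source, or raise a requests exception if the request fails.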

Step 3: Parse the HTML and locate resources

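r = response.text
html = etree.HTML(r, etree.HTMLParser())
# Result titles are rendered as <h3> elements
r1 = html.xpath('//h3')
# Abstract text sits in elements with class "c-abstract"
r2 = html.xpath('//*[@class="c-abstract"]')
# Result links are the href of the <a> inside elements with class "t"
r3 = html.xpath('//*[@class="t"]/a/@href')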
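The three queries above return independent lists, so a result that lacks an abstract would shift every later abstract onto the wrong title. One possible way around this, sketched here on invented markup, is to select each result's container first and then query inside it with relative paths. The "result" container class is an assumption for this example and would need to be checked against the actual page source.

```python
from lxml import etree

# Hypothetical markup; the "result" container class is an assumption.
SAMPLE = """
<div class="result"><h3 class="t"><a href="https://example.com/1">First</a></h3>
  <div class="c-abstract">Summary of the first hit.</div></div>
<div class="result"><h3 class="t"><a href="https://example.com/2">Second</a></h3></div>
"""

html = etree.HTML(SAMPLE)
rows = []
for block in html.xpath('//div[@class="result"]'):
    # Paths starting with "." are evaluated relative to this block only.
    title = block.xpath("string(.//h3)").strip()
    # string() of an empty node-set is "", so a missing abstract is harmless.
    abstract = block.xpath('string(.//*[@class="c-abstract"])').strip()
    link = block.xpath('.//h3[@class="t"]/a/@href')
    rows.append((title, abstract, link[0] if link else ""))

print(rows)
```

With this layout the second result simply gets an empty abstract instead of borrowing one from its neighbor.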

Step 4: Loop through results and save them

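# Open the output file once. zip() stops at the shortest list, so a page
# with fewer than 10 results no longer raises IndexError.
with open('ok.txt', 'a', encoding='utf-8') as c:
    for h3, abstract, r33 in list(zip(r1, r2, r3))[:10]:
        r11 = h3.xpath('string(.)')        # full title text
        r22 = abstract.xpath('string(.)')  # full abstract text
        # One JSON-encoded value per line: title, abstract, link
        c.write(json.dumps(r11, ensure_ascii=False) + '\n')
        c.write(json.dumps(r22, ensure_ascii=False) + '\n')
        c.write(json.dumps(r33, ensure_ascii=False) + '\n')
        print(r11)
        print('------------------------')
        print(r22)
        print(r33)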

Result Demonstration

[Screenshot: program execution output]

[Screenshot: preview of the saved .txt file]

Conclusion

This tutorial showed how to fetch Baidu search results with requests, extract titles, abstracts and links with lxml XPath queries, and store them in a local text file. With a handful of library calls, Python makes it quick to build a practical data-collection tool.
