
How to Scrape Baidu Keywords and Links with Python and XPath

This tutorial walks through a complete Python script that uses requests and lxml's XPath support to fetch Baidu search result titles and URLs and save them to a CSV file, with sample output screenshots for verification.

Python Crawling & Data Mining

May 5, 2022

In previous articles the author extracted Baidu keywords and links using regular expressions and BeautifulSoup; this article introduces extracting the same data with XPath.

Implementation Process

The full Python code below sends a request to Baidu, parses the HTML with lxml, extracts titles and URLs via XPath, and writes the results to a CSV file.

# coding:utf-8

# @Time : 2022/4/21 15:03
# @Author: PiPi
# @Public Account: Python Sharing Home
# @Website : http://pdcfighting.com/
# @File : BaiduKeywordCrawler(xpath).py
# @Software: PyCharm

import requests
from fake_useragent import UserAgent
from lxml import etree

def get_web_page(wd, pn):
    url = 'https://www.baidu.com/s'
    ua = UserAgent()
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'User-agent': ua.random,
        'Cookie': '... (omitted for brevity) ...',
        'Host': 'www.baidu.com'
    }
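    # wd is the search keyword; pn is the offset of the first result to return.
    params = {'wd': wd, 'pn': pn}
    # A timeout keeps the script from hanging if the server never responds.
    response = requests.get(url, headers=headers, params=params, timeout=10)
    response.encoding = 'utf-8'
    return response.text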

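def parse_page(response):
    html = etree.HTML(response)
    # Select the result blocks the script targets.
    selectors = html.xpath('//div[@class="c-container"]')
    data = []
    nub = 0
    for selector in selectors:
        # Join all text nodes so titles containing nested tags stay intact.
        title = "".join(selector.xpath('.//h3/a//text()'))
        hrefs = selector.xpath('.//h3/a/@href')
        if not hrefs:
            # Skip blocks without a title link instead of raising IndexError.
            continue
        titleUrl = hrefs[0]
        print(title)
        print(titleUrl)
        nub += 1
        data.append([title, titleUrl])
    print(f"This page has {nub} title and URL entries in total!")
    return data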

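def save_data(datas, kw, page):
    # Open the file once in append mode, then write one entry per line.
    with open(f'./Baidu_{kw}_page_{page}_data(xpath).csv', 'a', encoding='utf-8') as fp:
        for data in datas:
            fp.write(str(data) + '\n')
    print(f"The data for page {page} of Baidu results for {kw} has been saved successfully!")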

def main():
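    kw = input("Enter the keyword to search for: ").strip()
    page = input("Enter the page number to query: ").strip()
    # Convert the 1-based page number to the pn offset: page 1 -> 0, page 2 -> 10.
    page_pn = int(page)
    page_pn = str(page_pn * 10 - 10)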
    resp = get_web_page(kw, page_pn)
    datas = parse_page(resp)
    save_data(datas, kw, page)

if __name__ == '__main__':
    main()
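
The page-to-offset arithmetic in main() can be looked at in isolation. The sketch below assumes, as the script does, that each results page holds 10 entries and that pn is a zero-based offset. page_to_pn is a hypothetical helper name and is not part of the original script.

```python
def page_to_pn(page: int, per_page: int = 10) -> int:
    """Convert a 1-based page number into a zero-based result offset.

    Hypothetical helper; per_page=10 mirrors the `page * 10 - 10`
    arithmetic used in main().
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * per_page


if __name__ == "__main__":
    for page in (1, 2, 3):
        # Show the offset that would be sent as the pn parameter.
        print(page, page_to_pn(page))
```

Validating the page number up front also avoids sending a negative offset when a user enters 0.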

Running the script prints each extracted title and URL to the console; the original article shows this output in a screenshot.
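
To see how the XPath expressions behave without sending a request, the extraction step can be tried on a small static snippet. The markup below is invented for illustration and is not real Baidu HTML, and extract_results is a hypothetical name. Unlike the script, this sketch matches c-container as one class token among several and skips blocks that have no title link; both are assumptions about what a more defensive version might do.

```python
from lxml import etree

# Invented sample markup for illustration only; not real Baidu HTML.
SAMPLE_HTML = """
<html><body>
  <div class="result c-container">
    <h3><a href="https://example.com/a">Python <em>XPath</em> tutorial</a></h3>
  </div>
  <div class="c-container">
    <h3>Result without a link</h3>
  </div>
  <div class="sidebar">
    <h3><a href="https://example.com/ignored">Not a result</a></h3>
  </div>
</body></html>
"""

# Match divs whose class attribute contains the token "c-container",
# even when other class names are present.
RESULT_XPATH = (
    '//div[contains(concat(" ", normalize-space(@class), " "), " c-container ")]'
)


def extract_results(html_text):
    """Return (title, url) pairs from result blocks that have a title link."""
    tree = etree.HTML(html_text)
    results = []
    for block in tree.xpath(RESULT_XPATH):
        # Relative paths (leading ".") keep the search inside this block.
        title = "".join(block.xpath('.//h3/a//text()')).strip()
        hrefs = block.xpath('.//h3/a/@href')
        if not hrefs:
            continue  # no title link in this block
        results.append((title, hrefs[0]))
    return results


if __name__ == "__main__":
    for title, url in extract_results(SAMPLE_HTML):
        print(title, url)
```

Only the first block yields a pair here: the second has no link and the third is not a result block.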

The script also writes the collected data to a CSV file in the working directory; the original article illustrates the file with a screenshot.
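
Because the script writes each row as the string form of a Python list, the file is not strictly comma-separated. A minimal sketch of an alternative follows, assuming the standard library's csv module is acceptable here. write_rows and the title/url header row are hypothetical choices and are not part of the original script.

```python
import csv
import io


def write_rows(fp, rows):
    """Write a header plus (title, url) rows to an open text file object."""
    writer = csv.writer(fp)
    writer.writerow(["title", "url"])  # hypothetical header row
    writer.writerows(rows)


if __name__ == "__main__":
    rows = [["Python, XPath and lxml", "https://example.com/a"]]

    # An in-memory buffer stands in for a file so the sketch has no side effects.
    buffer = io.StringIO()
    write_rows(buffer, rows)
    print(buffer.getvalue())

    # With a real file, the call would look like:
    # with open("results.csv", "w", newline="", encoding="utf-8") as fp:
    #     write_rows(fp, rows)
```

The csv writer quotes fields that contain commas, so a title such as the one above still reads back as a single column. When writing to a real file, the csv documentation recommends opening it with newline='' so that row endings are not doubled on Windows.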

Conclusion

This article shares a functional Python web‑scraping solution that leverages requests, fake_useragent, and lxml XPath to retrieve Baidu search result titles and links, complementing earlier approaches based on regular expressions and BeautifulSoup.
