Master Baidu Keyword Scraping with Python and XPath – Step-by-Step Guide

Learn how to scrape Baidu search result titles and URLs using Python's requests library and lxml's xpath parsing, complete with a ready-to-run script, CSV output, and step-by-step explanation that builds on earlier regex and BeautifulSoup methods.

Python Crawling & Data Mining

1. Introduction

In previous articles we extracted Baidu keywords and links using regular expressions and bs4. This article demonstrates how to achieve the same extraction using xpath with Python.
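Before the full script, here is a minimal sketch of the core idea: feed HTML to `lxml.etree.HTML` and extract titles and links with xpath expressions. The snippet below uses a hypothetical HTML fragment that only mimics the structure of Baidu result blocks, not live Baidu markup.

```python
from lxml import etree

# Hypothetical HTML mimicking the structure of Baidu result blocks;
# this is NOT live Baidu markup, just an illustration of the xpath calls.
snippet = """
<div class="c-container">
  <h3><a href="https://example.com/1">First <em>result</em> title</a></h3>
</div>
<div class="c-container">
  <h3><a href="https://example.com/2">Second result title</a></h3>
</div>
"""

html = etree.HTML(snippet)
results = []
for block in html.xpath('//div[@class="c-container"]'):
    # a//text() joins text nodes split across child tags such as <em>
    title = "".join(block.xpath('.//h3/a//text()'))
    url = block.xpath('.//h3/a/@href')[0]
    results.append((title, url))

print(results)
```

The `.//h3/a//text()` expression is the key trick: it collects every text node under the link, so titles broken up by inline tags come back in one piece.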

2. Implementation

Below is the complete script.

# -*- coding: utf-8 -*-
# @Time    : 2022/4/21 15:03
# @Author  : Pipi (皮皮), WeChat official account: Python共享之家
# @Website : http://pdcfighting.com/
# @File    : 百度关键词爬虫(xpath解析).py
# @Software: PyCharm

import requests
from fake_useragent import UserAgent
from lxml import etree


def get_web_page(wd, pn):
    url = 'https://www.baidu.com/s'
    ua = UserAgent()
    # print(ua)
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'User-Agent': ua.random,
        'Cookie': 'BAIDUID=... (cookie omitted for brevity) ...',
        'Host': 'www.baidu.com'
    }
    params = {
        'wd': wd,
        'pn': pn
    }
    response = requests.get(url, headers=headers, params=params)
    response.encoding = 'utf-8'
    return response.text


def parse_page(response):
    html = etree.HTML(response)
    selectors = html.xpath('//div[@class="c-container"]')
    data = []
    for selector in selectors:
        # a//text() gathers the title even when it is split across child tags
        title = "".join(selector.xpath('.//h3/a//text()'))
        links = selector.xpath('.//h3/a/@href')
        if not links:  # skip result blocks without a link (e.g. special cards)
            continue
        title_url = links[0]
        print(title)
        print(title_url)
        data.append([title, title_url])
    print(f"This page contains {len(data)} title/URL pairs!")
    return data


def save_data(datas, kw, page):
    # Open the file once and append one row per result
    with open(f'./baidu_{kw}_page_{page}_xpath.csv', 'a', encoding='utf-8') as fp:
        for data in datas:
            fp.write(str(data) + '\n')
    print(f"Page {page} of Baidu results for '{kw}' saved successfully!")


def main():
    kw = input("Enter the keyword to search for: ").strip()
    page = input("Enter the page number to fetch: ").strip()
    # Baidu's pn parameter is an offset: page 1 -> 0, page 2 -> 10, ...
    page_pn = str((int(page) - 1) * 10)
    resp = get_web_page(kw, page_pn)
    datas = parse_page(resp)
    save_data(datas, kw, page)

if __name__ == '__main__':
    main()
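A note on the pagination arithmetic in `main()`: Baidu's `pn` query parameter is a result offset rather than a page number, so page 1 corresponds to `pn=0`, page 2 to `pn=10`, and so on. A small helper (the name `page_to_pn` is my own, not from the script) makes the mapping explicit:

```python
def page_to_pn(page: int) -> int:
    """Map a 1-based results page number to Baidu's pn offset parameter."""
    return (page - 1) * 10

print(page_to_pn(1))  # 0
print(page_to_pn(3))  # 20
```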

When run, the script prints the extracted titles and URLs to the console and writes them to a local CSV file named after the keyword and page number.
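One caveat: `save_data` writes the Python `str()` of each `[title, url]` list, which is not true comma-separated output. A sketch using the standard-library `csv` module would produce a file other tools can parse directly; the function name and header row here are my own suggestion, not part of the original script.

```python
import csv

def save_data_csv(datas, kw, page):
    """Write [title, url] pairs to a real CSV file with a header row."""
    filename = f'baidu_{kw}_page_{page}_xpath.csv'
    # utf-8-sig adds a BOM so Excel opens non-ASCII titles correctly
    with open(filename, 'w', newline='', encoding='utf-8-sig') as fp:
        writer = csv.writer(fp)
        writer.writerow(['title', 'url'])
        writer.writerows(datas)
    return filename

path = save_data_csv([['Example title', 'https://example.com']], 'python', 1)
print(path)
```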

3. Conclusion

This article shares a functional Python web‑scraping script that fetches Baidu search result titles and URLs via xpath. It complements earlier tutorials that used regex and bs4, encouraging readers to try the method and explore further.

Tags: Python, CSV, web-scraping, Baidu, XPath, lxml
Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
