Master Baidu Keyword Scraping with Python and XPath – Step-by-Step Guide
Learn how to scrape Baidu search-result titles and URLs using Python's requests library and lxml's XPath parsing, complete with a ready-to-run script, CSV output, and a step-by-step explanation that builds on earlier regex and BeautifulSoup methods.
1. Introduction
In previous articles we extracted Baidu keywords and links using regular expressions and BeautifulSoup (bs4). This article demonstrates how to achieve the same extraction with XPath in Python.
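For readers new to lxml, XPath selects nodes from a parsed HTML tree with path expressions; the script below relies on exactly two of them. Here is a minimal, self-contained sketch of the pattern (the HTML snippet and class name are illustrative stand-ins, not real Baidu markup):

```python
from lxml import etree

# Illustrative snippet mimicking one search-result container
html_text = '''
<div class="c-container">
  <h3><a href="https://example.com/1">First <em>result</em></a></h3>
</div>
'''

tree = etree.HTML(html_text)
for node in tree.xpath('//div[@class="c-container"]'):
    # join all text fragments under the <a>, including <em> children
    title = "".join(node.xpath('.//h3/a//text()'))
    url = node.xpath('.//h3/a/@href')[0]
    print(title, url)
```

The `//text()` axis matters here: Baidu wraps matched keywords in `<em>` tags, so taking only the anchor's direct text would drop part of the title.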
2. Implementation
Below is the complete script.
# coding: utf-8
# @Time : 2022/4/21 15:03
# @Author : 皮皮
# @WeChat Official Account : Python共享之家
# @Website : http://pdcfighting.com/
# @File : 百度关键词爬虫(xpath解析).py
# @Software : PyCharm
import requests
from fake_useragent import UserAgent
from lxml import etree


def get_web_page(wd, pn):
    """Fetch one page of Baidu search results for keyword wd at offset pn."""
    url = 'https://www.baidu.com/s'
    ua = UserAgent()
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'User-Agent': ua.random,  # a fresh random User-Agent on every request
        'Cookie': 'BAIDUID=... (cookie omitted for brevity) ...',
        'Host': 'www.baidu.com'
    }
    params = {
        'wd': wd,  # search keyword
        'pn': pn   # zero-based result offset: (page - 1) * 10
    }
    response = requests.get(url, headers=headers, params=params)
    response.encoding = 'utf-8'
    return response.text


def parse_page(response):
    """Extract (title, URL) pairs from the result page with XPath."""
    html = etree.HTML(response)
    selectors = html.xpath('//div[@class="c-container"]')
    data = []
    for selector in selectors:
        # join all text fragments under the <a>, including highlighted <em> parts
        title = "".join(selector.xpath('.//h3/a//text()'))
        title_url = selector.xpath('.//h3/a/@href')[0]
        print(title)
        print(title_url)
        data.append([title, title_url])
    print(f"This page contains {len(data)} title/URL records in total!")
    return data


def save_data(datas, kw, page):
    """Append the collected records to a per-page CSV file."""
    with open(f'./Baidu_{kw}_page_{page}_data(xpath).csv', 'a', encoding='utf-8') as fp:
        for data in datas:
            fp.write(str(data) + '\n')
    print(f"Page {page} of Baidu data for '{kw}' saved successfully!")


def main():
    kw = input("Enter the keyword to search for: ").strip()
    page = input("Enter the page number to fetch: ").strip()
    pn = str(int(page) * 10 - 10)  # Baidu paginates by result offset, 10 per page
    resp = get_web_page(kw, pn)
    datas = parse_page(resp)
    save_data(datas, kw, page)


if __name__ == '__main__':
    main()

The script runs successfully and prints the extracted titles and URLs.
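One detail worth calling out is the page-to-offset conversion in main(): Baidu's pn query parameter is not a page number but a zero-based result offset, with ten organic results per page. The arithmetic the script uses can be isolated as:

```python
def page_to_pn(page: int) -> int:
    # Baidu's pn parameter is the zero-based result offset;
    # page 1 -> 0, page 2 -> 10, page 5 -> 40.
    return page * 10 - 10
```

So requesting "page 3" means sending pn=20, which is exactly what `int(page) * 10 - 10` computes.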
It also automatically creates a CSV file locally to store the collected data.
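Note that save_data() writes each record as `str(list)`, which is CSV-like but not strictly CSV (brackets and quotes end up in the file). If you want output that spreadsheet tools parse reliably, the standard-library csv module is a drop-in alternative; the sketch below uses a hypothetical filename pattern of its own:

```python
import csv


def save_data_csv(datas, kw, page):
    """Append [title, url] records as proper CSV rows (hypothetical filename)."""
    path = f'./baidu_{kw}_page_{page}.csv'
    # newline='' is required so csv handles line endings itself;
    # utf-8-sig adds a BOM so Excel detects the encoding.
    with open(path, 'a', newline='', encoding='utf-8-sig') as fp:
        writer = csv.writer(fp)
        writer.writerows(datas)
```

Calling `save_data_csv(datas, kw, page)` in main() would then replace the original save_data() call without other changes.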
3. Conclusion
This article shares a working Python web-scraping script that fetches Baidu search-result titles and URLs via XPath. It complements the earlier regex and bs4 tutorials, and readers are encouraged to try the method and explore further.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!