How to Scrape Baidu Keywords and Links with Python and XPath
This tutorial demonstrates a complete Python script that uses requests and lxml's XPath to fetch Baidu search result titles and URLs, saves them to a CSV file, and includes sample output screenshots for verification.
In previous articles the author extracted Baidu keywords and links using regular expressions and BeautifulSoup; this article introduces extracting the same data with XPath.
Implementation Process
The full Python code below sends a request to Baidu, parses the HTML with lxml, extracts titles and URLs via XPath, and writes the results to a CSV file.
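Before the full listing, it may help to see how the keyword and page number turn into a request URL. The sketch below is illustrative only: `page_to_offset` and `build_search_url` are hypothetical helper names that do not appear in the original script, and the 10-results-per-page step is taken from the script's own `page * 10 - 10` arithmetic rather than from any Baidu documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.baidu.com/s"  # same endpoint the script requests
RESULTS_PER_PAGE = 10  # assumption: mirrors the script's `page * 10 - 10`


def page_to_offset(page):
    """Convert a 1-based page number into the `pn` result offset."""
    page = int(page)
    if page < 1:
        raise ValueError("page must be 1 or greater")
    return (page - 1) * RESULTS_PER_PAGE


def build_search_url(keyword, page):
    """Build a URL carrying the same wd/pn parameters the script sends."""
    params = {"wd": keyword, "pn": page_to_offset(page)}
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("python", 1))
    print(build_search_url("python", 3))
```

Non-ASCII keywords are percent-encoded as UTF-8 by `urlencode`, so Chinese search terms can be passed in directly.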
# coding:utf-8
# @Time : 2022/4/21 15:03
# @Author: PiPi
# @Public Account: Python Sharing Home
# @Website : http://pdcfighting.com/
# @File : BaiduKeywordCrawler(xpath).py
# @Software: PyCharm
import requests
from fake_useragent import UserAgent
from lxml import etree


def get_web_page(wd, pn):
    """Fetch one Baidu result page for keyword `wd` at result offset `pn`."""
    url = 'https://www.baidu.com/s'
    ua = UserAgent()
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'User-Agent': ua.random,  # random browser User-Agent string
        'Cookie': '... (omitted for brevity) ...',
        'Host': 'www.baidu.com'
    }
    params = {'wd': wd, 'pn': pn}
    # A timeout keeps the script from hanging if the server does not respond.
    response = requests.get(url, headers=headers, params=params, timeout=10)
    response.encoding = 'utf-8'
    return response.text
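

def parse_page(response):
    """Extract [title, url] pairs from the HTML of a result page."""
    html = etree.HTML(response)
    selectors = html.xpath('//div[@class="c-container"]')
    data = []
    for selector in selectors:
        # Join all text nodes so titles containing nested tags stay intact.
        title = "".join(selector.xpath('.//h3/a//text()'))
        hrefs = selector.xpath('.//h3/a/@href')
        if not hrefs:
            # Skip containers without a linked heading instead of raising IndexError.
            continue
        titleUrl = hrefs[0]
        print(title)
        print(titleUrl)
        data.append([title, titleUrl])
    print(f"This page has {len(data)} title/URL entries in total!")
    return data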
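

def save_data(datas, kw, page):
    """Append one line per [title, url] pair to a file named after the keyword and page."""
    # Open the file once rather than once per row.
    with open(f'./Baidu_{kw}_page_{page}_data(xpath).csv', 'a', encoding='utf-8') as fp:
        for data in datas:
            fp.write(str(data) + '\n')
    print(f"Data for page {page} of the Baidu results for {kw} has been saved!")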


def main():
    kw = input("Enter the keyword to search for: ").strip()
    page = input("Enter the page number to fetch: ").strip()
    # pn is sent as a result offset: page 1 -> 0, page 2 -> 10, and so on.
    page_pn = str(int(page) * 10 - 10)
    resp = get_web_page(kw, page_pn)
    datas = parse_page(resp)
    save_data(datas, kw, page)


if __name__ == '__main__':
    main()

Running the script produces the extracted titles and URLs as shown in the following screenshot.
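The extraction step can also be checked offline, without sending a request to Baidu. The sketch below runs the same three XPath expressions used in `parse_page` against a small hand-written HTML snippet. The snippet, the `example.com` links, and the `extract_results` name are made up for illustration and are not Baidu's real markup, which may differ and change over time.

```python
from lxml import etree

# Hand-written sample markup (an assumption, not a copy of a real Baidu page).
SAMPLE_HTML = """
<html><body>
  <div class="c-container">
    <h3><a href="http://example.com/a">Python <em>XPath</em> tutorial</a></h3>
  </div>
  <div class="c-container">
    <p>A container with no linked heading</p>
  </div>
  <div class="c-container">
    <h3><a href="http://example.com/b">lxml basics</a></h3>
  </div>
</body></html>
"""


def extract_results(html_text):
    """Return [title, url] pairs using the same XPath expressions as parse_page."""
    tree = etree.HTML(html_text)
    rows = []
    for container in tree.xpath('//div[@class="c-container"]'):
        # //text() collects text from nested tags such as <em> as well.
        title = "".join(container.xpath('.//h3/a//text()')).strip()
        hrefs = container.xpath('.//h3/a/@href')
        if not hrefs:
            continue  # no linked heading in this container
        rows.append([title, hrefs[0]])
    return rows


if __name__ == "__main__":
    for title, url in extract_results(SAMPLE_HTML):
        print(title, url)
```

Note that `@class="c-container"` is an exact string comparison, so a `div` carrying additional class names would not be selected; `contains(@class, "c-container")` is a common, looser alternative.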
The script also generates a CSV file containing the collected data, illustrated below.
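One caveat about that file: the listing writes each row with `str(data)`, so every line holds Python list syntax (brackets and quotes) rather than comma-separated fields. A minimal alternative sketch using the standard-library `csv` module is shown here. The `title,url` header row and the function names `rows_to_csv_text` and `save_rows` are assumptions of this sketch, not part of the original script.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Serialize [title, url] rows to CSV text, quoting fields when needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "url"])  # header row (an assumption of this sketch)
    writer.writerows(rows)
    return buffer.getvalue()


def save_rows(path, rows):
    """Write the rows to `path`, replacing any existing file."""
    # newline="" stops Python from translating the writer's own line endings.
    with open(path, "w", newline="", encoding="utf-8") as fp:
        fp.write(rows_to_csv_text(rows))


if __name__ == "__main__":
    sample = [["Python, XPath and lxml", "http://example.com/a"]]
    print(rows_to_csv_text(sample))
```

Because `csv.writer` quotes any field that contains a comma, titles with commas in them still load as a single column in a spreadsheet.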
Conclusion
This article shares a functional Python web‑scraping solution that leverages requests, fake_useragent, and lxml XPath to retrieve Baidu search result titles and links, complementing earlier approaches based on regular expressions and BeautifulSoup.
