How to Build a Baidu Web Scraper in Python: Step‑by‑Step Guide
This article walks through a Python‑based Baidu web‑scraping solution, explaining the problem, showing the debugging process with screenshots, and providing a complete, runnable script that extracts keywords, titles, URLs, descriptions, and site names into a CSV file.
1. Introduction
The author shares a Python web‑scraping issue raised in a community chat, describing how the Baidu search results page structure changed and caused the original crawler to fail.
2. Implementation Process
Community members helped trace the failure to changes in the structure of the results page. The screenshots in the original article show the page that broke the crawler and the HTML elements that the corrected selectors target.
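One way to check whether a selector still matches after a page changes is to run it against a saved copy of the HTML. The sketch below is a minimal illustration under stated assumptions: the sample markup, URLs, and titles are invented and are not Baidu's actual page, and `compare_selectors` is a hypothetical helper name. It contrasts an exact class-string match in BeautifulSoup with a CSS selector that only requires certain classes to be present.

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration only; not Baidu's real HTML.
SAMPLE_HTML = """
<div class="result c-container xpath-log new-pmd"><h3><a href="http://example.com/1">First</a></h3></div>
<div class="result c-container new-pmd xpath-log"><h3><a href="http://example.com/2">Second</a></h3></div>
<div class="result-op c-container"><h3><a href="http://example.com/3">Third</a></h3></div>
"""


def compare_selectors(html):
    """Return the titles matched by an exact class string and by a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    # A multi-class string passed to class_ matches only when the attribute
    # value is exactly this string, including the order of the classes.
    exact = soup.find_all(class_="result c-container xpath-log new-pmd")
    # A CSS selector matches any div carrying both classes, in any order.
    by_css = soup.select("div.result.c-container")
    return [t.a.get_text() for t in exact], [t.a.get_text() for t in by_css]


if __name__ == "__main__":
    exact_titles, css_titles = compare_selectors(SAMPLE_HTML)
    print("exact class string:", exact_titles)
    print("CSS selector:", css_titles)
```

If the exact match returns fewer results than expected, the class attribute on the live page has probably changed, and a looser selector or an updated class string may be needed.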
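The corrected script follows. For each search result it extracts the keyword, page number, title, Baidu link, description, and site name, and appends them to a CSV file.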
import os
import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Request headers copied from a browser session; the Cookie value is elided in the source.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
    "Connection": "keep-alive",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.baidu.com",
    "Cookie": "BIDUPSID=...; PSTM=...; delPer=0; BD1"
}


def baidu_search(v_keyword, v_result_file, v_max_page):
    """Search Baidu and save results to CSV.

    :param v_keyword: search keyword
    :param v_result_file: output CSV file name
    :param v_max_page: number of result pages to crawl
    """
    for page in range(v_max_page):
        print('Crawling page {}'.format(page + 1))
        # Let requests URL-encode the keyword; pn is the result offset (page * 10).
        params = {'wd': v_keyword, 'pn': page * 10}
        r = requests.get('https://www.baidu.com/s', params=params, headers=headers, timeout=10)
        html = r.text
        # Name the parser explicitly so the result does not depend on which parsers are installed.
        soup = BeautifulSoup(html, 'html.parser')
        result_list = soup.find_all(class_='result c-container xpath-log new-pmd')
        print('Reading: {}, found {} results'.format(r.url, len(result_list)))
        kw_list, page_list, title_list, href_list, desc_list, site_list = [], [], [], [], [], []
        for result in result_list:
            link = result.find('a')
            if link is None:
                continue  # skip result blocks that have no title link
            title = link.text
            print('title is: ', title)
            href = link.get('href', '')
            # find() returns None when the element is missing; fall back to an empty string.
            desc_tag = result.find(class_="c-container")
            desc = desc_tag.text if desc_tag is not None else ""
            site_tag = result.find(class_="c-color-gray")
            site = site_tag.text if site_tag is not None else ""
            kw_list.append(v_keyword)
            page_list.append(page + 1)
            title_list.append(title)
            href_list.append(href)
            desc_list.append(desc)
            site_list.append(site)
        df = pd.DataFrame({
            'Keyword': kw_list,
            'Page': page_list,
            'Title': title_list,
            'Baidu link': href_list,
            'Description': desc_list,
            'Site name': site_list,
        })
        # Write the header row only when the file does not exist yet.
        write_header = not os.path.exists(v_result_file)
        df.to_csv(v_result_file, mode='a', index=False, header=write_header, encoding='utf_8_sig')
        print('Results saved: {}'.format(v_result_file))
        # Pause between pages; the 1-2 second interval is an assumed value, not from the source.
        time.sleep(random.uniform(1, 2))


if __name__ == '__main__':
    search_keyword = '地铁故障起火'  # "subway malfunction fire"
    max_page = 20
    result_file = f'baidu_crawler_{search_keyword}_first_{max_page}_pages.csv'
    # Start from a clean file so repeated runs do not append duplicate rows.
    if os.path.exists(result_file):
        os.remove(result_file)
        print('Result file ({}) already existed and was deleted'.format(result_file))
    baidu_search(search_keyword, result_file, max_page)

3. Conclusion
The article demonstrates how to diagnose a broken Python web scraper, adjust the parsing logic to match the updated page structure, and export the collected data to a CSV file, providing a practical example for anyone facing similar crawling challenges.
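The export step appends one batch of rows per result page and writes the header only once. As a minimal sketch of that pattern, the example below uses the standard-library `csv` module rather than pandas so that it runs without extra dependencies. The `append_rows` helper, the English field names, and the sample rows are assumptions made for illustration.

```python
import csv
import os
import tempfile

# Illustrative field names (assumed); one column per value the crawler extracts.
FIELDS = ["keyword", "page", "title", "link", "description", "site"]


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only for a new or empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # utf_8_sig adds a byte-order mark at the start of the file, which helps
    # spreadsheet software detect UTF-8 when the data contains Chinese text.
    with open(path, "a", newline="", encoding="utf_8_sig") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
    for page in (1, 2):
        # Made-up sample row standing in for one page of parsed results.
        append_rows(demo_path, [{
            "keyword": "demo", "page": page, "title": f"title {page}",
            "link": f"http://example.com/{page}", "description": "", "site": "",
        }])
    with open(demo_path, newline="", encoding="utf_8_sig") as f:
        for row in csv.reader(f):
            print(row)
```

Deciding on the header from the file's state at write time keeps each page's write independent, so a crawl that stops partway still leaves a well-formed CSV.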
Python Crawling & Data Mining
