How to Build a Baidu Web Scraper in Python: Step‑by‑Step Guide

This article walks through a Python‑based Baidu web‑scraping solution, explaining the problem, showing the debugging process with screenshots, and providing a complete, runnable script that extracts keywords, titles, URLs, descriptions, and site names into a CSV file.

Python Crawling & Data Mining

May 3, 2023

1. Introduction

The author shares a Python web‑scraping issue raised in a community chat, describing how the Baidu search results page structure changed and caused the original crawler to fail.
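The article does not spell out exactly which part of the markup changed. One common way a redesign breaks a crawler is a selector that compares the whole `class` attribute as a single string, so that adding, dropping, or reordering one class name makes it match nothing. The sketch below contrasts exact-string matching with subset matching. Only the first class string comes from the script in this article; the "after redesign" strings and the helper `has_classes` are made up for illustration.

```python
def has_classes(class_attr, required):
    """Return True if every name in `required` appears in the class attribute."""
    return set(required) <= set(class_attr.split())


# Class string the script in this article searches for.
expected = "result c-container xpath-log new-pmd"

# Hypothetical class strings a page might serve after a redesign.
after_redesign = [
    "result c-container new-pmd",                  # one class dropped
    "result c-container xpath-log new-pmd extra",  # one class added
    "c-container result xpath-log new-pmd",        # same classes, reordered
]

for class_attr in after_redesign:
    exact = class_attr == expected
    subset = has_classes(class_attr, ["result", "c-container"])
    print(f"{class_attr!r}: exact match={exact}, subset match={subset}")
```

Matching on a smaller set of class names tolerates cosmetic changes, at the cost of possibly picking up unrelated elements that share those names.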

2. Implementation Process

Guidance from community members helped identify the structural changes. Screenshots illustrate the problematic page and the corrected HTML elements.
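When the screenshots are not at hand, one way to spot this kind of structural change is to tally the `class` attribute values that actually occur in the returned HTML and compare them with what the selector expects. The following is a minimal sketch using only the standard library. `ClassCounter`, `count_classes`, and the sample HTML are illustrative assumptions, not part of the original script or of Baidu's real markup.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count how often each class attribute value appears on a given tag."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != self.tag_name:
            return
        class_attr = dict(attrs).get("class")
        if class_attr:
            self.counts[class_attr] += 1


def count_classes(html, tag_name="div"):
    """Return a Counter of class attribute values found on `tag_name` elements."""
    parser = ClassCounter(tag_name)
    parser.feed(html)
    return parser.counts


# Made-up HTML standing in for a saved copy of the results page.
sample_html = """
<div class="result c-container new-pmd"><a href="/link1">First</a></div>
<div class="result c-container new-pmd"><a href="/link2">Second</a></div>
<div class="ad-block"><a href="/ad">Ad</a></div>
"""

for class_attr, n in count_classes(sample_html).most_common():
    print(n, class_attr)
```

In practice `sample_html` would be replaced by the text of a real response. A class string that repeats roughly once per search result is a good candidate for the result container.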

The corrected code, shown in full below, extracts the desired data.

import os
import random
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
    "Connection": "keep-alive",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.baidu.com",
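    # The cookie value is truncated in the source; replace it with one copied from your own browser session.
    "Cookie": "BIDUPSID=...; PSTM=...; delPer=0; BD1"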
}

def baidu_search(v_keyword, v_result_file, v_max_page):
    """Search Baidu and save results to CSV.
    :param v_keyword: search keyword
    :param v_result_file: output CSV file name
    :param v_max_page: number of result pages to crawl
    """
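    for page in range(v_max_page):
        print('Crawling page {}'.format(page + 1))
        # Pause briefly between requests (an added courtesy delay; it uses the
        # otherwise unused time and random imports).
        time.sleep(random.uniform(1, 2))
        # Let requests URL-encode the keyword; pn is the result offset (10 per page).
        r = requests.get('https://www.baidu.com/s',
                         params={'wd': v_keyword, 'pn': page * 10},
                         headers=headers, timeout=10)
        r.raise_for_status()
        html = r.text
        # Name the parser explicitly so the result does not depend on which parsers are installed.
        soup = BeautifulSoup(html, 'html.parser')
        result_list = soup.find_all(class_='result c-container xpath-log new-pmd')
        print('Read {}: found {} results'.format(r.url, len(result_list)))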
        kw_list, page_list, title_list, href_list, desc_list, site_list = [], [], [], [], [], []
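        for result in result_list:
            link = result.find('a')
            if link is None or not link.get('href'):
                # Skip result blocks that have no title link.
                continue
            title = link.text
            print('title is: ', title)
            href = link['href']
            # find() returns None when an element is missing, so fall back to an empty string.
            desc_tag = result.find(class_="c-container")
            desc = desc_tag.text if desc_tag is not None else ""
            site_tag = result.find(class_="c-color-gray")
            site = site_tag.text if site_tag is not None else ""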
            kw_list.append(v_keyword)
            page_list.append(page + 1)
            title_list.append(title)
            href_list.append(href)
            desc_list.append(desc)
            site_list.append(site)
        df = pd.DataFrame({
            'Keyword': kw_list,
            'Page': page_list,
            'Title': title_list,
            'Baidu link': href_list,
            'Description': desc_list,
            'Site name': site_list,
        })
        # Write the header row only when the file is first created.
        write_header = not os.path.exists(v_result_file)
        # utf_8_sig adds a BOM so that Excel detects the encoding correctly.
        df.to_csv(v_result_file, mode='a', index=False, header=write_header, encoding='utf_8_sig')
        print('Results saved: {}'.format(v_result_file))

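if __name__ == '__main__':
    search_keyword = '地铁故障起火'  # "subway malfunction fire"
    max_page = 20
    result_file = f'baidu_{search_keyword}_first_{max_page}_pages.csv'
    # Start from a clean file so results from an earlier run are not mixed in.
    if os.path.exists(result_file):
        os.remove(result_file)
        print('Result file ({}) already existed and was deleted'.format(result_file))
    baidu_search(search_keyword, result_file, max_page)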
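The search URL carries the keyword in `wd` and a result offset in `pn`, which the script advances by 10 for each page. A keyword that contains spaces, `&`, or `#` has to be percent-encoded before it goes into a query string, otherwise it can corrupt the request. Below is a minimal sketch with the standard library; `build_search_url` and `RESULTS_PER_PAGE` are illustrative names that do not appear in the original script.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

RESULTS_PER_PAGE = 10  # matches the `page * 10` offset used in the script


def build_search_url(keyword, page):
    """Build the results URL for a zero-based page index."""
    query = urlencode({"wd": keyword, "pn": page * RESULTS_PER_PAGE})
    return "https://www.baidu.com/s?" + query


url = build_search_url("地铁故障起火", 2)
print(url)
# Decoding the query string again recovers the original keyword and offset.
print(parse_qs(urlsplit(url).query))
```

Passing the same dictionary to `requests.get` through its `params` argument achieves the same encoding without building the string by hand.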

3. Conclusion

The article demonstrates how to diagnose a broken Python web scraper, adjust the parsing logic to match the updated page structure, and export the collected data to a CSV file, providing a practical example for anyone facing similar crawling challenges.
