How to Scrape Baidu Keywords and Links with Python & BeautifulSoup

This tutorial demonstrates how to use Python's requests library together with BeautifulSoup to crawl Baidu search results, extract titles and URLs, and save the data into a CSV file, providing complete code and step‑by‑step explanations.

Python Crawling & Data Mining

1. Introduction

A fan shared a Python web‑scraping script that extracts Baidu search keywords and links using regular expressions. In this article we replace the regex approach with bs4 (BeautifulSoup) for more reliable extraction.
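
To illustrate why a parser tends to hold up better than a pattern, here is a minimal sketch that runs both approaches on a small hand-written HTML fragment. The fragment is made up for illustration and only imitates the `#content_left .t a` structure that the script in section 2 selects; real Baidu markup may differ. The regex is also an assumption, not the fan's original pattern. The sketch uses Python's built-in `html.parser` so it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure the script selects;
# it is not real Baidu markup.
html = """
<div id="content_left">
  <h3 class="t"><a href="http://example.com/a">Python
    tutorial</a></h3>
  <h3 class="t"><a target="_blank" href='http://example.com/b'>Learn <em>Python</em></a></h3>
</div>
<div id="sidebar"><h3 class="t"><a href="http://example.com/ad">Ad</a></h3></div>
"""

# Regex approach: only matches the exact attribute order and quoting
# the pattern was written for, and cannot tell containers apart.
regex_links = re.findall(r'<a href="(.*?)">', html)

# BeautifulSoup approach: a CSS selector scoped to the result container.
soup = BeautifulSoup(html, "html.parser")
results = [
    (" ".join(a.get_text().split()), a.get("href"))  # collapse whitespace in titles
    for a in soup.select("#content_left .t a")
]

print(regex_links)
print(results)
```

On this fragment the regex misses the second result (different attribute order and quoting) and picks up the sidebar link, while the selector returns exactly the two links inside `#content_left`.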

2. Implementation

The full script is shown below. It sends requests to Baidu, parses the result page with BeautifulSoup, collects titles and URLs, and writes them to a CSV file.

# -*- coding: utf-8 -*-
# @Time    : 2022/4/20  18:24
# @Author  : PiPi: Python Sharing Home
# @File    : demo.py

import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

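# Convert an encrypted Baidu redirect link to the real URL.
# Note: get_url below does not call this helper; apply it to each g_url
# if you need the final address instead of Baidu's redirect link.
def convert_url(url):
    resp = requests.get(url=url, headers=headers, allow_redirects=False)
    # Fall back to the original link if the response carries no redirect.
    return resp.headers.get('Location', url)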

# Get URLs for a given keyword and number of pages
def get_url(wd, num):
    s = requests.session()
    total_title = []
    total_url = []
    total_info = []
    # Baidu's "pn" parameter is the result offset: 0 for page 1, 10 for page 2, etc.
    for page in range(num):
        url = 'https://www.baidu.com/s'
        params = {"wd": wd, "pn": page * 10}
        r = s.get(url=url, headers=headers, params=params)
        print("Status code:", r.status_code)
        soup = BeautifulSoup(r.text, 'lxml')
        # Each result title is a link inside an element with class "t"
        for so in soup.select('#content_left .t a'):
            g_url = so.get('href')
            g_title = so.get_text().replace('\n', '').strip()
            print(g_title, g_url)
            total_title.append(g_title)
            total_url.append(g_url)
        print("Current page:", page + 1)
        time.sleep(1)  # pause between requests
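    try:
        total_info = list(zip(total_title, total_url))
        df = pd.DataFrame(data=total_info, columns=['Title', 'Url'])
        # utf_8_sig writes a byte order mark so spreadsheet tools detect UTF-8
        df.to_csv('./web_data.csv', index=False, encoding='utf_8_sig')
        print("Saved successfully")
        return True
    except Exception as e:
        print("Failed to save:", e)
        return False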

if __name__ == '__main__':
    # get_url and convert_url read these headers as a module-level global
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0",
        "Host": "www.baidu.com",
    }
    while True:
        wd = input("Enter search keyword: ")
        num = int(input("Enter number of pages: "))
        get_url(wd, num)
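
The paging logic is the easiest part of the script to get wrong, so here is a small sketch that isolates it. It assumes 10 results per page (the step the script uses) and that `pn` is the zero-based result offset. The helper names `page_offsets` and `build_search_url` are hypothetical and do not appear in the script. Nothing is sent over the network: the request is only prepared so that the final URL can be inspected.

```python
import requests

BASE_URL = "https://www.baidu.com/s"


def page_offsets(num_pages, page_size=10):
    """Return the pn offset for each page: 0, 10, 20, ..."""
    return [page * page_size for page in range(num_pages)]


def build_search_url(wd, pn):
    """Prepare (but do not send) the request and return its final URL."""
    prepared = requests.Request("GET", BASE_URL, params={"wd": wd, "pn": pn}).prepare()
    return prepared.url


offsets = page_offsets(3)
urls = [build_search_url("python", pn) for pn in offsets]
for u in urls:
    print(u)
```

Checking the prepared URLs this way is a quick way to confirm that the first request uses `pn=0` before running the scraper against the live site.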

Running the script prints each title and URL to the console and writes them to a CSV file named web_data.csv in the current directory.

Console output screenshot

The resulting CSV file looks like the image below.

CSV file preview
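
The script saves with `encoding='utf_8_sig'`, which prepends a byte order mark that spreadsheet programs such as Excel typically use to recognize UTF-8, so Chinese titles are less likely to appear garbled. The sketch below shows the effect on made-up rows; the data and the temporary file location are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for scraped results.
titles = ["Python 教程", "Learn Python"]
urls = ["http://example.com/a", "http://example.com/b"]

df = pd.DataFrame(list(zip(titles, urls)), columns=["Title", "Url"])

# Write to a temporary directory so the example leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "web_data.csv")
df.to_csv(path, index=False, encoding="utf_8_sig")

# The file starts with the UTF-8 byte order mark (EF BB BF).
with open(path, "rb") as f:
    first_bytes = f.read(3)

# Reading with the same codec strips the mark again.
loaded = pd.read_csv(path, encoding="utf_8_sig")
print(first_bytes)
print(loaded)
```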

3. Conclusion

This article shared a working Python script that fetches Baidu search results and extracts titles and links with BeautifulSoup. Compared with the earlier regex-based version, selecting elements with CSS selectors is less sensitive to small changes in the page markup. The next article will demonstrate the same extraction with XPath. Feel free to try it out.
