How to Scrape Baidu Keywords and Links with Python & BeautifulSoup
This tutorial demonstrates how to use Python's requests library together with BeautifulSoup to crawl Baidu search results, extract titles and URLs, and save the data into a CSV file, providing complete code and step‑by‑step explanations.
1. Introduction
A fan shared a Python web‑scraping script that extracts Baidu search keywords and links using regular expressions. In this article we replace the regex approach with bs4 (BeautifulSoup) for more reliable extraction.
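To illustrate why an HTML parser tends to be more robust than a regular expression, the sketch below runs both over a small, made-up HTML fragment. The markup, the regex, and the function names are assumptions for illustration only; the fan's original regex is not reproduced in this article. The one detail taken from the script is the `#content_left .t a` selector.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a list of search results
SAMPLE_HTML = """
<div id="content_left">
  <h3 class="t"><a href="https://example.com/a">First <em>result</em></a></h3>
  <h3 class="t"><a
      href="https://example.com/b">Second result</a></h3>
</div>
"""


def links_with_regex(html):
    # Illustrative pattern: it assumes href follows "<a " on the same line
    return re.findall(r'<a href="(.*?)">(.*?)</a>', html)


def links_with_bs4(html):
    # html.parser is the parser bundled with Python, so lxml is not needed here
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get("href"), a.get_text().strip())
        for a in soup.select("#content_left .t a")
    ]


if __name__ == "__main__":
    print("regex:", links_with_regex(SAMPLE_HTML))
    print("bs4:  ", links_with_bs4(SAMPLE_HTML))
```

On this sample the regex finds only the first link and leaves the `<em>` tag inside the title, while BeautifulSoup returns both links with clean text.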
2. Implementation
The full script is shown below. It sends requests to Baidu, parses the result page with BeautifulSoup, collects titles and URLs, and writes them to a CSV file.
# -*- coding: utf-8 -*-
# @Time : 2022/4/20 18:24
# @Author : PiPi: Python Sharing Home
# @File : demo.py
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Request headers shared by every request below
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0",
    "Host": "www.baidu.com",
}


# Convert an encrypted Baidu URL to the real URL.
# Note: get_url() below does not call this helper; it is available if you
# want to resolve the redirect links that Baidu returns.
def convert_url(url):
    resp = requests.get(url=url, headers=headers, allow_redirects=False)
    # Fall back to the input URL if the response is not a redirect
    return resp.headers.get('Location', url)


# Get titles and URLs for a given keyword and number of pages
def get_url(wd, num):
    s = requests.session()
    total_title = []
    total_url = []
    # pn is a result offset: page 1 is pn=0, page 2 is pn=10, and so on
    for i in range(0, num * 10, 10):
        url = 'https://www.baidu.com/s'
        params = {"wd": wd, "pn": i}
        r = s.get(url=url, headers=headers, params=params)
        print("Status code:", r.status_code)
        soup = BeautifulSoup(r.text, 'lxml')
        # Each result title is a link inside an element with class "t"
        for so in soup.select('#content_left .t a'):
            g_url = so.get('href')
            g_title = so.get_text().replace('\n', '').strip()
            print(g_title, g_url)
            total_title.append(g_title)
            total_url.append(g_url)
        # Pause a little longer after each successive page
        time.sleep(1 + (i / 10))
        print("Current page:", i // 10 + 1)
    try:
        total_info = list(zip(total_title, total_url))
        df = pd.DataFrame(data=total_info, columns=['Title', 'Url'])
        # utf_8_sig adds a BOM so Excel displays Chinese text correctly
        df.to_csv('./web_data.csv', index=False, encoding='utf_8_sig')
        print("Saved successfully")
    except Exception as exc:
        print("Save failed:", exc)
        return 'FALSE'


if __name__ == '__main__':
    while True:
        wd = input("Search term: ")
        num = int(input("Number of pages: "))
        get_url(wd, num)

Running the script prints each title and URL to the console and automatically generates a CSV file named web_data.csv containing the extracted titles and URLs.
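In a Baidu search URL, the `pn` parameter is a result offset rather than a page number, so page 1 corresponds to `pn=0`, page 2 to `pn=10`, and so on. A minimal sketch of that mapping follows. `page_offsets` and `per_page` are illustrative names that do not appear in the script, and the ten-results-per-page figure is an assumption based on the script's step size.

```python
def page_offsets(pages, per_page=10):
    """Return the pn offsets for the first `pages` result pages."""
    if pages < 1:
        raise ValueError("pages must be at least 1")
    return list(range(0, pages * per_page, per_page))


if __name__ == "__main__":
    # Show which offset would be requested for each page
    for page_number, pn in enumerate(page_offsets(3), start=1):
        print(f"page {page_number}: pn={pn}")
```

Starting the range at 0 avoids requesting a negative offset, which would most likely just return the first page a second time.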
The resulting CSV file holds one row per search result, with the title in the first column and the URL in the second.
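The script saves the file with `encoding='utf_8_sig'`, which prepends a UTF-8 byte order mark so that spreadsheet programs such as Excel detect the encoding and display Chinese titles correctly. The sketch below shows the same idea with only the standard library's `csv` module instead of pandas. `save_rows` and `has_bom` are hypothetical helpers, and the sample row is made up.

```python
import csv
import os
import tempfile


def save_rows(path, rows, header=("Title", "Url")):
    """Write a header and rows to a CSV file that starts with a UTF-8 BOM."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def has_bom(path):
    """Return True if the file begins with the UTF-8 byte order mark."""
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"


if __name__ == "__main__":
    # Write to a temporary directory so the demo leaves no files behind
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "web_data.csv")
        save_rows(path, [("百度一下", "https://www.baidu.com")])
        print("BOM present:", has_bom(path))
```

When reading such a file back in Python, opening it with `encoding="utf-8-sig"` strips the mark again.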
3. Conclusion
This article shares a working Python script that fetches Baidu search results and parses them with BeautifulSoup, which is more reliable than the earlier regex-based approach. The next article will demonstrate extraction with XPath. Feel free to try it and learn together.
