How to Scrape Baidu Tieba Titles and Images with Python and BeautifulSoup
This article demonstrates how to use Python's requests library and BeautifulSoup to crawl Baidu Tieba forums, extracting thread titles and associated images, explains why XPath may fail on irregular HTML, and provides a complete, runnable script with step-by-step implementation details.
Introduction
A member of a Python community recently asked how to scrape Baidu Tieba thread titles and the images in the post content. Earlier articles covered regular expressions and XPath; this one achieves the same goal with BeautifulSoup (bs4).
Implementation Details
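The response from Tieba is not well-formed HTML: as the comment in the script notes, the thread list is hidden inside HTML comment markers, so XPath cannot reliably extract the data from the raw response. Fetching the page with requests, removing the comment markers, and parsing the result with BeautifulSoup works.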
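The effect of the comment markers can be reproduced offline. The snippet below is a minimal sketch: the sample markup is hand-written for illustration (it is not a captured Tieba response), and it uses Python's built-in "html.parser" so that it runs without lxml. It counts the thread items a parser sees before and after the markers are removed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the list items sit inside an HTML comment,
# so a parser treats them as comment text rather than as tags.
SAMPLE = """
<html><body>
<div id="hidden"><!--
<ul>
  <li class="j_thread_list clearfix thread_item_box"><div><a href="/p/1">First thread</a></div></li>
  <li class="j_thread_list clearfix thread_item_box"><div><a href="/p/2">Second thread</a></div></li>
</ul>
--></div>
</body></html>
"""


def count_threads(html_text):
    """Return how many thread <li> elements the parser can see."""
    soup = BeautifulSoup(html_text, "html.parser")
    return len(soup.find_all("li", class_="j_thread_list"))


before = count_threads(SAMPLE)
# Same trick as the script: delete the comment markers, keep what they wrapped.
after = count_threads(SAMPLE.replace("<!--", "").replace("-->", ""))
print(before, after)
```

With the markers in place the items are invisible to `find_all`; once they are stripped, both items are found.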
The following script demonstrates the complete process:
# coding:utf-8
# @Time : 2022/5/3 10:46
# @Author: PiPi
# @Public Account: Python Sharing Home
# @Website : http://pdcfighting.com/
# @File : baidu_tieba.py
# @Software: PyCharm
import requests
from bs4 import BeautifulSoup


class TiebaSpider:
    def __init__(self, name):
        self.start_url = "https://tieba.baidu.com/f?kw=" + name + "&ie=utf-8&pn=0"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
            "Cookie": "your_cookie"  # placeholder: replace with your own cookie
        }

    def parse_url(self, url):
        # Send the request and return the response body decoded as UTF-8 text
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def get_content_list(self, html_str):
        # The data is hidden inside HTML comments, so remove the comment markers
        html = html_str.replace('<!--', '').replace('-->', '')
        soup = BeautifulSoup(html, "lxml")
        thread_list = soup.find_all('li', class_="j_thread_list clearfix thread_item_box")
        print(len(thread_list))
        resp = []
        for h in thread_list:
            # Title: text of the first link inside the item's first <div>
            title = h.find('div').find('a').text
            # Image: 'bpic' attribute of the first <img>, or '' if there is none
            img = h.find_all('img')
            img = img[0].get('bpic') if img else ''
            resp.append((title, img))
        print(resp)
        return resp

    def run(self):
        html_str = self.parse_url(self.start_url)
        self.get_content_list(html_str)


if __name__ == '__main__':
    tieba_spider = TiebaSpider("李毅")
    tieba_spider.run()
Result
The script runs successfully: it prints the number of matched threads, then a list of (title, image URL) pairs, with an empty string in place of the URL for threads that have no image. The original post shows a screenshot of this output.
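The shape of that output can be checked without touching the network. The following is a rough sketch of the extraction step run against a hand-written snippet; the sample markup, the `example.com` URLs, and the function name `extract_threads` are all assumptions made for illustration. It also guards against items that lack a `<div>` or link, which the script above does not.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one thread with an image, one without.
SAMPLE = """
<ul>
  <li class="j_thread_list clearfix thread_item_box">
    <div><a href="/p/1">Thread with a picture</a></div>
    <img bpic="https://example.com/full.jpg" src="https://example.com/thumb.jpg">
  </li>
  <li class="j_thread_list clearfix thread_item_box">
    <div><a href="/p/2">Text-only thread</a></div>
  </li>
</ul>
"""


def extract_threads(html_text):
    """Return a list of (title, image_url) tuples, one per thread item."""
    soup = BeautifulSoup(html_text, "html.parser")
    results = []
    for item in soup.find_all("li", class_="j_thread_list clearfix thread_item_box"):
        div = item.find("div")
        link = div.find("a") if div else None
        # Fall back to an empty title instead of raising on a missing node.
        title = link.get_text(strip=True) if link else ""
        img = item.find("img")
        # 'bpic' is the attribute the script reads; default to '' if absent.
        image_url = img.get("bpic", "") if img else ""
        results.append((title, image_url))
    return results


for title, image_url in extract_threads(SAMPLE):
    print(title, "->", image_url or "(no image)")
```

Returning the list rather than only printing it makes the result easier to save or post-process later.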
Conclusion
This article shares a practical Python web‑scraping solution for Baidu Tieba using BeautifulSoup, complementing earlier methods based on regular expressions and XPath. Readers are encouraged to try the code, adapt it to other forums, and continue learning web data extraction techniques.
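As a small first step toward adapting the script, the listing URL can be built from its query parameters instead of by string concatenation. This is only a sketch: the helper name `build_list_url` is hypothetical, the parameter names `kw`, `ie`, and `pn` are taken from the URL in the script, and the meaning of `pn` values other than 0 is an assumption that has not been verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://tieba.baidu.com/f"


def build_list_url(name, pn=0):
    """Build a forum listing URL; urlencode percent-encodes non-ASCII names."""
    # 'pn' defaults to 0 as in the script; other values are an unverified assumption.
    query = urlencode({"kw": name, "ie": "utf-8", "pn": pn})
    return BASE_URL + "?" + query


print(build_list_url("python"))
print(build_list_url("李毅"))
```

Encoding the forum name explicitly keeps the URL valid regardless of which characters the name contains.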
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
