How to Scrape Baidu Tieba Titles and Images with Python and BeautifulSoup

This article shows how to crawl a Baidu Tieba forum with Python's requests library and BeautifulSoup, extracting thread titles and their associated images. It explains why XPath may fail on Tieba's irregular HTML and provides a complete, runnable script with implementation details.

Python Crawling & Data Mining

Introduction

This article grew out of a question from a member of a Python community: how do you scrape Baidu Tieba thread titles and the images in the post content? Earlier articles covered regular expressions and XPath; this one reaches the same goal with BeautifulSoup (bs4).

Implementation Details

The response from Tieba is not well-formed HTML, and the thread data is hidden inside HTML comments, so XPath cannot reliably extract it. Fetching the page with requests, stripping the comment markers, and parsing the result with BeautifulSoup works.
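
The effect of the comment markers can be reproduced offline. The sketch below uses a hypothetical, simplified stand-in for a Tieba response (SAMPLE_HTML and count_thread_items are illustrative names, not part of the original script) and Python's built-in "html.parser" so that it runs without lxml. It only illustrates why the markers have to be removed before searching for the list items.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a Tieba response: the thread list
# sits inside an HTML comment, so the parser treats it as comment text.
SAMPLE_HTML = """
<html><body>
<div id="pagelet">
<!--
<ul>
  <li class="j_thread_list clearfix thread_item_box"><div><a href="/p/1">First thread</a></div></li>
  <li class="j_thread_list clearfix thread_item_box"><div><a href="/p/2">Second thread</a></div></li>
</ul>
-->
</div>
</body></html>
"""


def count_thread_items(html_str):
    """Count the <li> thread items that are visible in the parsed tree."""
    soup = BeautifulSoup(html_str, "html.parser")
    return len(soup.find_all("li", class_="j_thread_list clearfix thread_item_box"))


# With the comment markers in place, the <li> elements are not in the tree.
hidden_count = count_thread_items(SAMPLE_HTML)

# After stripping the markers (the same replacement the script performs), they are.
uncommented = SAMPLE_HTML.replace("<!--", "").replace("-->", "")
visible_count = count_thread_items(uncommented)

print(hidden_count, visible_count)
```

Note that passing a multi-word string to class_ matches the class attribute as an exact string, so the search depends on the classes appearing in exactly this order.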

The following script demonstrates the complete process:

# coding:utf-8

# @Time : 2022/5/3 10:46
# @Author: PiPi
# @Public Account: Python Sharing Home
# @Website : http://pdcfighting.com/
# @File : baidu_tieba.py
# @Software: PyCharm
import requests
from bs4 import BeautifulSoup

class TiebaSpider:
    def __init__(self, name):
        self.start_url = "https://tieba.baidu.com/f?kw=" + name + "&ie=utf-8&pn=0"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
            "Cookie": "your_cookie"  # placeholder: replace with your own Cookie value
        }

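    def parse_url(self, url):  # send the request and return the decoded HTML
        # a timeout keeps the script from hanging if the server never answers
        response = requests.get(url, headers=self.headers, timeout=10)
        return response.content.decode()  # decode() defaults to UTF-8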

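    def get_content_list(self, html_str):
        # the thread data is hidden in HTML comments, so remove the comment markers
        html = html_str.replace('<!--', '').replace('-->', '')
        soup = BeautifulSoup(html, "lxml")
        # a multi-word class_ string matches the class attribute as an exact string
        thread_items = soup.find_all('li', class_="j_thread_list clearfix thread_item_box")
        print(len(thread_items))

        resp = []
        for item in thread_items:
            # the title is the text of the first link inside the item's first <div>
            first_div = item.find('div')
            link = first_div.find('a') if first_div else None
            title = link.text.strip() if link else ''
            # the image URL is read from the 'bpic' attribute of the first <img>, if any
            imgs = item.find_all('img')
            img = imgs[0].get('bpic', '') if imgs else ''
            resp.append((title, img))
        print(resp)
        return resp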

    def run(self):
        html_str = self.parse_url(self.start_url)
        self.get_content_list(html_str)

if __name__ == '__main__':
    tieba_spider = TiebaSpider("李毅")
    tieba_spider.run()
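
The script builds its start URL by string concatenation and only requests the first page (pn=0). As a hedged sketch, the same kw, ie and pn parameters can be assembled with the standard library's urllib.parse.urlencode, which percent-encodes the forum name. The helper names build_list_url and page_urls are hypothetical, and the step of 50 between pn values is an assumption about how Tieba paginates that the original script does not confirm.

```python
from urllib.parse import urlencode

BASE_URL = "https://tieba.baidu.com/f"


def build_list_url(name, pn=0):
    """Build a forum list URL with the kw, ie and pn parameters the script uses."""
    return BASE_URL + "?" + urlencode({"kw": name, "ie": "utf-8", "pn": pn})


def page_urls(name, pages, page_size=50):
    """Return list URLs for the first `pages` pages.

    page_size=50 is an assumption about the pn offset between pages.
    """
    return [build_list_url(name, page * page_size) for page in range(pages)]


urls = page_urls("李毅", 2)
for url in urls:
    print(url)
```

Each generated URL could then be passed to parse_url in place of start_url.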

Result

The script runs successfully. It first prints the number of thread items it matched, then a list of (title, image URL) pairs; threads without an image get an empty string in place of the URL.
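
For readers who want to keep the output rather than only print it, here is a minimal sketch that serializes pairs of that shape to CSV text. The sample data and the results_to_csv helper are hypothetical; they are not output from a real run or part of the original script.

```python
import csv
import io

# Hypothetical sample in the same (title, image_url) shape the script prints.
sample_results = [
    ("Thread with a picture", "https://example.com/a.jpg"),
    ("Text-only thread", ""),
]


def results_to_csv(results):
    """Serialize (title, image_url) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "image_url"])
    writer.writerows(results)
    return buffer.getvalue()


csv_text = results_to_csv(sample_results)

# Titles of the threads that actually carry an image URL.
with_images = [title for title, img in sample_results if img]

print(csv_text)
```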

Conclusion

This article shares a practical Python web‑scraping solution for Baidu Tieba using BeautifulSoup, complementing earlier methods based on regular expressions and XPath. Readers are encouraged to try the code, adapt it to other forums, and continue learning web data extraction techniques.

Written by

Python Crawling & Data Mining
