
How to Scrape Baidu Tieba Titles and Images with Python Regex (Step‑by‑Step)

This article explains why XPath fails on Baidu Tieba pages, demonstrates how to extract thread titles and image URLs using Python's requests library combined with regular expressions, provides a complete runnable script, and shows the resulting output.

Python Crawling & Data Mining

May 11, 2022


Introduction

A user asked how to crawl Baidu Tieba thread titles and their associated images. An initial attempt with XPath returned no results, even though the target markup was visible in the page source.

Why XPath Doesn't Work

The response body is not well-formed HTML, so XPath selectors fail to locate the desired elements even though the markup is visible in the page source. Matching the raw response text with a regular expression sidesteps the parsing step entirely.
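One way markup can be visible in the source yet invisible to a tree-based selector is when it sits inside an HTML comment. Whether that is what happens on Tieba is an assumption here, not something the original post states. The sketch below uses the standard library's html.parser as a stand-in for an XPath engine, and the SAMPLE fragment and AnchorCollector class are hypothetical names made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment (an assumption): the thread link is wrapped in an HTML comment.
SAMPLE = '<div id="list"><!-- <a href="/p/1" title="First thread">First thread</a> --></div>'


class AnchorCollector(HTMLParser):
    """Collects the title attribute of every <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.titles.append(dict(attrs).get("title"))


collector = AnchorCollector()
collector.feed(SAMPLE)
collector.close()

# The parser treats the commented-out markup as comment text, not as elements.
parser_titles = collector.titles

# A regex works on the raw text, so it does not care about the comment wrapper.
regex_titles = re.findall(r'<a [^>]*title="(.*?)"', SAMPLE)

print(parser_titles)
print(regex_titles)
```

In this fragment the parser reports no anchors while the regular expression still finds the title, which is the general reason a text-level match can succeed where a structural query comes back empty.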

Solution Using Regex

The following Python script fetches the first page of a Tieba forum's thread list (pn=0 in the URL), then extracts thread titles and image URLs with a compiled regular expression that uses the named groups name and url.

# coding:utf-8

# @Time : 2022/5/1 10:46
# @Author: 皮皮
# @WeChat Official Account: Python共享之家
# @website : http://pdcfighting.com/
# @File : 百度贴吧.py
# @Software: PyCharm
import requests
import re

class TiebaSpider:
    def __init__(self, name):
        self.start_url = "https://tieba.baidu.com/f?kw=" + name + "&ie=utf-8&pn=0"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
            "Cookie": "your cookie"  # placeholder: paste your own cookie string here
        }

    def paser_url(self, url):  # send the request and return the decoded response body
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def get_content_list(self, html_str):
        # Match each thread's title and image URL directly in the raw response text
        pattern = re.compile(r'<li class=" j_thread_list clearfix thread_item_box".*?'
                             r'<a rel="noopener" href=".*?" title="(?P<name>.*?)".*? bpic="(?P<url>.*?)"', re.S)
        table = re.finditer(pattern, html_str)
        for data in table:
            print(data.group("name"))
            print(data.group("url"))

    def run(self):
        # 1. start_url
        # 2. Send the request and get the response
        html_str = self.paser_url(self.start_url)
        # 3. Extract the data and the next page's URL
        self.get_content_list(html_str)
        # 4. Save the data

if __name__ == '__main__':
    tieba_spider = TiebaSpider("李毅")
    tieba_spider.run()
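
The pattern in get_content_list can be tried offline, without a network request or cookie. The sketch below reuses the same regular expression against a small hand-written fragment. The SAMPLE markup, the example.com URLs, and the extract_threads helper are hypothetical and only approximate the shape of a real Tieba response.

```python
import re

# Same pattern as in get_content_list; re.S lets "." match newlines.
PATTERN = re.compile(
    r'<li class=" j_thread_list clearfix thread_item_box".*?'
    r'<a rel="noopener" href=".*?" title="(?P<name>.*?)".*? bpic="(?P<url>.*?)"',
    re.S,
)

# Hypothetical fragment (an assumption), shaped to fit the pattern above.
SAMPLE = '''
<li class=" j_thread_list clearfix thread_item_box" data-tid="1">
  <a rel="noopener" href="/p/1" title="First thread">First thread</a>
  <img bpic="https://example.com/a.jpg">
</li>
<li class=" j_thread_list clearfix thread_item_box" data-tid="2">
  <a rel="noopener" href="/p/2" title="Second thread">Second thread</a>
  <img bpic="https://example.com/b.jpg">
</li>
'''


def extract_threads(html_str):
    """Return (title, image_url) pairs instead of printing them."""
    return [(m.group("name"), m.group("url")) for m in PATTERN.finditer(html_str)]


threads = extract_threads(SAMPLE)
for name, url in threads:
    print(name, url)
```

Because the lazy `.*?` can cross `<li>` boundaries under re.S, a thread that has no bpic attribute would be paired with the next thread's image, so results for image-less threads are worth checking by hand.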

Running the TiebaSpider script prints each thread title followed by its corresponding image URL. The screenshots in the original post show the actual output.
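Step 4 in run ("save the data") is left as a comment, so the script only prints what it finds. A minimal sketch of one way to persist the results is shown here, assuming the title and URL pairs have been collected into a list. The save_threads function, the column names, and the sample row are hypothetical and not part of the original script.

```python
import csv
import io


def save_threads(threads, fileobj):
    """Write (title, image_url) pairs as CSV rows, preceded by a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "image_url"])
    writer.writerows(threads)


# Hypothetical row standing in for the script's printed output.
rows = [("First thread", "https://example.com/a.jpg")]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
save_threads(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open("threads.csv", "w", newline="", encoding="utf-8"); the csv module documentation recommends newline="" so that row endings are not translated twice.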

Conclusion

The article demonstrates a practical method for extracting Baidu Tieba thread titles and images using Python, regular expressions, and the requests library. The author plans to publish a follow‑up tutorial that shows how to achieve the same goal with XPath.
