How to Scrape Baidu Tieba Titles and Images with Python Regex (Step‑by‑Step)
This article explains why XPath fails on Baidu Tieba pages, demonstrates how to extract thread titles and image URLs using Python's requests library combined with regular expressions, provides a complete runnable script, and shows the resulting output.
Introduction
A user asked how to crawl Baidu Tieba thread titles and their associated images. The initial attempt using XPath returned no results, even though the target elements were visible in the page source.
Why XPath Doesn't Work
The response body is not well-formed HTML, so XPath selectors cannot locate the desired elements. A regular-expression-based approach, which matches the raw text directly rather than a parsed tree, is needed instead.
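To illustrate why a regex can still succeed where a tree-based selector struggles, here is a minimal sketch. The fragment below is hypothetical (it is not real Tieba markup) and deliberately leaves its tags unclosed; the pattern only inspects the characters around the title attribute, so the missing closing tags do not matter.

```python
import re

# Hypothetical fragment (not real Tieba markup): the <li> and <a> tags are never closed.
fragment = (
    '<li class="item"><a href="/p/1" title="First thread">'
    '<li class="item"><a href="/p/2" title="Second thread">'
)

# The pattern works on raw text, so it does not need a valid document tree.
TITLE_PATTERN = re.compile(r'<a href="[^"]*" title="([^"]*)"')

titles = TITLE_PATTERN.findall(fragment)
print(titles)
```

The trade-off is that such a pattern is tied to the exact attribute order and spacing of the markup, so it may need adjusting whenever the page layout changes.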
Solution Using Regex
The following Python script fetches the first page of a Tieba forum (selected by the kw query parameter), then extracts each thread's title and image URL with a compiled regular expression that uses the named groups name and url.
# coding:utf-8
# @Time : 2022/5/1 10:46
# @Author: 皮皮
# @WeChat official account: Python共享之家
# @website : http://pdcfighting.com/
# @File : 百度贴吧.py
# @Software: PyCharm
import re

import requests


class TiebaSpider:
    def __init__(self, name):
        self.start_url = "https://tieba.baidu.com/f?kw=" + name + "&ie=utf-8&pn=0"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
            "Cookie": "your cookie",  # replace with your own cookie string
        }

    def parse_url(self, url):
        # Send the request and return the decoded response body
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def get_content_list(self, html_str):
        # Match each thread item and capture its title (name) and image URL (url)
        pattern = re.compile(r'<li class=" j_thread_list clearfix thread_item_box".*?'
                             r'<a rel="noopener" href=".*?" title="(?P<name>.*?)".*? bpic="(?P<url>.*?)"', re.S)
        for data in pattern.finditer(html_str):
            print(data.group("name"))
            print(data.group("url"))

    def run(self):
        # 1. Start from start_url
        # 2. Send the request and get the response
        html_str = self.parse_url(self.start_url)
        # 3. Extract the data (and the URL of the next page)
        self.get_content_list(html_str)
        # 4. Save the data


if __name__ == '__main__':
    tieba_spider = TiebaSpider("李毅")
    tieba_spider.run()

Running the script prints the thread titles and their corresponding image URLs, as shown in the screenshot below.
The second image displays the actual output of the script.
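The script prints its matches and leaves the "save the data" step empty. As a rough sketch of how the matches could be collected instead, the example below applies the same pattern to a hypothetical HTML fragment and returns a list of dictionaries. The sample markup, the example.com URLs, and the helper name extract_threads are assumptions made for illustration; real Tieba pages may differ.

```python
import re

# Same pattern as in the script above; re.S lets ".*?" span line breaks.
THREAD_PATTERN = re.compile(
    r'<li class=" j_thread_list clearfix thread_item_box".*?'
    r'<a rel="noopener" href=".*?" title="(?P<name>.*?)".*? bpic="(?P<url>.*?)"',
    re.S,
)


def extract_threads(html_str):
    """Hypothetical helper: return one {'name', 'url'} dict per matched thread."""
    return [match.groupdict() for match in THREAD_PATTERN.finditer(html_str)]


# Made-up sample that imitates the structure the pattern expects.
sample_html = '''
<li class=" j_thread_list clearfix thread_item_box" data-tid="1">
  <a rel="noopener" href="/p/1" title="First thread" target="_blank">First thread</a>
  <img src="" bpic="https://example.com/a.jpg" class="threadlist_pic">
</li>
<li class=" j_thread_list clearfix thread_item_box" data-tid="2">
  <a rel="noopener" href="/p/2" title="Second thread" target="_blank">Second thread</a>
  <img src="" bpic="https://example.com/b.jpg" class="threadlist_pic">
</li>
'''

records = extract_threads(sample_html)
for record in records:
    print(record["name"], record["url"])
```

One caveat with this pattern: because ".*?" combined with re.S can cross list-item boundaries, a thread that has no bpic attribute may end up paired with the image of a later thread, so the results are worth spot-checking against the page.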
Conclusion
The article demonstrates a practical method for extracting Baidu Tieba thread titles and images using Python, regular expressions, and the requests library. The author plans to publish a follow‑up tutorial that shows how to achieve the same goal with XPath.
