How to Scrape Baidu Tieba Titles and Images with Python XPath
Learn how to scrape Baidu Tieba thread titles and associated images using Python's requests library and XPath expressions, with a complete example script that handles malformed HTML, removes comment markers, and outputs title‑image pairs for further processing.
In this article, the author demonstrates how to use Python to crawl Baidu Tieba threads and extract both the post titles and the embedded images using XPath.
1. Introduction
After encountering difficulties retrieving data with regular expressions, the author switches to XPath for more reliable extraction.
2. Implementation
The response HTML cannot be parsed as-is: the thread list is hidden inside HTML comments, so the comment markers must be removed before parsing. The TiebaSpider class below uses requests to fetch the page, strips the `<!--` and `-->` markers, and then applies XPath expressions to locate each thread's title and image URL.
# coding:utf-8
# @Time : 2022/5/2 10:46
# @Author: 皮皮
# @WeChat Official Account: Python共享之家
# @website : http://pdcfighting.com/
# @File : 百度贴吧.py
# @Software: PyCharm
import requests
from lxml import etree


class TiebaSpider:
    def __init__(self, name):
        self.start_url = "https://tieba.baidu.com/f?kw=" + name + "&ie=utf-8&pn=0"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
            "Cookie": "your cookie"  # placeholder: paste your own cookie here
        }

    def paser_url(self, url):
        # Send request, get response
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    # Second method: xpath extraction
    def get_content_list(self, html_str):
        # Data hidden in comments, remove comment markers
        html = etree.HTML(html_str.replace('<!--', '').replace('-->', ''))
        div_list = html.xpath('//li[contains(@class,"j_thread_list clearfix thread_item_box")]')
        print(len(div_list))
        resp = []
        for h in div_list:
            # Guard against threads without a matching title node
            titles = h.xpath('.//div/a/text()')
            title = titles[0] if titles else ''
            # Not every thread has an image; fall back to an empty string
            img = h.xpath('.//ul//img/@bpic')
            img = img[0] if img else ''
            resp.append((title, img))
        print(resp)

    def run(self):
        html_str = self.paser_url(self.start_url)
        self.get_content_list(html_str)


if __name__ == '__main__':
    tieba_spider = TiebaSpider("李毅")
    tieba_spider.run()

When run, the script prints the number of matched threads, followed by a list of (title, image URL) tuples. Threads without an image get an empty string in place of the URL.
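The comment-stripping step can be tried offline, without a cookie or a network request. The sketch below is only an illustration: `SAMPLE_HTML` is a hypothetical, heavily simplified stand-in for a Tieba list page, and `extract_pairs` is a helper name introduced here, not part of the original script. It reuses the same XPath expressions as the script and shows that the thread items are invisible to the parser until the comment markers are removed.

```python
from lxml import etree

# Hypothetical, simplified stand-in for a Tieba list page: the thread list
# sits inside an HTML comment, as the article describes.
SAMPLE_HTML = """
<html><body>
<!--
<ul>
  <li class="j_thread_list clearfix thread_item_box">
    <div><a href="/p/1">First thread</a></div>
    <ul><li><img bpic="https://example.com/a.jpg"></li></ul>
  </li>
  <li class="j_thread_list clearfix thread_item_box">
    <div><a href="/p/2">Second thread</a></div>
  </li>
</ul>
-->
</body></html>
"""

# Same thread-item XPath as in the article's script.
THREAD_XPATH = '//li[contains(@class,"j_thread_list clearfix thread_item_box")]'


def extract_pairs(html_str):
    """Return a list of (title, image_url) tuples from a list-page HTML string."""
    # Remove the comment markers so the hidden markup becomes real elements.
    cleaned = html_str.replace("<!--", "").replace("-->", "")
    root = etree.HTML(cleaned)
    pairs = []
    for item in root.xpath(THREAD_XPATH):
        titles = item.xpath(".//div/a/text()")
        images = item.xpath(".//ul//img/@bpic")
        # Fall back to empty strings when a title or image is missing.
        title = titles[0].strip() if titles else ""
        image = images[0] if images else ""
        pairs.append((title, image))
    return pairs


if __name__ == "__main__":
    # Before cleaning, the list is a comment node, so no items match.
    print(len(etree.HTML(SAMPLE_HTML).xpath(THREAD_XPATH)))
    # After cleaning, each thread yields a (title, image_url) tuple.
    print(extract_pairs(SAMPLE_HTML))
```

Returning the list instead of printing it makes the extraction easier to reuse, for example to write the pairs to a file or to download the images in a later step. Real Tieba markup is more complex and may change, so the XPath expressions may need adjusting.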
In the next article, the author plans to show how to achieve the same extraction using BeautifulSoup.