How to Scrape Baidu Tieba Titles and Images with Python XPath

Learn how to scrape Baidu Tieba thread titles and associated images using Python's requests library and XPath expressions, with a complete example script that handles malformed HTML, removes comment markers, and outputs title‑image pairs for further processing.

Python Crawling & Data Mining

May 11, 2022

In this article, the author demonstrates how to use Python to crawl Baidu Tieba threads and extract both the post titles and the embedded images using XPath.

1. Introduction

After encountering difficulties retrieving data with regular expressions, the author switches to XPath for more reliable extraction.

2. Implementation

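Tieba's response wraps the thread list in HTML comments, so an XPath query cannot see those elements until the comment markers are removed. The provided TiebaSpider class uses requests to fetch the page, strips the `<!--` and `-->` markers, parses the result with lxml's lenient HTML parser, and then applies XPath expressions to locate each thread's title and image URL.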
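
To see why the comment markers matter, here is a minimal sketch that runs the same kind of XPath query before and after stripping them. The HTML string is a made-up miniature rather than a real Tieba response, and the names `RAW_HTML`, `find_threads`, and `strip_comment_markers` are hypothetical helpers introduced only for this illustration.

```python
from lxml import etree

# Hypothetical miniature of a response: the thread list sits inside an HTML comment.
RAW_HTML = (
    '<html><body><div id="pagelet">'
    '<!--<ul><li class="j_thread_list clearfix thread_item_box">'
    '<div><a href="/p/1">Sample thread</a></div>'
    '</li></ul>-->'
    '</div></body></html>'
)

THREAD_XPATH = '//li[contains(@class, "j_thread_list")]'


def find_threads(html_str):
    """Parse the HTML leniently and return the <li> elements that match."""
    return etree.HTML(html_str).xpath(THREAD_XPATH)


def strip_comment_markers(html_str):
    """Remove '<!--' and '-->' so commented-out markup becomes real elements."""
    return html_str.replace('<!--', '').replace('-->', '')


before = find_threads(RAW_HTML)
after = find_threads(strip_comment_markers(RAW_HTML))

print(len(before), len(after))            # prints: 0 1
print(after[0].xpath('.//div/a/text()'))  # prints: ['Sample thread']
```

Removing every marker also exposes any genuine comments on the page as markup, which is acceptable here because only the thread list items are queried afterwards.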

# coding:utf-8

# @Time : 2022/5/2 10:46
# @Author: 皮皮
# @WeChat Official Account: Python共享之家
# @website : http://pdcfighting.com/
# @File : 百度贴吧.py
# @Software: PyCharm
import requests
from lxml import etree

class TiebaSpider:
    def __init__(self, name):
        self.start_url = "https://tieba.baidu.com/f?kw=" + name + "&ie=utf-8&pn=0"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
            "Cookie": "your cookie"
        }

    def parse_url(self, url):
        # Send the request and return the decoded response body
        response = requests.get(url, headers=self.headers, timeout=10)
        response.raise_for_status()
        return response.content.decode()

    # Second method: XPath extraction
    def get_content_list(self, html_str):
        # The thread list is hidden inside HTML comments, so strip the comment markers
        html = etree.HTML(html_str.replace('<!--', '').replace('-->', ''))
        li_list = html.xpath('//li[contains(@class,"j_thread_list clearfix thread_item_box")]')
        print(len(li_list))

        resp = []
        for li in li_list:
            # Take the first match, or fall back to '' when a thread has no title or image
            title = li.xpath('.//div/a/text()')
            title = title[0] if title else ''
            img = li.xpath('.//ul//img/@bpic')
            img = img[0] if img else ''
            resp.append((title, img))
        print(resp)
        return resp

    def run(self):
        html_str = self.parse_url(self.start_url)
        return self.get_content_list(html_str)

if __name__ == '__main__':
    tieba_spider = TiebaSpider("李毅")
    tieba_spider.run()
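
The spider builds its start URL by concatenating the forum name into the query string. As a hedged alternative, the sketch below builds the same URL with the standard library's `urllib.parse.urlencode`, which percent-encodes non-ASCII names such as 李毅. The helper name `build_list_url` is hypothetical, and the `pn` value is simply passed through as in the original URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://tieba.baidu.com/f"


def build_list_url(name, pn=0):
    """Build a forum list URL with the same query parameters the script uses."""
    query = urlencode({"kw": name, "ie": "utf-8", "pn": pn})
    return BASE_URL + "?" + query


print(build_list_url("李毅"))
# prints: https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85&ie=utf-8&pn=0
```

Encoding the name explicitly keeps the URL valid even if the forum name contains characters such as `&` or spaces, which plain concatenation would pass through unescaped.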

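Run with a real value in the Cookie header, the TiebaSpider script prints the number of matched threads followed by a list of (title, image URL) tuples. Threads without an image get an empty string in place of the URL.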
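
For further processing, the title and image pairs can be saved rather than only printed. The following is a minimal sketch using the standard `csv` module. The function name `write_pairs`, the column names, and the sample data are assumptions made for illustration and are not part of the original script.

```python
import csv
import io


def write_pairs(pairs, fileobj):
    """Write (title, image_url) tuples as CSV rows under a header line."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "image_url"])
    writer.writerows(pairs)


# Hypothetical sample in the same shape as the list the script prints.
sample = [
    ("First thread", "https://example.com/1.jpg"),
    ("Thread without image", ""),
]

buffer = io.StringIO()
write_pairs(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("tieba.csv", "w", newline="", encoding="utf-8")`. The file name here is only an example.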

In the next article, the author plans to show how to achieve the same extraction using BeautifulSoup.
