How to Scrape Baidu Tieba Images & Videos with Python: A Step‑by‑Step Guide

This tutorial explains how to use Python's requests, lxml, and urllib libraries to search Baidu Tieba by keyword, get past basic anti‑crawling checks, extract image and video URLs with XPath, and save the media files locally, with code examples.

Python Crawling & Data Mining

Dec 20, 2020

Project Background

Baidu Tieba is the largest Chinese forum, and users often want to download images or videos that appear in comment sections.

Project Goal

Automatically save the retrieved images or videos into a local folder.

Libraries and Target Site

Target URL: https://tieba.baidu.com/f?ie=utf-8&kw=吴京&fr=search

Required libraries: requests, lxml, urllib

Project Analysis

1. Handling anti‑crawling measures

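Requests sent without browser-like headers receive no data, and many requests from the same IP in a short time quickly get that IP blocked. The fix for the first problem is to send normal browser HTTP request headers (at minimum a User-Agent); keeping the request rate low reduces the risk of the second.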
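As a minimal sketch of this idea, the helper below sends a browser-like User-Agent and pauses before each request. The function name polite_get, the one-second delay, and the timeout are illustrative assumptions, not values from the original project.

```python
import time

import requests

# Browser-like header; the same User-Agent string the spider class uses.
HEADERS = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)"
}


def polite_get(url, delay_seconds=1.0, timeout=10):
    """Pause briefly, then fetch url with browser-like headers.

    delay_seconds and timeout are hypothetical defaults chosen for
    illustration; tune them for your own use.
    """
    time.sleep(delay_seconds)
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.content


# Build (but do not send) a request to inspect the outgoing headers.
prepared = requests.Request(
    "GET", "https://tieba.baidu.com/f", headers=HEADERS
).prepare()
print(prepared.headers["User-Agent"])
```

Pacing requests this way does not guarantee that an IP will never be blocked; it only lowers the request rate.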

2. Implementing keyword search

Insert the URL‑encoded keyword into the kw parameter of the URL (the kw={} placeholder in the code below) and iterate over the result pages.
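The URL construction can be sketched as follows. The helper name build_page_urls is hypothetical, and the step of 50 for the pn offset is an assumption about how Tieba paginates thread lists; verify it against the live site before relying on it.

```python
from urllib import parse

BASE_URL = "https://tieba.baidu.com/f"
THREADS_PER_PAGE = 50  # assumption: pn advances by 50 per result page


def build_page_urls(keyword, pages):
    """Return one list-page URL per result page for the given keyword."""
    urls = []
    for page in range(pages):
        # urlencode percent-encodes the keyword as UTF-8
        query = parse.urlencode(
            {"kw": keyword, "ie": "utf-8", "pn": page * THREADS_PER_PAGE}
        )
        urls.append(f"{BASE_URL}?{query}")
    return urls


for url in build_page_urls("吴京", 3):
    print(url)
```

Using urlencode rather than string formatting means the keyword does not need to be quoted separately.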

Implementation

1. Define the BaiduImageSpider class

import requests
from lxml import etree
from urllib import parse

class BaiduImageSpider(object):
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name
        self.url = "http://tieba.baidu.com/f?kw={}&ie=utf-8&pn=0"
        self.headers = {
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)'
        }

    def get_parse_page(self, url, xpath):
        html = requests.get(url=url, headers=self.headers).content.decode("utf-8")
        parse_html = etree.HTML(html)
        return parse_html.xpath(xpath)

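    def get_tlink(self, url):
        # Collect the relative link of every thread on the list page
        xpath = '//div[@class="threadlist_lz clearfix"]/div/a/@href'
        t_list = self.get_parse_page(url, xpath)
        for t in t_list:
            # Thread links are relative, so prepend the Tieba host
            t_link = "http://tieba.baidu.com" + t
            self.write_image(t_link)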

    def write_image(self, t_link):
        # Match post images and embedded video URLs inside a thread page
        xpath = "//div[@class='d_post_content j_d_post_content clearfix']/img[@class='BDE_Image']/@src | //div[@class='video_src_wrapper']/embed/@data-video"
        img_list = self.get_parse_page(t_link, xpath)
        for img_link in img_list:
            data = requests.get(url=img_link, headers=self.headers).content
            # Use the last 10 characters of the URL as the file name
            filename = "百度/" + img_link[-10:]
            with open(filename, 'wb') as f:
                f.write(data)
            print("%s downloaded successfully" % filename)

    def main(self):
        url = self.url.format(self.tieba_name)
        # Fetch the thread list for the keyword and download media from each thread
        self.get_tlink(url)

if __name__ == '__main__':
    input_word = input("Enter the keyword to search for: ")
    key_word = parse.quote(input_word)
    spider = BaiduImageSpider(key_word)
    spider.main()

2. Using the Chrome Xpath plugin

Install chrome_Xpath_v2.0.2.crx, enable developer mode, load the unpacked extension, and use the plugin to copy the XPath of desired elements.

To obtain an element's XPath, right‑click the element and select “Copy XPath”.
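A copied XPath can be checked offline before it is run against live pages. The sketch below applies the expression used in write_image to a small hand-written HTML fragment; the fragment and its example.com URLs are made up for illustration and only mimic the class names the expression targets.

```python
from lxml import etree

# Hypothetical fragment imitating the structure the XPath expects.
SAMPLE_HTML = """
<html><body>
<div class="d_post_content j_d_post_content clearfix">
  <img class="BDE_Image" src="https://example.com/a.jpg">
  <img class="emoticon" src="https://example.com/smile.png">
</div>
<div class="video_src_wrapper">
  <embed data-video="https://example.com/clip.mp4">
</div>
</body></html>
"""

XPATH = (
    "//div[@class='d_post_content j_d_post_content clearfix']"
    "/img[@class='BDE_Image']/@src"
    " | //div[@class='video_src_wrapper']/embed/@data-video"
)


def extract_media_links(html):
    """Return image and video URLs matched by XPATH as plain strings."""
    tree = etree.HTML(html)
    return [str(link) for link in tree.xpath(XPATH)]


links = extract_media_links(SAMPLE_HTML)
print(links)
```

The image with class "emoticon" is skipped because the expression only accepts img elements whose class is exactly BDE_Image.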

3. Saving the data

The write_image method downloads each image or video URL and saves it under a folder named “百度”. The folder must exist beforehand.
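If creating the folder by hand is inconvenient, it can be created from code instead. The following is a minimal sketch, not part of the original script: the helper name local_path_for is hypothetical, and it derives the file name from the URL path rather than from the last 10 characters of the URL.

```python
import os
from urllib import parse


def local_path_for(url, folder="百度"):
    """Create folder if needed and return a local file path for url."""
    # exist_ok=True makes this a no-op when the folder already exists
    os.makedirs(folder, exist_ok=True)
    # Take the last path segment and ignore any query string
    name = os.path.basename(parse.urlsplit(url).path) or "unnamed"
    return os.path.join(folder, name)
```

Calling local_path_for("https://example.com/pic/abc123.jpg?x=1") would create the “百度” folder in the current directory if it is missing and return a path ending in abc123.jpg.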

Result Demonstration

Run the script, input a keyword (e.g., 吴京), and press Enter. The images are saved in the “百度” folder, and any MP4 video files from the comment section are also downloaded.

Conclusion

Avoid excessive crawling to reduce server load; moderate usage is recommended.

This project demonstrates a practical Python web‑scraping solution for Baidu Tieba, covering common anti‑crawling challenges and providing concrete code.

Hands‑on implementation helps deepen understanding of the requests library, lxml parsing, and XPath data extraction.
