
How to Scrape Images and Videos from Baidu Tieba Using Python

This tutorial explains how to build a Python web‑scraper that searches Baidu Tieba by keyword, bypasses anti‑crawling measures, extracts image and video URLs with XPath, and saves the media files locally, complete with code examples and setup instructions.

Python Crawling & Data Mining

May 18, 2020

Project Background

Baidu Tieba is the largest Chinese discussion platform, and users often want to download images or videos posted in comment sections.

Project Goals

Automatically retrieve images or videos from Tieba search results and save them to a local folder.

Libraries and Websites

Target URL example: https://tieba.baidu.com/f?ie=utf-8&kw=吴京&fr=search

Required Python libraries: requests, lxml, and urllib. urllib ships with the Python standard library; requests and lxml must be installed separately.

Project Analysis

1. Handling anti‑crawling measures

Requests sent without browser-like headers are blocked, and repeated requests from the same IP address can get that IP banned. The script therefore sends a realistic User-Agent header with every request.
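
As a rough sketch of this idea, the helpers below build the header dictionary and add a randomized pause between requests. The names build_headers and polite_sleep and the delay range are assumptions made for illustration and are not part of the original script; the User-Agent string is the one used later in this tutorial.

```python
import random
import time

# User-Agent string taken from the spider class shown later in this tutorial.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)",
}


def build_headers(extra=None):
    """Return a fresh copy of the default headers, optionally extended."""
    headers = dict(DEFAULT_HEADERS)
    if extra:
        headers.update(extra)
    return headers


def polite_sleep(min_seconds=1.0, max_seconds=3.0):
    """Pause for a random interval (hypothetical range) and return its length."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

The dictionary could then be passed as requests.get(url, headers=build_headers(), timeout=10), with polite_sleep() called between page fetches to keep the request rate low.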

2. Searching by keyword

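The search keyword goes into the kw query parameter of the Tieba URL, so the same code can be reused for any number of keywords by changing that one value. Non-ASCII keywords such as 吴京 must be URL-encoded first.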
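
A minimal sketch of the URL construction, using only the standard library. The function name build_tieba_url is hypothetical, and treating pn as a paging offset is an assumption; the parameter names kw, ie, and pn come from the URL template used in this tutorial.

```python
from urllib.parse import urlencode

BASE_URL = "http://tieba.baidu.com/f"


def build_tieba_url(keyword, pn=0):
    """Build a Tieba URL for one keyword; urlencode percent-encodes the value."""
    query = urlencode({"kw": keyword, "ie": "utf-8", "pn": pn})
    return BASE_URL + "?" + query


# Iterating over several keywords only changes the kw value.
urls = [build_tieba_url(word) for word in ["吴京", "python"]]
for url in urls:
    print(url)
```

Using urlencode avoids encoding the keyword by hand and keeps the parameters in one place.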

Implementation

1. Define the spider class

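import requests
from lxml import etree
from urllib import parse

class BaiduImageSpider(object):
    def __init__(self, tieba_name):
        # Filled in during step 2
        pass

    def main(self):
        # Filled in during the following steps
        pass

if __name__ == '__main__':
    inout_word = input("Enter the keyword to search for: ")
    # The constructor requires the keyword, so pass it in
    spider = BaiduImageSpider(inout_word)
    spider.main()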

2. Prepare URL and request headers

import requests
from lxml import etree
from urllib import parse

class BaiduImageSpider(object):
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name
        # {} is replaced with the URL-encoded keyword
        self.url = "http://tieba.baidu.com/f?kw={}&ie=utf-8&pn=0"
        # Browser-like User-Agent so the request is not rejected
        self.headers = {
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)'
        }

    def get_parse_page(self, url, xpath):
        # Fetch the page, parse it, and return the XPath matches
        html = requests.get(url=url, headers=self.headers).content.decode("utf-8")
        parse_html = etree.HTML(html)
        return parse_html.xpath(xpath)

    def main(self):
        url = self.url.format(self.tieba_name)
        # further processing

if __name__ == '__main__':
    inout_word = input("Enter the keyword to search for: ")
    # URL-encode the keyword before it goes into the query string
    key_word = parse.quote(inout_word)
    spider = BaiduImageSpider(key_word)
    spider.main()

3. Use XPath to extract data

Install an XPath helper extension for Chrome (the original recommends Chrome_XPath) to test XPath expressions against the live page before putting them in the code.

Example XPath for the thread links on the results page (each thread is then opened to look for media):

//div[@class='threadlist_lz clearfix']/div/a/@href
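
To illustrate how this expression behaves, here is a small sketch that runs it with lxml against a hand-written HTML fragment. The fragment, the thread IDs, and the extract_thread_links helper are invented for the example, and joining the results onto https://tieba.baidu.com assumes the hrefs are site-relative paths; the real page markup may differ.

```python
from urllib.parse import urljoin

from lxml import etree

# Hand-written stand-in for a results page (not real Tieba markup).
SAMPLE_HTML = """
<html><body>
  <div class="threadlist_lz clearfix">
    <div><a href="/p/1111111111">First thread</a></div>
  </div>
  <div class="threadlist_lz clearfix">
    <div><a href="/p/2222222222">Second thread</a></div>
  </div>
</body></html>
"""

THREAD_XPATH = "//div[@class='threadlist_lz clearfix']/div/a/@href"


def extract_thread_links(html):
    """Parse the HTML and return every href matched by THREAD_XPATH."""
    tree = etree.HTML(html)
    return [str(href) for href in tree.xpath(THREAD_XPATH)]


links = extract_thread_links(SAMPLE_HTML)
# Assumption: hrefs are relative, so join them onto the site root.
thread_urls = [urljoin("https://tieba.baidu.com", link) for link in links]
print(thread_urls)
```

Each resulting URL is the kind of value that would be handed to the download step as the thread link.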

4. Save the media files

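def write_image(self, t_link):
    # Requires "import os" at the top of the file
    # Match images in post bodies and embedded video sources
    xpath = "//div[@class='d_post_content j_d_post_content clearfix']/img[@class='BDE_Image']/@src | //div[@class='video_src_wrapper']/embed/@data-video"
    img_list = self.get_parse_page(t_link, xpath)
    # Create the output folder if it does not exist yet
    os.makedirs("百度", exist_ok=True)
    for img_link in img_list:
        html = requests.get(url=img_link, headers=self.headers).content
        # Use the last 10 characters of the URL as the file name
        filename = "百度/" + img_link[-10:]
        with open(filename, 'wb') as f:
            f.write(html)
        print("%s downloaded successfully" % filename)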
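
Taking the last 10 characters of the URL as the file name is simple, but it can pick up query-string characters or collide between files. One possible alternative, sketched below with the standard library only, derives the name from the URL path instead. The helpers filename_from_url and save_bytes and the fallback name are assumptions for illustration and are not part of the original script.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url, folder="百度"):
    """Build a local path from the last path segment of the URL."""
    name = os.path.basename(urlparse(url).path)
    if not name:
        # Hypothetical fallback when the URL path has no file name.
        name = "unnamed"
    return os.path.join(folder, name)


def save_bytes(data, path):
    """Write bytes to path, creating the parent folder when needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

Inside the download loop, the two could be combined as save_bytes(html, filename_from_url(img_link)).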

Effect Demonstration

Run the script and enter a keyword (e.g., "吴京"). The downloaded images and videos are saved in a folder named "百度".

Conclusion

Keep the volume of requests low so that the crawler does not overload the server. This Python crawler shows how to get past basic anti‑crawling checks with request headers, fetch and parse pages with requests and lxml, and save media files locally, which makes it a practical example for beginners.
