How to Scrape Baidu Tieba Images & Videos with Python: A Step‑by‑Step Guide
This tutorial explains how to use Python's requests, lxml, and urllib libraries to search Baidu Tieba by keyword, get past basic anti‑crawling checks, extract image and video URLs with XPath, and save the media files locally, with code examples.
Project Background
Baidu Tieba is one of the largest Chinese‑language forums, and users often want to download the images or videos that appear in thread posts.
Project Goal
Automatically save the retrieved images or videos into a local folder.
Libraries and Target Site
Target URL: https://tieba.baidu.com/f?ie=utf-8&kw=吴京&fr=search
Required libraries: requests, lxml, urllib
Project Analysis
1. Handling anti‑crawling measures
Requests sent without browser‑like headers return no data, and many rapid requests from the same IP address quickly get that IP blocked. The fix used here is to send normal browser request headers (a User-Agent) and to keep the request rate moderate.
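A minimal sketch of both precautions follows: a request that carries a browser‑like User-Agent, and a small throttle that spaces requests out. It uses the standard library's urllib.request rather than requests so that it runs without extra packages. The names build_request, Throttle, and fetch, and any interval you pass in, are illustrative assumptions and not part of the original project.

```python
import time
import urllib.request

# Same User-Agent string that the spider class in this article sends.
USER_AGENT = ("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; "
              "WOW64; Trident/4.0)")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


class Throttle:
    """Hypothetical helper: enforce a minimum gap between requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between two requests
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def fetch(url, throttle):
    """Wait for the throttle, then download the page body as bytes."""
    throttle.wait()
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return resp.read()
```

With one shared Throttle(1.0), repeated calls to fetch(url, throttle) would stay at roughly one request per second. The one‑second value is a guess to tune, not a documented limit.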
2. Implementing keyword search
Insert the URL‑encoded keyword into the kw parameter of the URL (the kw={} placeholder in the code below) and iterate over the result pages; the pn parameter in the URL selects the page.
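The URL construction described above can be sketched as follows. The helper name build_page_urls is hypothetical, and the step of 50 for pn is an assumption about how Tieba pages its thread list; check it against the live site before relying on it.

```python
from urllib import parse

BASE_URL = "https://tieba.baidu.com/f"
THREADS_PER_PAGE = 50  # assumption: pn advances by 50 per list page


def build_page_urls(keyword, pages):
    """Return list-page URLs for the first `pages` pages of a keyword."""
    urls = []
    for page in range(pages):
        # urlencode percent-encodes the keyword as UTF-8
        query = parse.urlencode({
            "kw": keyword,
            "ie": "utf-8",
            "pn": page * THREADS_PER_PAGE,
        })
        urls.append(BASE_URL + "?" + query)
    return urls


for url in build_page_urls("吴京", 3):
    print(url)
```

Because urlencode already percent‑encodes the keyword, the raw string is passed in directly; running parse.quote on it first would encode it twice.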
Implementation
1. Define the BaiduImageSpider class
import requests
from lxml import etree
from urllib import parse


class BaiduImageSpider(object):
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name
        # {} is filled with the URL-encoded keyword; pn=0 requests the first page
        self.url = "http://tieba.baidu.com/f?kw={}&ie=utf-8&pn=0"
        self.headers = {
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)'
        }

    def get_parse_page(self, url, xpath):
        # Fetch a page and return whatever the XPath expression matches
        html = requests.get(url=url, headers=self.headers).content.decode("utf-8")
        parse_html = etree.HTML(html)
        return parse_html.xpath(xpath)

    def get_tlink(self, url):
        # Collect the relative link of every thread on the list page
        xpath = '//div[@class="threadlist_lz clearfix"]/div/a/@href'
        t_list = self.get_parse_page(url, xpath)
        for t in t_list:
            # Build the absolute thread URL on the same host as the list page
            t_link = "http://tieba.baidu.com" + t
            self.write_image(t_link)

    def write_image(self, t_link):
        # Match image URLs in post bodies and video URLs in embedded players
        xpath = "//div[@class='d_post_content j_d_post_content clearfix']/img[@class='BDE_Image']/@src | //div[@class='video_src_wrapper']/embed/@data-video"
        img_list = self.get_parse_page(t_link, xpath)
        for img_link in img_list:
            data = requests.get(url=img_link, headers=self.headers).content
            # Use the last 10 characters of the URL as the file name
            filename = "百度/" + img_link[-10:]
            with open(filename, 'wb') as f:
                f.write(data)
            print("%s downloaded successfully" % filename)

    def main(self):
        url = self.url.format(self.tieba_name)
        # Walk the thread list and download the media found in each thread
        self.get_tlink(url)


if __name__ == '__main__':
    inout_word = input("Enter the keyword to search for: ")
    key_word = parse.quote(inout_word)
    spider = BaiduImageSpider(key_word)
    spider.main()
2. Using the Chrome Xpath plugin
Install chrome_Xpath_v2.0.2.crx, enable developer mode, load the unpacked extension, and use the plugin to copy the XPath of desired elements.
To get an expression, right‑click the desired element and select “Copy XPath”, then paste the copied expression into the script.
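Before wiring a copied expression into the spider, it can help to try it on a small saved fragment. The sketch below does that with the standard library's xml.etree.ElementTree, which is a stand‑in for lxml here: it supports only a subset of XPath (no /@src attribute step, so the attribute is read with .get) and needs well‑formed markup. The sample markup and the function name extract_image_urls are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that imitates one Tieba post body.
SAMPLE = """
<html><body>
  <div class="d_post_content j_d_post_content clearfix">
    <img class="BDE_Image" src="https://example.com/a.jpg"/>
    <img class="emoticon" src="https://example.com/smile.png"/>
    <img class="BDE_Image" src="https://example.com/b.jpg"/>
  </div>
</body></html>
"""


def extract_image_urls(markup):
    """Return the src of every BDE_Image inside a post-content div."""
    root = ET.fromstring(markup.strip())
    path = (".//div[@class='d_post_content j_d_post_content clearfix']"
            "/img[@class='BDE_Image']")
    return [img.get("src") for img in root.findall(path)]


print(extract_image_urls(SAMPLE))
```

Real Tieba pages are not well‑formed XML, so the spider itself still relies on lxml's etree.HTML, which tolerates broken markup and accepts the full expression, including the trailing /@src.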
3. Saving the data
The write_image method downloads each image or video URL and saves it under a folder named “百度”. The folder must exist beforehand.
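One possible variation, sketched below, creates the folder on demand and takes the file name from the URL path instead of its last 10 characters, which can pick up slashes or query strings. The helper names filename_from_url and save_media and the fallback name download.bin are assumptions for illustration and are not part of the original script.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url, fallback="download.bin"):
    """Use the last path segment of the URL as the file name."""
    name = os.path.basename(urlparse(url).path)
    return name or fallback  # fall back when the path ends with "/"


def save_media(data, url, folder="百度"):
    """Write downloaded bytes into `folder`, creating it if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename_from_url(url))
    with open(path, "wb") as f:
        f.write(data)
    return path
```

Inside write_image, the open/write pair could then be replaced by a call such as save_media(data, img_link).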
Result Demonstration
Run the script, input a keyword (e.g., 吴京), and press Enter. The images are saved in the “百度” folder, and any MP4 video files from the comment section are also downloaded.
Conclusion
Avoid excessive crawling to reduce server load; moderate usage is recommended.
This project demonstrates a practical Python web‑scraping solution for Baidu Tieba, covering common anti‑crawling challenges and providing concrete code.
Hands‑on implementation helps deepen understanding of the requests library, lxml parsing, and XPath data extraction.