How to Scrape Images and Videos from Baidu Tieba Using Python
This tutorial explains how to build a Python web‑scraper that searches Baidu Tieba by keyword, bypasses anti‑crawling measures, extracts image and video URLs with XPath, and saves the media files locally, complete with code examples and setup instructions.
Project Background
Baidu Tieba is the largest Chinese discussion platform, and users often want to download images or videos posted in comment sections.
Project Goals
Automatically retrieve images or videos from Tieba search results and save them to a local folder.
Libraries and Websites
Target URL example: https://tieba.baidu.com/f?ie=utf-8&kw=吴京&fr=search
Required Python libraries: requests, lxml, and urllib (urllib is part of the Python standard library; requests and lxml must be installed separately).
Project Analysis
1. Handling anti‑crawling measures
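Requests sent without browser-like headers are blocked, and repeated requests from the same IP address can get that address banned. Sending realistic HTTP request headers addresses the first problem; keeping the request rate low reduces the risk of the second.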
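A minimal sketch of that idea, assuming the requests library: a shared session carries a browser-like User-Agent, and a short random pause before each request keeps the request rate low. The helper names make_session and polite_get and the delay values are illustrative assumptions, not part of the original script.

```python
import random
import time

import requests

# Browser-like header; the User-Agent string is the one used later in this tutorial.
HEADERS = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)"
}


def make_session():
    """Create a session that sends HEADERS with every request."""
    session = requests.Session()
    session.headers.update(HEADERS)
    return session


def polite_get(session, url, min_delay=1.0, max_delay=3.0, timeout=10):
    """Wait a random interval (assumed values), then fetch the URL."""
    time.sleep(random.uniform(min_delay, max_delay))
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    return response
```

Calling polite_get(make_session(), url) returns an ordinary requests.Response, so it can stand in wherever the tutorial calls requests.get.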
2. Searching by keyword
The keyword is inserted into the kw parameter of the Tieba URL, allowing iteration over multiple terms.
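The substitution can be sketched with the standard library alone. urllib.parse.urlencode percent-encodes the keyword, which matters for non-ASCII terms such as 吴京. The helper name build_search_url and the second keyword in the loop are assumptions made for illustration.

```python
from urllib import parse

BASE_URL = "https://tieba.baidu.com/f"


def build_search_url(keyword, pn=0):
    """Return the Tieba URL for one keyword; pn=0 requests the first page."""
    query = parse.urlencode({"kw": keyword, "ie": "utf-8", "pn": pn})
    return f"{BASE_URL}?{query}"


# Iterate over several search terms (the list is a made-up example).
for word in ["吴京", "python"]:
    print(build_search_url(word))
```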
Implementation
1. Define the spider class
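import requests
from lxml import etree
from urllib import parse


class BaiduImageSpider(object):
    def __init__(self, tieba_name):
        # Filled in during step 2
        pass

    def main(self):
        # Entry point, filled in during the following steps
        pass


if __name__ == '__main__':
    input_word = input("Enter the keyword to search for: ")
    # The constructor requires the keyword, so pass it in
    spider = BaiduImageSpider(input_word)
    spider.main()
2. Prepare URL and request headers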
import requests
from lxml import etree
from urllib import parse


class BaiduImageSpider(object):
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name
        # {} is replaced by the URL-encoded keyword; pn=0 requests the first page
        self.url = "http://tieba.baidu.com/f?kw={}&ie=utf-8&pn=0"
        self.headers = {
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)'
        }

    def get_parse_page(self, url, xpath):
        # Fetch the page, parse it, and return everything matching the XPath
        html = requests.get(url=url, headers=self.headers).content.decode("utf-8")
        parse_html = etree.HTML(html)
        return parse_html.xpath(xpath)

    def main(self):
        url = self.url.format(self.tieba_name)
        # further processing


if __name__ == '__main__':
    input_word = input("Enter the keyword to search for: ")
    # Percent-encode the keyword so non-ASCII terms are valid in the URL
    key_word = parse.quote(input_word)
    spider = BaiduImageSpider(key_word)
    spider.main()
3. Use XPath to extract data
Install the Chrome_XPath extension to obtain accurate XPath expressions.
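An expression copied from the browser can be checked offline before it goes into the spider. The sketch below runs the expression //div[@class='threadlist_lz clearfix']/div/a/@href from this section against a made-up HTML fragment, then turns the relative links into absolute URLs. The sample markup, the thread IDs, and the helper name extract_thread_urls are assumptions; the real page structure may differ.

```python
from urllib.parse import urljoin

from lxml import etree

# Invented fragment shaped to match the XPath; not copied from a real Tieba page.
SAMPLE_HTML = """
<html><body>
  <div class="threadlist_lz clearfix">
    <div><a href="/p/1111111111">First thread</a></div>
  </div>
  <div class="threadlist_lz clearfix">
    <div><a href="/p/2222222222">Second thread</a></div>
  </div>
</body></html>
"""

THREAD_XPATH = "//div[@class='threadlist_lz clearfix']/div/a/@href"


def extract_thread_urls(html, base_url="https://tieba.baidu.com/"):
    """Apply THREAD_XPATH and resolve each href against base_url."""
    tree = etree.HTML(html)
    return [urljoin(base_url, str(href)) for href in tree.xpath(THREAD_XPATH)]


print(extract_thread_urls(SAMPLE_HTML))
```

If the list comes back empty on a real page, the class names in the expression probably no longer match the markup.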
Example XPath for the links to individual threads on a results page (each thread is then scanned for media):
//div[@class='threadlist_lz clearfix']/div/a/@href
4. Save the media files
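    # Method of BaiduImageSpider
    def write_image(self, t_link):
        # Images inside post bodies, plus video URLs stored on embed elements
        xpath = "//div[@class='d_post_content j_d_post_content clearfix']/img[@class='BDE_Image']/@src | //div[@class='video_src_wrapper']/embed/@data-video"
        img_list = self.get_parse_page(t_link, xpath)
        # Create the output folder if needed (requires `import os` at the top of the file)
        os.makedirs("百度", exist_ok=True)
        for img_link in img_list:
            html = requests.get(url=img_link, headers=self.headers).content
            # The last 10 characters of the URL serve as the file name
            filename = "百度/" + img_link[-10:]
            with open(filename, 'wb') as f:
                f.write(html)
            print("%s downloaded successfully" % filename)
Effect Demonstration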
Run the script, input a keyword (e.g., "吴京"), and the program creates a folder named "百度" to store downloaded images and videos.
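The write_image method in step 4 names each file after the last ten characters of its URL, which can pick up query-string characters or collide between files. One possible alternative, sketched here with the standard library only, takes the last path segment instead. The helper names filename_from_url and save_bytes and the fallback name are assumptions, not part of the original script.

```python
import os
from urllib.parse import unquote, urlparse


def filename_from_url(url, fallback="download.bin"):
    """Return the last path segment of url, ignoring query string and fragment."""
    path = urlparse(url).path
    name = os.path.basename(unquote(path))
    return name or fallback  # URLs ending in "/" have no file name


def save_bytes(data, url, folder="百度"):
    """Write data into folder under a name derived from url; return the path."""
    os.makedirs(folder, exist_ok=True)  # create the folder on first use
    target = os.path.join(folder, filename_from_url(url))
    with open(target, "wb") as f:
        f.write(data)
    return target
```

Inside the download loop, save_bytes(response.content, img_link) would then replace the manual open and write calls.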
Conclusion
Do not scrape excessive data to avoid overloading the server. This Python crawler demonstrates how to handle common anti‑crawling techniques, use requests and lxml for parsing, and save media files locally, providing a practical example for beginners.
Python Crawling & Data Mining