How to Build a Simple Python Spider to Download Images from Baidu Tieba

This tutorial walks through using Python's urllib and regular expressions to crawl a Baidu Tieba page, extract all .jpg image URLs, and download each image locally with a sequential naming scheme.

AI Large-Model Wave and Transformation Guide

Background

A web spider (or crawler) is a program that traverses the web and retrieves resources. In Python 2 the typical modules were urllib and urllib2; in Python 3 their functionality was reorganized into the urllib package, with urllib.request providing the request-related functions such as urlopen and urlretrieve under a unified API.
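The rename can be absorbed in one place. A minimal compatibility sketch (the try/except fallback is an illustration, not part of the original script):

```python
# Resolve urlopen/urlretrieve regardless of Python version.
try:
    # Python 3: both names live in urllib.request
    from urllib.request import urlopen, urlretrieve
except ImportError:
    # Python 2: urlopen came from urllib2, urlretrieve from urllib
    from urllib2 import urlopen
    from urllib import urlretrieve

# Both names are now plain callables, usable the same way on either version.
```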

Target task

The concrete goal is to download every image whose src attribute ends with .jpg from a Baidu Tieba thread located at https://tieba.baidu.com/p/5306226942. By opening the page in Chrome DevTools the author observed that all image tags follow the pattern src="… .jpg", making a simple regular‑expression match sufficient.

Analysis and design choices

Use urllib.request.urlopen to fetch the raw HTML because it handles HTTP GET automatically.

Decode the response bytes to UTF‑8 text; this step is required only in Python 3 where read() returns bytes.

Apply a non‑greedy regular expression src="(.*?\.jpg)" to capture each image URL. The non‑greedy qualifier prevents the pattern from spanning across multiple src attributes.

Compile the regex once with re.compile for efficiency when processing large pages.

Iterate over the resulting list and download each image with urllib.request.urlretrieve, naming files sequentially (0.jpg, 1.jpg, …) to avoid filename collisions.
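The effect of the non-greedy qualifier can be seen on a small hand-written snippet (the HTML below is illustrative, not taken from the real page):

```python
import re

# Two images on one line of sample HTML (illustrative, not from the real page).
html = '<img src="http://a.example/1.jpg"><img src="http://b.example/2.jpg">'

# Non-greedy: stops at the first ".jpg", so each src is captured separately.
lazy = re.findall(r'src="(.*?\.jpg)"', html)

# Greedy: runs to the last ".jpg", swallowing both attributes in one match.
greedy = re.findall(r'src="(.*\.jpg)"', html)

print(lazy)    # ['http://a.example/1.jpg', 'http://b.example/2.jpg']
print(greedy)  # one match spanning both src attributes
```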

Implementation

#coding=utf-8
import urllib.request
import re

# Target page containing the images
url = "https://tieba.baidu.com/p/5306226942"

# Fetch the page
page = urllib.request.urlopen(url)
html_bytes = page.read()

# Convert bytes to a Unicode string (Python 3)
html = html_bytes.decode('utf-8')

# Regular expression that captures .jpg URLs inside src attributes
pattern = r'src="(.*?\.jpg)"'
img_regex = re.compile(pattern)

# Extract all matching URLs
img_list = re.findall(img_regex, html)

# Download each image, naming them 0.jpg, 1.jpg, ...
for idx, img_url in enumerate(img_list):
    urllib.request.urlretrieve(img_url, f"{idx}.jpg")

Line‑by‑line explanation

#coding=utf-8 declares the source file encoding, ensuring that any non‑ASCII literals are interpreted correctly.

import urllib.request loads the module that provides urlopen and urlretrieve for HTTP operations.

import re loads the regular‑expression engine.

url = "..." stores the target thread URL.

page = urllib.request.urlopen(url) sends a GET request and returns a response object.

html_bytes = page.read() reads the response body as raw bytes.

html = html_bytes.decode('utf-8') converts the byte stream to a Unicode string, a mandatory step in Python 3.

pattern = r'src="(.*?\.jpg)"' defines a non‑greedy pattern that captures any substring ending with .jpg inside a src attribute.

img_regex = re.compile(pattern) compiles the pattern for repeated use.

img_list = re.findall(img_regex, html) returns a list of all matched image URLs.

for idx, img_url in enumerate(img_list): iterates over the URLs, pairing each with an index.

urllib.request.urlretrieve(img_url, f"{idx}.jpg") downloads each image and saves it locally with a sequential filename.
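The decode‑match‑enumerate pipeline can be exercised without touching the network; the byte string below stands in for page.read() (illustrative data, not the real thread):

```python
import re

# Stand-in for page.read(): raw bytes as an HTTP response would deliver them.
html_bytes = b'<img src="http://i.example/cat.jpg"><p>text</p><img src="http://i.example/dog.jpg">'

# Step 1: bytes -> str, as in the script.
html = html_bytes.decode('utf-8')

# Step 2: extract every .jpg URL with the same non-greedy pattern.
img_regex = re.compile(r'src="(.*?\.jpg)"')
img_list = img_regex.findall(html)

# Step 3: pair each URL with its sequential local filename.
names = [(f"{idx}.jpg", img_url) for idx, img_url in enumerate(img_list)]
print(names)  # [('0.jpg', 'http://i.example/cat.jpg'), ('1.jpg', 'http://i.example/dog.jpg')]
```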

Result

Running the script creates a series of files (0.jpg, 1.jpg, …) in the current working directory, each containing one of the images extracted from the Baidu Tieba thread. The entire process requires only a few lines of Python code, demonstrating how a minimal web crawler can be built using the standard library.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
