How to Build a Python Web Scraper that Automatically Downloads and Organizes Images
This tutorial walks through building a Python web scraper that extracts images from a target site, works around basic anti‑scraping measures, saves the images into a folder with article titles as filenames, and reports each download in the console. Along the way it explains the required libraries, the code structure, and some best practices.
Project Background
BoHaiShiBei is an online education team focused on internet professionals. This article shows how to use Python to crawl images from their platform, organize them, and save them into a local folder.
Project Goal
Create a folder, download all article images into it, and display success messages in the console.
Project Analysis
1. Identify the real request URLs by inspecting network traffic in the browser's developer tools (F12). The listing pages follow a pagination pattern such as https://bh.sb/page/1/, https://bh.sb/page/2/, and so on, so a loop can generate the URLs.
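The pagination pattern above can be turned into a list of URLs with a short loop. This is a minimal sketch: the names BASE_URL and page_urls are made up for illustration, and the template simply follows the two example URLs given in the text.

```python
# Template derived from the example URLs https://bh.sb/page/1/ and https://bh.sb/page/2/
BASE_URL = "https://bh.sb/page/{}/"


def page_urls(start, end):
    """Return the listing-page URLs for pages start..end, inclusive."""
    return [BASE_URL.format(page) for page in range(start, end + 1)]


for url in page_urls(1, 3):
    print(url)
```

Keeping the template in one constant means only that line changes if the site's URL scheme changes.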
Anti‑Scraping Measures
• Use legitimate HTTP request headers.
• Generate random User‑Agent strings with fake_useragent.
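The idea behind both points can be sketched without any third-party package. The example below is a stand-in for fake_useragent, not its API: it picks a User‑Agent at random from a small hard-coded list. The list contents, the extra Accept headers, and the name build_headers are assumptions made for illustration.

```python
import random

# Hypothetical pool of browser User-Agent strings (a stand-in for fake_useragent's ua.random)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_headers():
    """Return browser-like request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    }


print(build_headers()["User-Agent"])
```

Calling build_headers() once per request, rather than once per run, rotates the User‑Agent across requests.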
Libraries and Tools
Required libraries: requests, lxml, fake_useragent, time, os. Development tool: PyCharm.
Implementation
1. Class Definition
import os
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


class bnotiank(object):
    def __init__(self):
        # Create the output folder ("图片" means "pictures"); exist_ok avoids an error on reruns
        os.makedirs("图片", exist_ok=True)

    def main(self):
        pass


if __name__ == '__main__':
    Siper = bnotiank()
    Siper.main()
2. Random UserAgent and Headers
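        # Inside __init__: pick one random User-Agent for the request headers.
        # The original passed verify_ssl=False, which newer fake_useragent releases may not accept.
        ua = UserAgent()
        self.headers = {
            'User-Agent': ua.random
        }
3. Send Request and Get Response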
    def get_page(self, url):
        # A timeout keeps a stalled connection from hanging the crawl
        res = requests.get(url=url, headers=self.headers, timeout=10)
        html = res.content.decode("utf-8")
        return html
4. Parse Page for Image Links
    def parse_page(self, html):
        parse_html = etree.HTML(html)
        # Links to the individual article pages found on the listing page
        image_src_list = parse_html.xpath('//p/a/@href')
5. Extract Image URL and Title
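The XPath expressions used in this step and the previous one can be tried offline. The following is a minimal sketch, assuming lxml is installed. The inline HTML strings, the example.com URLs, and the helper names extract_links and extract_image are made up for illustration and are not taken from the target site.

```python
from lxml import etree

# Hypothetical listing page: each <p><a href=...> points to an article
LISTING_HTML = """
<html><body>
  <p><a href="https://example.com/post/1">Post 1</a></p>
  <p><a href="https://example.com/post/2">Post 2</a></p>
</body></html>
"""

# Hypothetical article page with the class names used in the tutorial's XPath
ARTICLE_HTML = """
<html><body>
  <div><div class="content">
    <h1 class="article-title"><a href="#"> Sample title </a></h1>
    <article class="article-content">
      <p><img src="https://example.com/img/1.jpg"></p>
    </article>
  </div></div>
</body></html>
"""


def extract_links(html):
    """Return the href of every <a> that is a direct child of a <p>."""
    return etree.HTML(html).xpath('//p/a/@href')


def extract_image(html):
    """Return (image_url, title) for an article page, or None if either is missing."""
    tree = etree.HTML(html)
    srcs = tree.xpath('//div//div[@class="content"]'
                      '//article[@class="article-content"]//p/img/@src')
    titles = tree.xpath('//h1[@class="article-title"]//a/text()')
    if not srcs or not titles:
        return None  # avoids the IndexError that a bare [0] would raise
    return srcs[0], titles[0].strip()


print(extract_links(LISTING_HTML))
print(extract_image(ARTICLE_HTML))
```

Checking for an empty result before indexing lets the crawler skip articles that contain no image instead of stopping with an exception.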
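        for i in image_src_list:
            # Fetch and parse each article page (this step defines parse_html1)
            parse_html1 = etree.HTML(self.get_page(i))
            reo = parse_html1.xpath('//div//div[@class="content"]')
            for j in reo:
                d = j.xpath('.//article[@class="article-content"]//p/img/@src')[0]
                text = parse_html1.xpath('//h1[@class="article-title"]//a/text()')[0].strip()
6. Download Image and Save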
                html2 = requests.get(url=d, headers=self.headers).content
                # Save into the "图片" folder created in __init__
                dirname = "./图片/" + text + ".jpg"
                with open(dirname, 'wb') as f:
                    f.write(html2)
                print("%s [download succeeded]" % text)
7. Execute Workflow
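            # Inside main(), in the loop over page numbers; self.url is assumed to be
            # a template such as "https://bh.sb/page/{}/"
            url = self.url.format(page)
            print(url)
            html = self.get_page(url)
            self.parse_page(html)
8. Delay Between Requests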
            time.sleep(1)  # pause between requests to reduce the risk of an IP block
Result Demonstration
Running the script shows start and end page inputs, prints download success messages in the console, and saves images with titles as filenames.
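Because titles are used as filenames, a title containing characters such as "/" or "?" can make the save step fail. The following is a minimal sketch of one way to guard against that. The helper names safe_filename and target_path, the replacement character, and the length limit are assumptions for illustration and not part of the original script.

```python
import re
from pathlib import Path


def safe_filename(title, suffix=".jpg", max_length=100):
    """Turn an article title into a filename that is valid on common filesystems."""
    # Replace path separators, characters Windows rejects, and control whitespace
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title).strip(" ._")
    if not cleaned:
        cleaned = "untitled"  # fallback when nothing usable is left
    return cleaned[:max_length] + suffix


def target_path(folder, title):
    """Build the full path of the image file inside the output folder."""
    return Path(folder) / safe_filename(title)


print(target_path("图片", "What is HTTP/2? A quick tour"))
```

The path returned by target_path can be passed directly to open(..., 'wb') in place of a hand-concatenated string.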
Summary
Avoid scraping more data than you need, so the crawl does not overload the server. This guide shows how to work around basic anti‑scraping measures, uses Python libraries to fetch and organize images, and gives practice with XPath, string concatenation, and the format function. Practice and troubleshooting are essential for deeper comprehension.
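One way to keep the load on the server low is to pause between requests and vary the pause slightly. The following is a minimal sketch of that idea. The names polite_delay and crawl, the default timings, and the injectable sleep and fetch parameters are assumptions for illustration; the original script uses a fixed time.sleep(1).

```python
import random
import time


def polite_delay(base=1.0, jitter=0.5, sleep=time.sleep):
    """Pause for base seconds plus a random extra of up to jitter seconds."""
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay


def crawl(urls, fetch, base=1.0, jitter=0.5, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the first request
            polite_delay(base, jitter, sleep)
        results.append(fetch(url))
    return results


# Demo with a fake fetch function and a recorded (not real) sleep
recorded = []
pages = crawl(["page-1", "page-2"], fetch=str.upper, sleep=recorded.append)
print(pages, len(recorded))
```

Passing sleep and fetch as parameters keeps the pacing logic testable without touching the network or actually waiting.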
Python Crawling & Data Mining