
How to Build a Python Web Scraper that Automatically Downloads and Organizes Images

This tutorial walks you through creating a Python web scraper that extracts images from a target site, handles anti‑scraping measures, saves them into categorized folders, and logs the download results, while explaining the required libraries, code structure, and best practices.

Python Crawling & Data Mining

Jul 29, 2020

Project Background

BoHaiShiBei is an online education team focused on internet professionals. This article shows how to use Python to crawl images from its platform, categorize them, and save them into a local folder.

Project Goal

Create a folder, download all article images into it, and display success messages in the console.

Project Analysis

1. Identify the real request URLs by inspecting network traffic in the browser developer tools (F12). The listing pages follow a pagination pattern: https://bh.sb/page/1/, https://bh.sb/page/2/, and so on, so the URLs can be generated in a loop by substituting the page number.
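The pagination pattern above can be turned into a list of listing-page URLs with str.format. This is a minimal sketch: the helper name build_page_urls and its parameters are illustrative and not part of the original script.

```python
BASE_URL = "https://bh.sb/page/{}/"  # pagination pattern observed in the network panel


def build_page_urls(start, end):
    """Return the listing-page URLs for pages start..end (inclusive)."""
    return [BASE_URL.format(page) for page in range(start, end + 1)]


# Print the first three listing-page URLs.
for url in build_page_urls(1, 3):
    print(url)
```

Keeping the template in one constant means a change in the site's URL scheme only needs to be fixed in one place.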

Anti‑Scraping Measures

• Use legitimate HTTP request headers.
• Generate random User-Agent strings with fake_useragent.

Libraries and Tools

Required libraries: requests, lxml, and fake_useragent (third-party), plus time and os from the standard library. Development tool: PyCharm.

Implementation

1. Class Definition

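import os
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


class bnotiank(object):
    def __init__(self):
        # Create the output folder used by the download step (./d/).
        # exist_ok=True avoids an error when the script is run again.
        os.makedirs("d", exist_ok=True)
        # Listing-page URL template; {} is filled with the page number.
        self.url = "https://bh.sb/page/{}/"

    def main(self):
        pass


if __name__ == '__main__':
    Siper = bnotiank()
    Siper.main()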

2. Random UserAgent and Headers

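# Inside __init__: build the request headers once with a random User-Agent.
# Older fake_useragent releases accepted verify_ssl=False; the argument is
# omitted here because newer releases may reject it.
ua = UserAgent()
self.headers = {
    'User-Agent': ua.random
}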

3. Send Request and Get Response

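def get_page(self, url):
    # Fetch the page with the random User-Agent headers and return its HTML as text.
    # The timeout keeps a stalled connection from hanging the whole run.
    res = requests.get(url=url, headers=self.headers, timeout=10)
    html = res.content.decode("utf-8")
    return html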

4. Parse Page for Image Links

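def parse_page(self, html):
    parse_html = etree.HTML(html)
    # Despite its name, image_src_list holds link targets (href values)
    # found on the listing page; step 5 opens them one by one.
    image_src_list = parse_html.xpath('//p/a/@href')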
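To see what the XPath expression in this step selects, it can be run against a small hand-written page. This is only an illustration: the sample markup and the example.com links are made up, and the real markup on the site may differ.

```python
from lxml import etree

# Hypothetical listing-page fragment, written for this example only.
SAMPLE_HTML = """
<html><body>
  <p><a href="https://example.com/post/1/">First article</a></p>
  <p><a href="https://example.com/post/2/">Second article</a></p>
  <div><a href="https://example.com/about/">Not inside a p element</a></div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)
# //p/a/@href selects the href attribute of every <a> that is a direct child of a <p>.
links = tree.xpath('//p/a/@href')
print(links)
```

The link inside the div is skipped because its parent is not a p element, which is why the choice of path matters when the page has navigation or footer links.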

5. Extract Image URL and Title

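# The original omits the fetch; parse_html1 is the parsed page behind each link from step 4.
for i in image_src_list:
    html1 = self.get_page(i)
    parse_html1 = etree.HTML(html1)
    reo = parse_html1.xpath('//div//div[@class="content"]')
    for j in reo:
        d = j.xpath('.//article[@class="article-content"]//p/img/@src')[0]
        text = parse_html1.xpath('//h1[@class="article-title"]//a/text()')[0].strip()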
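An src attribute is not always an absolute URL. If the site returns relative or protocol-relative paths (an assumption; the article does not say), urljoin from the standard library can resolve them against the page URL before the download. The helper name absolute_image_url and the example paths below are hypothetical.

```python
from urllib.parse import urljoin


def absolute_image_url(page_url, src):
    """Resolve an <img> src value against the URL of the page it came from."""
    return urljoin(page_url, src)


page = "https://bh.sb/page/1/"
print(absolute_image_url(page, "/uploads/a.jpg"))             # root-relative path
print(absolute_image_url(page, "//cdn.example.com/a.jpg"))    # protocol-relative URL
print(absolute_image_url(page, "https://example.com/a.jpg"))  # already absolute
```

Absolute URLs pass through unchanged, so the helper is safe to apply to every src value.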

6. Download Image and Save

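# Download the image bytes and save them under ./d/ with the article title as the filename.
html2 = requests.get(url=d, headers=self.headers).content
filename = "./d/" + text + ".jpg"
with open(filename, 'wb') as f:
    f.write(html2)
print("%s [Download successful!]" % text)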
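Article titles can contain characters that are not valid in filenames (for example / or ?), which would make open fail or write outside the target folder. Below is a minimal sketch of a sanitizing step. The helper names safe_filename and save_image are hypothetical and not from the original script, and the usage writes placeholder bytes to a temporary folder instead of performing a real download.

```python
import os
import re
import tempfile


def safe_filename(title, default="untitled"):
    """Replace characters that are unsafe in filenames with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title).strip(" ._")
    return cleaned or default


def save_image(data, folder, title):
    """Write image bytes to folder/<title>.jpg and return the path."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, safe_filename(title) + ".jpg")
    with open(path, "wb") as f:
        f.write(data)
    return path


# Usage with placeholder bytes in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(b"fake image bytes", os.path.join(tmp, "d"), "What is A/B testing?")
    print(os.path.basename(saved))
```

Building the path with os.path.join rather than string concatenation also keeps the separator correct across operating systems.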

7. Execute Workflow

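# Inside main(): read the page range, then fetch and parse each listing page.
start = int(input("Start page: "))
end = int(input("End page: "))
for page in range(start, end + 1):
    url = self.url.format(page)
    print(url)
    html = self.get_page(url)
    self.parse_page(html)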

8. Delay Between Requests

time.sleep(1)  # prevent IP blocking
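A fixed one-second pause works, but a small random jitter makes the request rhythm less uniform. This is a minimal sketch: the function name polite_sleep and its default values are assumptions, not settings from the original article.

```python
import random
import time


def polite_sleep(base=1.0, jitter=0.5):
    """Sleep for base seconds plus a random extra of up to jitter seconds."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay


# Example: pause between three simulated page requests, using short delays.
for page in range(1, 4):
    waited = polite_sleep(base=0.01, jitter=0.01)
    print("page %d done, waited %.3f s" % (page, waited))
```

In the scraper itself the call would sit at the end of the page loop, in place of the fixed time.sleep(1).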

Result Demonstration

Running the script prompts for the start and end page numbers, prints a success message in the console for each downloaded image, and saves each image with its article title as the filename.

Summary

Do not scrape excessive data, to avoid overloading the server. This guide covers basic ways to handle anti-scraping measures (legitimate request headers, a random User-Agent, and a delay between requests) and uses Python libraries to fetch images and save them by title. It also gives practice with XPath, string concatenation, and the format function. Hands-on practice and troubleshooting are essential for deeper understanding.
