How to Scrape High‑Resolution Images from Huaban with Selenium and XPath

This tutorial demonstrates how to use Python, Selenium, and XPath to crawl Huaban's dynamic image boards, download high‑resolution pictures, organize them into folders, and handle varying page structures, providing a practical example of web‑scraping automation.

MaGe Linux Operations

Jun 14, 2017

1. Introduction

The author discovered the image‑sharing site Huaban and decided to use Selenium with XPath to crawl the "beauty" board, saving images into folders named after each category. Because the site loads content dynamically, scrolling simulation is required to retrieve more images.

2. Environment

IDE: PyCharm

Python 3.6

lxml 3.7.2

Selenium 3.4.0

requests 2.12.4

3. Example Analysis

The script first opens http://huaban.com/boards/favorite/beauty, extracts the URL of every image category, and then visits each category page to collect image URLs. The first version obtained only low‑resolution thumbnails (236×354); a second version follows each thumbnail into its detail page to fetch the high‑resolution image.
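Since the site loads further thumbnails only as the page scrolls, collecting every image URL on a category page needs a scrolling step. The following is a minimal sketch of one common way to do that, not the original script: it scrolls to the bottom repeatedly until the page height stops growing. The function name scroll_to_bottom and the pause and max_rounds parameters are illustrative assumptions; browser is any Selenium WebDriver instance.

```python
import time


def scroll_to_bottom(browser, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing.

    Returns the number of scrolls that actually loaded new content.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    for rounds in range(max_rounds):
        # Jump to the bottom so the page requests its next batch of images.
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the dynamic content time to load
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new appeared, so the bottom was reached
        last_height = new_height
    return max_rounds
```

The max_rounds cap keeps the loop from running forever on a board that never stops loading.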

4. Practical Code

1) Import required modules.

2) Configure the WebDriver (Firefox, or headless PhantomJS). When using PhantomJS, pass the command‑line switches '--load-images=false' and '--disk-cache=true' to speed up crawling. Create a WebDriverWait with a 10‑second timeout and set the browser window size.
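As a rough sketch of this configuration step, the snippet below starts headless PhantomJS with the two switches named above, sets a window size, and creates a 10‑second WebDriverWait. It assumes Selenium 3.x (the article lists 3.4.0), because webdriver.PhantomJS was removed in Selenium 4. The names make_browser, SERVICE_ARGS, WAIT_SECONDS and the 1400×900 window size are illustrative choices, not taken from the original script.

```python
# PhantomJS command-line switches named in the text.
SERVICE_ARGS = ["--load-images=false", "--disk-cache=true"]
WAIT_SECONDS = 10
WINDOW_SIZE = (1400, 900)  # assumed value; the article does not state a size


def make_browser():
    """Return (browser, wait) for a headless PhantomJS session."""
    # Imported here so the module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    # webdriver.PhantomJS is available in Selenium 3.x only.
    browser = webdriver.PhantomJS(service_args=SERVICE_ARGS)
    browser.set_window_size(*WINDOW_SIZE)
    wait = WebDriverWait(browser, WAIT_SECONDS)
    return browser, wait
```

On a current Selenium release, a headless Firefox or Chrome session would be the closest replacement; the wait and window-size calls stay the same.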

3) Define a parser(url, param) function that parses a page given its URL and a waiting condition, returning the needed elements (e.g., buttons, images).
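One possible shape for such a function is sketched below. It assumes that param is an XPath string identifying the element to wait for, and that the result is an lxml tree on which further XPath queries run. The article describes a two‑argument parser(url, param); here browser and wait are passed in explicitly, which is a choice made for this sketch so the function has no hidden globals.

```python
def parser(url, param, browser, wait):
    """Load url, wait for the element matched by the XPath in param,
    and return the rendered page as an lxml tree."""
    # Imported here so the module still loads where Selenium or lxml is missing.
    from lxml import etree
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    browser.get(url)
    # Block until the target element exists in the DOM (or the wait times out).
    wait.until(EC.presence_of_element_located((By.XPATH, param)))
    # Parse the rendered HTML so plain XPath queries can run against it.
    return etree.HTML(browser.page_source)
```

If the element never appears, wait.until raises a TimeoutException, so callers may want to catch it and skip that page.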

4) Parse the main page to obtain each category’s URL and name via XPath. Some category names contain illegal characters for folder names (e.g., '*'), which must be filtered out.
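The filtering step could look like the sketch below, which strips the characters Windows forbids in file and folder names. The helper name safe_folder_name and the "untitled" fallback for names that end up empty are assumptions made for illustration; the original script may filter differently.

```python
import re

# Characters Windows does not allow in file or folder names: \ / : * ? " < > |
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_folder_name(name):
    """Remove characters that cannot appear in a folder name."""
    cleaned = ILLEGAL_CHARS.sub("", name).strip()
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or "untitled"
```

For example, safe_folder_name("a*b") returns "ab".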

5) For each category page, skip the low‑resolution thumbnails and open each thumbnail’s detail page to extract the URL of the real high‑resolution image. Because different images use different DOM structures, the script queries both layouts and merges the two resulting URL lists:

    img_url += img_url2

Save each image locally under a file name built as follows (the backslashes must be escaped, otherwise \' swallows the closing quote):

    filename = 'image\\{}\\'.format(fileName) + str(i) + '.jpg'

This stores the images in an image directory alongside the script, organized into subfolders named after the categories.
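A portable variant of the saving step is sketched below. Instead of hard‑coding Windows backslashes, it builds the path with os.path.join and creates the category folder if it does not exist yet. The function names image_path and save_images, and the fetch callback that turns a URL into bytes, are hypothetical helpers introduced for this example.

```python
import os


def image_path(category, index, root="image"):
    """Return root/<category>/<index>.jpg, creating the folder if needed."""
    folder = os.path.join(root, category)
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, "{}.jpg".format(index))


def save_images(urls, category, fetch, root="image"):
    """Download every URL with fetch(url) -> bytes and write it to disk.

    Returns the list of file paths that were written.
    """
    saved = []
    for i, url in enumerate(urls):
        path = image_path(category, i, root)
        with open(path, "wb") as f:
            f.write(fetch(url))
        saved.append(path)
    return saved
```

With the requests library listed in the environment, fetch could be as simple as lambda url: requests.get(url, timeout=10).content, and the merged list img_url would be passed as urls.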

5. Conclusion

The exercise reinforced the use of Selenium and XPath for web scraping, highlighted common challenges in dynamic page analysis, and resulted in downloading over 500 high‑quality images.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
