How to Build a Self‑Healing Dynamic Proxy Pool with Scrapy and Redis
This article explains how to build a self‑healing dynamic proxy pool for 24/7 web crawling using Scrapy and Redis, covering requirements, design, implementation details, deployment steps, and a reusable Scrapy middleware example.
Why a Dynamic Proxy Pool?
Using proxy servers is the most effective way to avoid bans in web crawling, but free proxies are often unreliable and short‑lived. This guide records the process of implementing a dynamic proxy pool with Scrapy and Redis.
Requirements
Maintain a relatively stable number of proxies.
Keep a high reliability rate (aim for 90% usable proxies).
Minimize changes to existing crawler code.
Existing Solutions Reviewed
Several ready‑made projects were examined but did not fully meet the needs.
HttpProxyMiddleware – a passive proxy selection approach that adds a lot of code to the Scrapy middleware. Its problems include: most requests go through a single proxy (raising the risk of an IP ban), proxy switching is slow, the logic is tightly coupled to crawler code, and it targets Python 2.
proxy_pool – close to the desired functionality but relies on SSDB (not available) and still requires manual handling of expired proxies.
Design and Implementation
The goal is a self‑checking, self‑repairing proxy pool.
A health‑check program runs every 10 seconds to ensure all proxies are valid.
A fetch program obtains new proxies from free proxy sites when the pool size falls below a threshold (e.g., less than 5).
A scheduler monitors pool size and triggers the above programs to keep the pool balanced.
The private proxy pool is stored in Redis (SET), allowing crawlers to fetch a random proxy directly.
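The pool operations described above map onto a handful of Redis SET commands. The following is a minimal sketch, not the project's actual code: `InMemorySetStore` is a hypothetical stand-in that mimics `sadd`, `srem`, `scard`, `srandmember`, and `smembers` so the example runs without a Redis server, and the key name `hq-proxies:pool` is an assumption. A redis-py client exposes methods with the same names and could be passed in instead.

```python
import random

POOL_KEY = "hq-proxies:pool"  # hypothetical key name, not taken from the project
POOL_MIN_SIZE = 5             # threshold from the "less than 5" example above


class InMemorySetStore:
    """Stand-in exposing the Redis SET commands the pool relies on."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *members):
        target = self._sets.setdefault(key, set())
        before = len(target)
        target.update(members)
        return len(target) - before  # number of newly added members

    def srem(self, key, *members):
        target = self._sets.get(key, set())
        removed = len(target & set(members))
        target.difference_update(members)
        return removed

    def scard(self, key):
        return len(self._sets.get(key, set()))

    def smembers(self, key):
        return set(self._sets.get(key, set()))

    def srandmember(self, key):
        members = list(self._sets.get(key, set()))
        return random.choice(members) if members else None


def needs_refill(store, key=POOL_KEY, minimum=POOL_MIN_SIZE):
    """True when the pool has dropped below the refill threshold."""
    return store.scard(key) < minimum


def prune(store, is_alive, key=POOL_KEY):
    """Remove every proxy that fails the supplied health check."""
    dead = [proxy for proxy in store.smembers(key) if not is_alive(proxy)]
    if dead:
        store.srem(key, *dead)
    return dead
```

Because a SET holds no duplicates, re-adding a proxy that is already in the pool is harmless, and `srandmember` gives crawlers a uniformly random pick without any extra bookkeeping.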
Components
proxy_fetch – a Scrapy spider that crawls free proxy sites, validates proxies, and stores them in Redis.
proxy_check – a Scrapy spider that validates all proxies in the pool and removes any that fail.
start.py – a scheduler that manages the two spiders, launching them in separate threads.
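The scheduler's role can be sketched as a simple loop. This is an illustration under assumptions, not the contents of start.py: `run_check` and `run_fetch` are hypothetical plain callables standing in for the two spider launches, `pool_size` is a hypothetical callable returning the current pool size, and the `max_ticks` and `sleep` parameters exist only to make the loop easy to test.

```python
import threading
import time


def run_in_thread(job):
    """Start a job in its own thread and return the thread handle."""
    worker = threading.Thread(target=job, daemon=True)
    worker.start()
    return worker


def schedule(pool_size, run_check, run_fetch, min_size=5, interval=10.0,
             max_ticks=None, sleep=time.sleep):
    """Run a health check every tick; fetch new proxies when the pool is low."""
    tick = 0
    fetch_worker = None
    while max_ticks is None or tick < max_ticks:
        check_worker = run_in_thread(run_check)
        # Only start a fetch if none is still running and the pool is low.
        fetch_idle = fetch_worker is None or not fetch_worker.is_alive()
        if fetch_idle and pool_size() < min_size:
            fetch_worker = run_in_thread(run_fetch)
        check_worker.join()
        sleep(interval)
        tick += 1
    if fetch_worker is not None:
        fetch_worker.join()
```

Checking whether a fetch is already running before reading the pool size avoids launching a second fetch while the first is still filling the pool.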
Deployment and Usage
Update the hq-proxies.yml configuration file with Redis connection details and place it under /etc/hq-proxies.yml. Thresholds, proxy sources, and test pages can also be adjusted in this file.
The test page should be frequently accessed; a simple text file hosted on cloud storage can serve this purpose.
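A proxy can be validated against such a test page by fetching it through the proxy and comparing the body with the known content. The project does this inside a Scrapy spider; the sketch below uses only the standard library instead, and `TEST_URL` and `EXPECTED_BODY` are hypothetical placeholders rather than values from hq-proxies.yml.

```python
import http.client
import urllib.request

TEST_URL = "http://example.com/proxy-check.txt"  # hypothetical test page
EXPECTED_BODY = "ok"                             # hypothetical page content


def fetch_via_proxy(proxy, url, timeout=5.0):
    """Download url through the given proxy and return the body as text."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def is_alive(proxy, fetch=fetch_via_proxy, url=TEST_URL, expected=EXPECTED_BODY):
    """A proxy passes only if the test page comes back with the expected body."""
    try:
        body = fetch(proxy, url)
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts, and malformed responses all count as failure.
        return False
    return body.strip() == expected
```

Comparing the body, rather than only the status code, also rejects proxies that answer with an injected page or an error page of their own.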
A Dockerfile is provided for container deployment (Python 3 image from Daocloud). When running the container, map hq-proxies.yml to /etc/hq-proxies.yml.
For manual deployment, run: pip install -r requirements.txt to install dependencies.
To use the proxy pool in Scrapy, add a middleware that retrieves a random proxy from the Redis SET for each request. Requests that fail through a bad proxy are retried, and because the pool checks itself continuously, the probability of repeated failures is low.
Middleware Code Example
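import logging

from redis import StrictRedis

logger = logging.getLogger(__name__)

# LOCAL_CONFIG and PROXY_SET are assumed to be defined in the project's settings.
class DynamicProxyMiddleware(object):
    def process_request(self, request, spider):
        redis_db = StrictRedis(
            host=LOCAL_CONFIG['REDIS_HOST'],
            port=LOCAL_CONFIG['REDIS_PORT'],
            password=LOCAL_CONFIG['REDIS_PASSWORD'],
            db=LOCAL_CONFIG['REDIS_DB']
        )
        # Pick one random member of the proxy SET (bytes, or None if the pool is empty).
        proxy = redis_db.srandmember(PROXY_SET)
        if proxy is None:
            return
        proxy = proxy.decode('utf-8')
        logger.debug('Using proxy [%s] to access [%s]' % (proxy, request.url))
        request.meta['proxy'] = proxy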
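To activate the middleware, it has to be registered in the crawler's Scrapy settings. The fragment below is a sketch: the module path `myproject.middlewares` is a hypothetical location, and the priority and retry count are illustrative choices rather than values from the project.

```python
# settings.py (fragment)

# Hypothetical import path; point it at wherever DynamicProxyMiddleware lives.
PROXY_MIDDLEWARE = 'myproject.middlewares.DynamicProxyMiddleware'

DOWNLOADER_MIDDLEWARES = {
    # Scrapy's built-in HttpProxyMiddleware runs at 750 by default, so a lower
    # number lets this middleware set request.meta['proxy'] before it is read.
    PROXY_MIDDLEWARE: 543,
}

# Retried requests pass through the downloader middlewares again, so each
# retry is assigned a fresh random proxy from the pool.
RETRY_ENABLED = True
RETRY_TIMES = 3  # illustrative value
```

This keeps the change to an existing crawler down to a settings entry plus the middleware class itself.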
Python Programming Learning Circle