Master Distributed Web Crawling with Scrapy‑Redis: Setup, Architecture, and Code
This guide explains how to scale web crawling to hundreds of sites using Scrapy‑Redis. It covers the library's components, the distributed workflow, Redis installation and configuration, and proxy pool handling, with Python code examples for a spider and a pipeline.
Why Distributed Crawling?
When a project reaches a scale that requires crawling hundreds or even thousands of websites, a single spider is insufficient; multiple servers must cooperate, similar to how Baidu’s crawler operates.
Scrapy‑Redis Overview
Scrapy‑Redis is a Scrapy component built on Redis that provides four key components to quickly create simple distributed crawlers.
Components:
Scheduler: Replaces Scrapy’s in‑memory queue with a Redis queue, allowing multiple spiders to pull requests from the same database.
Duplication Filter: Uses a Redis set to store request fingerprints and filter duplicates.
Item Pipeline: Stores scraped items in a Redis items queue.
Base Spider: Provides RedisSpider, which inherits from Spider and RedisMixin and reads its URLs from Redis.
Project repository: https://github.com/rmax/scrapy-redis
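The components above are switched on through the project's settings.py. The following is a minimal sketch using the setting names documented in the scrapy-redis README; the Redis host is the same MasterIP placeholder used later in this guide, and the pipeline priority of 300 is an arbitrary assumption.

```python
# settings.py (sketch): route scheduling, deduplication and item storage through Redis.

# Scheduler: keep the request queue in Redis instead of in process memory.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Duplication filter: store request fingerprints in a shared Redis set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprint set when a spider stops, so a crawl can resume.
SCHEDULER_PERSIST = True

# Item pipeline: push scraped items to Redis.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,  # priority is an arbitrary choice
}

# Assumption: replace the placeholder host with the Master's real address.
REDIS_URL = "redis://MasterIP:6379"
```

With these settings in place, a spider that subclasses scrapy_redis.spiders.RedisSpider reads its start URLs from the Redis list named by its redis_key attribute, so every machine running the same project shares one queue.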
Scrapy‑Redis Working Mechanism
1. Slave nodes fetch tasks (Requests/URLs) from the Master node, crawl data, and submit newly generated requests back to the Master.
2. The Master node, backed by a single Redis instance, deduplicates requests, distributes tasks, and stores the crawled data.
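The two roles can be illustrated without a running Redis server. The sketch below is only an in-memory stand-in: the Master class, the slave_step function and the LINKS graph are hypothetical names invented for this illustration, and a deque and a set play the parts of the Redis queue and fingerprint set.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links found on that page.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


class Master:
    """In-memory stand-in for the Redis-backed Master."""

    def __init__(self):
        self.queue = deque()  # plays the role of the shared request queue
        self.seen = set()     # plays the role of the fingerprint set
        self.items = []       # plays the role of the items queue

    def submit(self, url):
        # Deduplicate: only URLs not seen before are enqueued.
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_task(self):
        return self.queue.popleft() if self.queue else None


def slave_step(name, master):
    """One Slave iteration: fetch a task, 'crawl' it, submit new requests, store data."""
    url = master.next_task()
    if url is None:
        return False
    for link in LINKS.get(url, []):
        master.submit(link)
    master.items.append((name, url))
    return True


def run(slaves=("slave-1", "slave-2")):
    master = Master()
    master.submit("http://example.com/")
    # Round-robin over the Slaves until a full round finds no work.
    busy = True
    while busy:
        busy = False
        for name in slaves:
            busy = slave_step(name, master) or busy
    return master
```

Calling run() crawls each of the four URLs exactly once, split between the two Slaves, even though several pages link to the same targets. In a real deployment the same guarantees come from Redis, because every Slave talks to the one shared instance.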
Preparation Before Starting
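Install Redis on the Master:
sudo apt-get install redis-server
Open the configuration file and comment out the line bind 127.0.0.1 so that Slaves can connect remotely:
sudo nano /etc/redis/redis.conf
On Redis 3.2 and later, protected mode still refuses remote clients after bind is removed unless a password is set with requirepass (or protected-mode is set to no on a trusted network).
Restart Redis so the change takes effect. This guide uses Ubuntu as the Master and Windows machines as the Slaves, with the Redis service started on each.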
Test the connection from a Slave, replacing MasterIP with the Master's address:
redis-cli -h MasterIP
If the client connects, Redis installation is complete.
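Where redis-cli is not available on a Slave, the same check can be approximated from Python using only the standard library. This is a sketch, not a replacement for a real client library: encode_command and ping are hypothetical helper names, and the code speaks just enough of the Redis wire protocol (RESP) to send a single PING.

```python
import socket


def encode_command(*parts):
    """Encode a command as a RESP array of bulk strings, the format Redis clients send."""
    out = [b"*%d\r\n" % len(parts)]
    for part in parts:
        data = part.encode("utf-8")
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def ping(host, port=6379, timeout=3.0):
    """Send PING and return the first line of the reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_command("PING"))
        reply = conn.recv(1024)
    return reply.split(b"\r\n", 1)[0].decode("utf-8", errors="replace")
```

Calling ping with the Master's address should return +PONG when the server is reachable. A reply beginning with -NOAUTH indicates that a password is required, and one beginning with -DENIED indicates that protected mode is rejecting the remote connection.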
Redis Desktop Manager, a visual management tool, can be downloaded from https://redisdesktop.com/download
Obtaining an IP Proxy Pool
Large‑scale crawlers need to rotate IPs to avoid anti‑scraping mechanisms. Free proxies vary in quality; paid proxies are more reliable. Example using Xici proxy:
<code class="language-python">import requests
import scrapy

from ..items import DailiItem  # assumes the standard Scrapy project layout, with DailiItem in items.py


class XiciSpider(scrapy.Spider):
    name = 'xici'
    allowed_domains = ['xicidaili.com']
    # Crawl the first five listing pages.
    start_urls = ['http://www.xicidaili.com/nn/' + str(i) for i in range(1, 6)]

    def parse(self, response):
        # Each table row holds the IP (column 2), port (column 3) and protocol (column 6).
        ip = response.xpath('//tr[@class]/td[2]/text()').extract()
        port = response.xpath('//tr[@class]/td[3]/text()').extract()
        agreement_type = response.xpath('//tr[@class]/td[6]/text()').extract()
        proxies = zip(ip, port, agreement_type)
        for ip, port, agreement_type in proxies:
            address = agreement_type.lower() + '://' + ip + ':' + port
            proxy = {'http': address, 'https': address}
            try:
                # Keep the proxy only if a test request through it succeeds within 2 seconds.
                resp = requests.get('http://icanhazip.com', proxies=proxy, timeout=2)
                if resp.status_code == 200:
                    item = DailiItem()
                    item['proxy'] = proxy
                    yield item
            except requests.RequestException:
                # Dead or slow proxy: skip it.
                pass
</code>Pipeline to save valid proxies:
<code class="language-python">class DailiPipeline(object):
    def __init__(self):
        # Overwrite proxy.txt on every run.
        self.file = open('proxy.txt', 'w')

    def process_item(self, item, spider):
        # Write one proxy dict per line.
        self.file.write(str(item['proxy']) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
</code>In a test run, 500 proxies were scraped but only four were usable, which shows how unreliable free proxies can be.
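The saved proxies still have to be applied to outgoing requests. One common approach in Scrapy is a downloader middleware that sets request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware then honours. The sketch below assumes the proxy.txt format written by DailiPipeline (one dict per line); RandomProxyMiddleware, load_proxies and the PROXY_FILE setting are hypothetical names, not part of Scrapy or of the original project.

```python
import ast
import random


def load_proxies(path="proxy.txt"):
    """Read the file written by DailiPipeline: one dict literal per line."""
    proxies = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                proxies.append(ast.literal_eval(line))
    return proxies


class RandomProxyMiddleware:
    """Hypothetical downloader middleware: attach a random proxy to each request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_FILE is a hypothetical setting; it defaults to the pipeline's output file.
        return cls(load_proxies(crawler.settings.get("PROXY_FILE", "proxy.txt")))

    def process_request(self, request, spider):
        if not self.proxies:
            return None  # no proxies loaded: send the request directly
        proxy = random.choice(self.proxies)
        scheme = "https" if request.url.startswith("https://") else "http"
        # The proxy URL for this request is read from request.meta['proxy'].
        request.meta["proxy"] = proxy[scheme]
        return None
```

To try it, the class would be registered under DOWNLOADER_MIDDLEWARES in settings.py with a priority below 750, so that it runs before the built-in HttpProxyMiddleware. Given how few free proxies survive validation, a production pool would also need to drop proxies that start failing.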
Project code repository: https://github.com/ZhiqiKou/Scrapy_notes
MaGe Linux Operations
