
Master Distributed Web Crawling with Scrapy‑Redis: Setup, Architecture, and Code

This guide explains how to scale web crawling to hundreds of sites with Scrapy‑Redis. It covers the library's components, the distributed workflow, Redis installation and configuration, and building a proxy pool, with Python code examples for a spider and a pipeline.

MaGe Linux Operations

Oct 13, 2018

Why Distributed Crawling?

When a project reaches a scale that requires crawling hundreds or even thousands of websites, a single spider is insufficient; multiple servers must cooperate, similar to how Baidu’s crawler operates.

Scrapy‑Redis Overview

Scrapy‑Redis is a Scrapy component built on Redis that provides four key components to quickly create simple distributed crawlers.

Components:

Scheduler: Replaces Scrapy’s in‑memory queue with a Redis queue, allowing multiple spiders to pull requests from the same database.

Duplication Filter: Uses a Redis set to store request fingerprints and filter duplicates.

Item Pipeline: Stores scraped items into a Redis items queue.

Base Spider: Uses a custom RedisSpider that inherits from Spider and RedisMixin to read URLs from Redis.
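The first three components are switched on through a project's settings.py. The fragment below is a minimal sketch using the setting names documented in the scrapy-redis README; the Redis address is a placeholder assumption and must be replaced with the Master's real address.

```python
# settings.py (sketch): route scheduling, de-duplication and item storage through Redis.

# Use the Redis-backed scheduler instead of Scrapy's in-memory queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Store request fingerprints in a shared Redis set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprint set in Redis when a spider stops,
# so that a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Push scraped items into a Redis list.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Assumption: placeholder address of the Master's Redis instance.
REDIS_URL = "redis://192.0.2.10:6379"
```

Every machine that runs the spider would use the same settings, so all of them share one queue and one fingerprint set.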

Project repository: https://github.com/rmax/scrapy-redis

Scrapy‑Redis Working Mechanism

1. Slave nodes fetch tasks (Requests/URLs) from the Master node, crawl data, and submit newly generated requests back to the Master.

2. The Master node, backed by a single Redis instance, deduplicates requests, distributes tasks, and stores the crawled data.
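The two steps above can be illustrated with a small in-memory simulation. This is only a sketch of the idea, not how Scrapy‑Redis is implemented: the Master class, worker_step function, and LINKS graph are hypothetical, a deque and a set stand in for the Redis queue and fingerprint set, and the SHA-1 fingerprint is a simplification of Scrapy's real request fingerprint.

```python
import hashlib
from collections import deque


def fingerprint(method: str, url: str) -> str:
    """Simplified request fingerprint (not Scrapy's exact algorithm)."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class Master:
    """In-memory stand-in for the Redis instance on the Master node."""

    def __init__(self):
        self.queue = deque()  # plays the role of the Redis request queue
        self.seen = set()     # plays the role of the Redis fingerprint set
        self.items = []       # plays the role of the Redis items list

    def submit(self, url: str) -> bool:
        """Enqueue a URL unless its fingerprint was already seen."""
        fp = fingerprint("GET", url)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def next_task(self):
        """Hand out the next URL, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


def worker_step(master, name, fake_links):
    """One Slave iteration: fetch a task, 'crawl' it, report results back."""
    url = master.next_task()
    if url is None:
        return False
    master.items.append({"url": url, "crawled_by": name})
    for link in fake_links.get(url, []):
        master.submit(link)  # duplicates are rejected by the Master
    return True


# Hypothetical link graph used instead of real HTTP responses.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
}

master = Master()
master.submit("http://example.com/")
workers = ["slave-1", "slave-2"]
turn = 0
# Workers take turns until the shared queue is empty.
while worker_step(master, workers[turn % len(workers)], LINKS):
    turn += 1

crawled = [item["url"] for item in master.items]
```

Although two pages link to http://example.com/b, it is queued only once, because the second submission finds its fingerprint already in the shared set.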

Preparation Before Starting

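Install Redis:

sudo apt-get install redis-server

Open redis.conf and comment out the line bind 127.0.0.1 so that slaves can connect remotely:

sudo nano /etc/redis/redis.conf

Note that on Redis 3.2 and later, protected mode still rejects remote clients unless a password is set with requirepass or protected-mode is disabled. A Redis port that accepts remote connections should be reachable only from trusted machines.

In this setup an Ubuntu machine acts as the Master and Windows machines act as Slaves; start the Redis service on each.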

Test the connection from a slave, replacing MasterIP with the Master's address:

redis-cli -h MasterIP

Redis installation is now complete.
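The same check can be scripted from a slave without installing a Redis client. The sketch below uses only the standard library and relies on Redis accepting an inline PING command and answering +PONG; the helper names is_pong and ping_redis are illustrative, not part of any library.

```python
import socket


def is_pong(reply: bytes) -> bool:
    """True if the reply is Redis's simple-string answer to PING."""
    return reply.strip() == b"+PONG"


def ping_redis(host: str, port: int = 6379, timeout: float = 3.0) -> bool:
    """Send an inline PING command and report whether Redis answered PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            return is_pong(conn.recv(64))
    except OSError:
        # Connection refused, unreachable host, or timeout.
        return False
```

Calling ping_redis with the Master's IP address should return True when the server is reachable. An error reply, for example when a password is required, is reported as False.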

Redis Desktop Manager, a visual management tool, can be downloaded from https://redisdesktop.com/download

Obtaining an IP Proxy Pool

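Large‑scale crawlers need to rotate IP addresses to avoid being blocked by anti‑scraping mechanisms. Free proxies vary widely in quality; paid proxies are more reliable. The following spider scrapes the free proxy list at xicidaili.com and keeps only the proxies that pass a quick connectivity check: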

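<code class="language-python">import requests
import scrapy


class DailiItem(scrapy.Item):
    # Assumption: in the original project this class lives in items.py;
    # it is defined here so that the example is self-contained.
    proxy = scrapy.Field()


class XiciSpider(scrapy.Spider):
    name = 'xici'
    allowed_domains = ['xicidaili.com']
    # Crawl the first five pages of the /nn/ proxy list.
    start_urls = ['http://www.xicidaili.com/nn/' + str(i) for i in range(1, 6)]

    def parse(self, response):
        # Columns 2, 3 and 6 of each table row hold the IP, port and protocol.
        ips = response.xpath('//tr[@class]/td[2]/text()').extract()
        ports = response.xpath('//tr[@class]/td[3]/text()').extract()
        agreement_types = response.xpath('//tr[@class]/td[6]/text()').extract()
        for ip, port, agreement_type in zip(ips, ports, agreement_types):
            address = agreement_type.lower() + '://' + ip + ':' + port
            proxy = {'http': address, 'https': address}
            try:
                # Blocking call: acceptable for a small one-off check,
                # but it stalls Scrapy's event loop while waiting.
                resp = requests.get('http://icanhazip.com', proxies=proxy, timeout=2)
            except requests.RequestException:
                continue  # unreachable or too slow: skip this proxy
            if resp.status_code == 200:
                item = DailiItem()
                item['proxy'] = proxy
                yield item
</code>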

Pipeline to save valid proxies:

<code class="language-python">class DailiPipeline(object):
    def __init__(self):
        self.file = open('proxy.txt', 'w')

    def process_item(self, item, spider):
        self.file.write(str(item['proxy']) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
</code>

In a test run, the spider scraped 500 proxies, of which only four were usable.
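Once proxy.txt exists, the saved proxies can be read back and rotated. The sketch below assumes the file format written by DailiPipeline above (one dict per line, as produced by str()). RandomProxyMiddleware and load_proxies are hypothetical names, and the wiring into Scrapy (a from_crawler method and a DOWNLOADER_MIDDLEWARES entry) is left out. It relies on Scrapy's built-in HttpProxyMiddleware reading the proxy from request.meta['proxy'].

```python
import ast
import random


def load_proxies(path="proxy.txt"):
    """Read back the dicts that DailiPipeline wrote, one per line."""
    proxies = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                # Each line is the str() of a dict of strings,
                # so literal_eval can parse it safely.
                proxies.append(ast.literal_eval(line))
    return proxies


class RandomProxyMiddleware:
    """Hypothetical downloader middleware that picks a proxy per request."""

    def __init__(self, proxies):
        self.proxies = proxies

    def process_request(self, request, spider):
        if self.proxies:
            # Scrapy's HttpProxyMiddleware honours request.meta['proxy'].
            request.meta["proxy"] = random.choice(self.proxies)["http"]
        return None  # let Scrapy continue processing the request
```

With so few usable free proxies, a random choice per request spreads traffic only thinly; a larger or paid pool makes the rotation more effective.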

Project code repository: https://github.com/ZhiqiKou/Scrapy_notes
