Master Scrapy: Build Powerful Python Web Crawlers Step‑by‑Step

This guide introduces the Scrapy framework and its architecture (engine, scheduler, downloader, spiders, pipelines, and middlewares), then covers installation, project setup, item definition, spider coding, pipeline handling, and pagination, with practical code examples that extract book data from Douban.

MaGe Linux Operations

Scrapy Framework

Scrapy is a Python‑based application framework designed for crawling websites and extracting structured data, useful for data mining, information processing, or historical data storage.

It leverages Twisted, an efficient asynchronous network library, to speed up downloads without requiring developers to implement async logic, and provides many middleware interfaces for flexible extensions.
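To see why asynchronous I/O matters for a crawler, here is a minimal sketch. It uses the standard library's asyncio rather than Twisted, purely so that it runs without extra packages; Scrapy itself builds on Twisted. The fake_download coroutine and the example.com URLs are hypothetical stand-ins for real network requests.

```python
import asyncio
import time


async def fake_download(url: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network request: just waits `delay` seconds."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def download_sequentially(urls):
    # Each download starts only after the previous one has finished.
    return [await fake_download(u) for u in urls]


async def download_concurrently(urls):
    # All downloads wait at the same time on one event loop.
    return await asyncio.gather(*(fake_download(u) for u in urls))


def timed(coro_func, urls):
    """Run the coroutine function and return (pages, elapsed seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(coro_func(urls))
    return pages, time.perf_counter() - start


urls = [f"https://example.com/page/{i}" for i in range(5)]
seq_pages, seq_seconds = timed(download_sequentially, urls)
con_pages, con_seconds = timed(download_concurrently, urls)
print(f"sequential: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

Both runs return the same pages; only the waiting overlaps in the concurrent version, which is the effect Scrapy gets from its event-driven downloader.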

Scrapy Architecture

Scrapy Engine : Controls data flow among all components and triggers events; it acts as the crawler’s “brain”.

Scheduler : Receives requests from the engine, queues them, filters out duplicate requests (unless a request opts out of filtering), and hands the next request back to the engine when asked.

Downloader : Fetches page data and passes the response to the engine.

Spiders : User‑written classes that parse responses and generate items and follow‑up requests.

Item Pipeline : Processes extracted items (cleaning, validation, persistence to files or databases).

Downloader Middlewares : Hooks between engine and downloader to modify requests/responses (e.g., rotating the User‑Agent or the proxy IP).

Spider Middlewares : Hooks between engine and spider to modify spider input/output.
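As an illustration of the downloader middleware hook, the sketch below sets a random User‑Agent on every outgoing request. A middleware is a plain class, and process_request(request, spider) is the hook Scrapy calls before a request reaches the downloader. The class name, the User‑Agent strings, and the SimpleNamespace stand-in for a request are assumptions made for this example.

```python
import random
from types import SimpleNamespace

# Hypothetical pool of User-Agent strings; replace with values you actually want.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: pick a random User-Agent per request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue processing the request normally.
        return None


# Stand-in for scrapy.Request so the sketch runs without Scrapy installed.
fake_request = SimpleNamespace(url="https://example.com", headers={})
RandomUserAgentMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

In a real project the class would live in middlewares.py and be enabled through the DOWNLOADER_MIDDLEWARES setting.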

Data Flow

1. The engine opens a domain, finds the appropriate spider, and requests the first batch of URLs.

2. The engine adds those URLs to the Scheduler.

3. The engine asks the Scheduler for the next URL.

4. The Scheduler returns a URL; the engine passes it through downloader middlewares to the Downloader.

5. The Downloader fetches the page, creates a Response, and sends it back through downloader middlewares to the engine.

6. The engine forwards the response through spider middlewares to the Spider.

7. The Spider parses the response and yields items and new requests.

8. The engine sends items to the Item Pipeline and new requests back to the Scheduler.

Steps 2‑8 repeat until the Scheduler has no pending requests, at which point the engine shuts down.

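Note: The crawl ends only once the Scheduler has no pending requests. Failed requests are retried by the built‑in retry middleware, but only a limited number of times (RETRY_TIMES, 2 by default); after that they are given up.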
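The data flow can be mimicked in a few lines of plain Python. This is a toy model, not Scrapy's implementation: FAKE_SITE, download, parse, and crawl are hypothetical names, requests are reduced to bare URL strings, and middlewares are left out. It shows the scheduler queue, the duplicate filter, and the loop that ends when nothing is pending.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page title, links found on the page).
FAKE_SITE = {
    "/page1": ("Page 1", ["/page2", "/page3"]),
    "/page2": ("Page 2", ["/page3"]),
    "/page3": ("Page 3", ["/page1"]),  # links back, so duplicates must be filtered
}


def download(url):
    """Downloader stand-in: look the page up instead of using the network."""
    title, links = FAKE_SITE[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Spider stand-in: yield one item (a dict), then follow-up URLs (strings)."""
    yield {"title": response["title"]}
    for link in response["links"]:
        yield link


def crawl(start_urls):
    queue = deque()  # scheduler queue
    seen = set()     # scheduler duplicate filter
    items = []       # what the item pipeline would receive

    def schedule(url):
        if url not in seen:
            seen.add(url)
            queue.append(url)

    for url in start_urls:               # steps 1-2: seed the scheduler
        schedule(url)
    while queue:                         # repeat until the scheduler is empty
        url = queue.popleft()            # steps 3-4: take the next request
        response = download(url)         # step 5: fetch the page
        for result in parse(response):   # steps 6-7: let the spider parse it
            if isinstance(result, dict):
                items.append(result)     # step 8: items go to the pipeline
            else:
                schedule(result)         # step 8: requests go to the scheduler
    return items


print(crawl(["/page1"]))
```

Because /page3 links back to /page1, the loop would never end without the seen set, which is why the Scheduler filters duplicates by default.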

Installation

pip install wheel
pip install scrapy
# Windows: avoid compiling Twisted by installing a binary wheel
pip install Twisted-18.4.0-cp35-cp35m-win_amd64.whl

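If pip fails on Windows with a Visual C++ build‑tools error while building Twisted, download a pre‑compiled Twisted wheel from Gohlke's site that matches your Python version and architecture (the file above targets 64‑bit CPython 3.5), install it, then install Scrapy again.

Once installed, running scrapy with no arguments prints the version and the available commands: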

scrapy
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench      Run quick benchmark test
  check      Check spider contracts
  crawl      Run a spider
  edit       Edit spider
  fetch      Fetch a URL using the Scrapy downloader
  genspider  Generate new spider using pre‑defined templates
  list       List available spiders
  parse      Parse URL (using its spider) and print the results
  runspider  Run a self‑contained spider (without creating a project)
  settings   Get settings values
  shell      Interactive scraping console
  startproject Create new project
  version    Print Scrapy version
  view       Open URL in browser, as seen by Scrapy
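The installation can also be checked from Python. The helper below is a small sketch using the standard library's importlib.metadata (Python 3.8+); installed_version is a hypothetical name chosen for this example.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("Scrapy", "Twisted"):
    print(name, installed_version(name) or "not installed")
```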

Project Development Workflow

Create a project: scrapy startproject <project_name>

Define items in items.py (subclass scrapy.Item and declare scrapy.Field objects).

Write spiders in spiders/ (subclass scrapy.Spider, set name, allowed_domains, start_urls, and implement parse).

Implement item pipelines in pipelines.py to process, clean, validate, or store items.

Example project structure:

first
 ├─ scrapy.cfg
 └─ first
     ├─ items.py
     ├─ middlewares.py
     ├─ pipelines.py
     ├─ settings.py
     ├─ __init__.py
     └─ spiders
         └─ __init__.py

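Key settings in settings.py include:

BOT_NAME : Name of the bot (the project), not of an individual spider.

ROBOTSTXT_OBEY = True : Obey the site's robots.txt rules.

USER_AGENT : Custom User‑Agent string sent with requests.

CONCURRENT_REQUESTS = 16 : Maximum number of concurrent requests.

DOWNLOAD_DELAY = 3 : Delay, in seconds, between consecutive requests to the same site.

COOKIES_ENABLED = False : Disable cookies unless the site needs them.

SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES : Enable middlewares and set their order; lower numbers sit closer to the engine.

ITEM_PIPELINES : Enable pipelines and set their order; lower numbers run first.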
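Put together, a settings.py fragment for the example project might look like the sketch below. The values are illustrative only, the User‑Agent string is a placeholder, and the middleware and pipeline class paths are assumptions about what the project named "first" contains.

```python
# settings.py (fragment) -- illustrative values only

BOT_NAME = "first"

ROBOTSTXT_OBEY = True
USER_AGENT = "first (+https://example.com/contact)"  # placeholder string

CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 3        # seconds between requests to the same site
COOKIES_ENABLED = False

# Class paths below are assumptions about what lives in this project.
DOWNLOADER_MIDDLEWARES = {
    "first.middlewares.FirstDownloaderMiddleware": 543,
}
ITEM_PIPELINES = {
    "first.pipelines.FirstPipeline": 300,
}
```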

Example: Scraping Douban Book Reviews

Define an item:

import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()  # book title
    rate = scrapy.Field()   # rating

Write a spider (basic version):

import scrapy
from scrapy.http import HtmlResponse

from first.items import BookItem  # assumes the project package is named "first"


class BookSpider(scrapy.Spider):
    name = 'doubanbook'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/编程?start=0&type=T']

    def parse(self, response: HtmlResponse):
        # Each book on the listing page sits in its own <li class="subject-item">
        subjects = response.xpath('//li[@class="subject-item"]')
        for s in subjects:
            item = BookItem()
            # default='' avoids calling strip() on None when the node is missing
            item['title'] = s.xpath('.//h2/a/text()').get(default='').strip()
            # rate stays None for books that have no rating yet
            item['rate'] = s.xpath('.//span[@class="rating_nums"]/text()').get()
            yield item

        # Pagination: follow the "next" link if its href carries a start offset
        next_pages = response.xpath(
            '//div[@class="paginator"]/span[@class="next"]/a/@href'
        ).re(r'.*start=\d+.*')
        for url in next_pages:
            yield scrapy.Request(response.urljoin(url))
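The per-book extraction logic can be tried offline. The sketch below swaps Scrapy's selectors for the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so it is an approximation of the spider rather than the same code. SAMPLE is hypothetical, simplified markup modelled on the selectors above, and extract_books is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the selectors used in the spider.
SAMPLE = """
<ul>
  <li class="subject-item">
    <h2><a href="/subject/1/"> Book A </a></h2>
    <span class="rating_nums">9.1</span>
  </li>
  <li class="subject-item">
    <h2><a href="/subject/2/"> Book B </a></h2>
  </li>
</ul>
"""


def extract_books(markup: str):
    """Return one dict per <li class="subject-item"> found in the markup."""
    root = ET.fromstring(markup)
    books = []
    for li in root.findall(".//li[@class='subject-item']"):
        # Queries starting with ".//" are relative to the current <li>.
        link = li.find(".//h2/a")
        rating = li.find(".//span[@class='rating_nums']")
        books.append({
            "title": (link.text or "").strip() if link is not None else "",
            "rate": rating.text if rating is not None else None,  # unrated book
        })
    return books


print(extract_books(SAMPLE))
```

The second book has no rating element, so its rate is None; a spider has to tolerate the same gap on real pages.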

Run the spider and store results:

scrapy crawl doubanbook -o books.json
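One detail to expect in books.json: unless FEED_EXPORT_ENCODING is set (for example to 'utf-8' in settings.py), Scrapy's JSON export writes non‑ASCII characters such as Chinese titles as \uXXXX escapes. The standard library's json module has the same default, which this short sketch uses to show the difference; the sample item is made up.

```python
import json

item = {"title": "编程", "rate": "9.1"}  # made-up sample item

escaped = json.dumps(item)                       # non-ASCII written as \uXXXX
readable = json.dumps(item, ensure_ascii=False)  # characters kept as-is

print(escaped)
print(readable)
```

Both strings decode to the same data; the escaped form is just harder to read in an editor.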

Item Pipeline for JSON Output

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'first.pipelines.FirstPipeline': 300,
}

Pipeline implementation (writes a JSON array to a file defined in custom_settings):

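import json


class FirstPipeline:
    def __init__(self):
        print('Pipeline initialized')

    def open_spider(self, spider):
        # 'filename' is a custom key read from the settings; fall back to items.json
        path = spider.settings.get('filename', 'items.json')
        self.file = open(path, 'w', encoding='utf-8')
        self.file.write('[\n')
        self.first = True  # no separator is needed before the first item

    def process_item(self, item, spider):
        # Write the comma before every item except the first,
        # so the array has no trailing comma and stays valid JSON
        if not self.first:
            self.file.write(',\n')
        self.first = False
        # ensure_ascii=False keeps Chinese titles readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False))
        return item

    def close_spider(self, spider):
        self.file.write('\n]')
        self.file.close()
        print('Pipeline closed')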
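A simpler variant, named plainly as a different technique, is to write JSON Lines (one JSON object per line) instead of a JSON array, which removes the need to manage separators and brackets. The sketch below is an assumption-laden example, not the article's pipeline: JsonLinesPipeline is a hypothetical class, and the SimpleNamespace object only imitates the spider.settings.get() call so the code runs without Scrapy.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class JsonLinesPipeline:
    """Variant sketch: write one JSON object per line (JSON Lines)."""

    def open_spider(self, spider):
        path = spider.settings.get("filename", "items.jl")
        self.file = open(path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand with a stand-in spider (no Scrapy needed).
out_path = os.path.join(tempfile.mkdtemp(), "books.jl")
fake_spider = SimpleNamespace(settings={"filename": out_path})

pipeline = JsonLinesPipeline()
pipeline.open_spider(fake_spider)
pipeline.process_item({"title": "Book A", "rate": "9.1"}, fake_spider)
pipeline.process_item({"title": "Book B", "rate": None}, fake_spider)
pipeline.close_spider(fake_spider)

with open(out_path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

Each line is valid JSON on its own, so a crawl that is interrupted midway still leaves a usable file.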

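The spider can set the output file through its custom_settings class attribute. Note that 'filename' is not a built‑in Scrapy setting; it is a custom key that the pipeline above reads: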

custom_settings = {
    'filename': 'o:/books.json',
}

Pagination and URL Extraction

To crawl subsequent pages, extract the “next” link with XPath and generate new requests:

urls = response.xpath('//div[@class="paginator"]/span[@class="next"]/a/@href').re(r'.*start=\d+.*')
for url in urls:
    yield scrapy.Request(response.urljoin(url))
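The URL handling in this step can be tried with the standard library alone. urllib.parse.urljoin does roughly what response.urljoin does (for HTML responses Scrapy also honors a base tag in the page). The href values below are hypothetical examples of what the paginator might contain.

```python
import re
from urllib.parse import urljoin

page_url = "https://book.douban.com/tag/编程?start=0&type=T"
# Hypothetical href values as they might appear in the page.
hrefs = ["/tag/编程?start=20&type=T", "/about"]

# Keep only links that carry a start=<number> offset, then make them absolute.
next_urls = [urljoin(page_url, h) for h in hrefs if re.search(r"start=\d+", h)]
print(next_urls)

# Pull the numeric offset out of each URL.
offsets = [int(re.search(r"start=(\d+)", u).group(1)) for u in next_urls]
print(offsets)
```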

Together, the item, spider, pipeline, and pagination logic above form a complete Scrapy project for extracting structured data from web pages.

