Master Scrapy: Build Powerful Python Web Crawlers Step‑by‑Step
This guide introduces the Scrapy framework and its architecture (engine, scheduler, downloader, spiders, pipelines, and middlewares). It then covers installation, project setup, item definition, spider coding, pipeline handling, and pagination, with practical code examples that extract book data from Douban.
Scrapy Framework
Scrapy is a Python‑based application framework designed for crawling websites and extracting structured data, useful for data mining, information processing, or historical data storage.
It leverages Twisted, an efficient asynchronous network library, to speed up downloads without requiring developers to implement async logic, and provides many middleware interfaces for flexible extensions.
Scrapy Architecture
Scrapy Engine : Controls data flow among all components and triggers events; it acts as the crawler’s “brain”.
Scheduler : Receives requests from the engine, queues them, filters out duplicate requests (unless filtering is disabled), and hands the next request back to the engine when asked.
Downloader : Fetches page data and passes the response to the engine.
Spiders : User‑written classes that parse responses and generate items and follow‑up requests.
Item Pipeline : Processes extracted items (cleaning, validation, persistence to files or databases).
Downloader Middlewares : Hooks between the engine and the downloader that can modify requests and responses (e.g., rotating the User-Agent or the proxy IP).
Spider Middlewares : Hooks between engine and spider to modify spider input/output.
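As an illustration of the downloader-middleware hook described above, here is a minimal sketch that picks a random User-Agent for each request. The class name, the USER_AGENTS pool, and the FakeRequest stand-in are assumptions made for this example; only the process_request(request, spider) hook itself is part of Scrapy's middleware interface.

```python
import random

# Hypothetical pool of User-Agent strings; replace with values suited to your crawl.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: pick a User-Agent for each outgoing request."""

    def process_request(self, request, spider):
        # Scrapy calls this hook for every request on its way to the Downloader.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request normally.
        return None


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy installed."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


request = FakeRequest("https://book.douban.com/")
result = RandomUserAgentMiddleware().process_request(request, spider=None)
print(request.headers["User-Agent"])
```

In a real project the class would live in middlewares.py and be registered in the DOWNLOADER_MIDDLEWARES setting.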
Data Flow
1. The engine opens the target domain, locates the spider that handles it, and asks that spider for the first batch of URLs to crawl.
2. The engine adds those URLs to the Scheduler as requests.
3. The engine asks the Scheduler for the next request.
4. The Scheduler returns a request; the engine passes it through the downloader middlewares to the Downloader.
5. The Downloader fetches the page, creates a Response, and sends it back through the downloader middlewares to the engine.
6. The engine forwards the response through the spider middlewares to the Spider.
7. The Spider parses the response and yields items and new requests.
8. The engine sends items to the Item Pipeline and new requests back to the Scheduler.
Steps 2‑8 repeat until the Scheduler has no pending requests, at which point the engine shuts down.
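Note: A crawl normally ends when the Scheduler has no pending requests and no downloads are still in progress. Requests that fail with network errors or certain HTTP status codes are retried by the built-in retry middleware, but only up to RETRY_TIMES extra attempts (2 by default); after that they are dropped.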
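The loop described in this section can be sketched with plain Python. This is a toy, synchronous model rather than Scrapy's actual implementation: FAKE_SITE, download, parse, and crawl are all made-up names, and middlewares, concurrency, and retries are left out.

```python
from collections import deque

# Toy site: URL -> (page title, links found on the page). Entirely made up.
FAKE_SITE = {
    "/page1": ("Page 1", ["/page2", "/page3"]),
    "/page2": ("Page 2", ["/page3"]),
    "/page3": ("Page 3", ["/page1"]),
}


def download(url):
    """Stands in for the Downloader: returns a 'response' for the URL."""
    return {"url": url, "body": FAKE_SITE[url]}


def parse(response):
    """Stands in for Spider.parse: yields one item, then follow-up URLs."""
    title, links = response["body"]
    yield {"title": title}
    for link in links:
        yield link


def crawl(start_urls):
    queue = deque()  # the Scheduler's pending requests
    seen = set()     # duplicate filter
    items = []       # what the Item Pipeline would receive

    def schedule(url):
        if url not in seen:
            seen.add(url)
            queue.append(url)

    for url in start_urls:
        schedule(url)

    while queue:  # the engine loop ends when nothing is pending
        response = download(queue.popleft())
        for result in parse(response):
            if isinstance(result, dict):
                items.append(result)   # items go to the pipeline
            else:
                schedule(result)       # new requests go back to the scheduler
    return items


items = crawl(["/page1"])
print(items)
```

Each page is fetched once even though the pages link to each other in a cycle, which is the effect of the Scheduler's duplicate filter.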
Installation
pip install wheel
pip install scrapy
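# Windows: avoid compiling Twisted by installing a binary wheel
pip install Twisted-18.4.0-cp35-cp35m-win_amd64.whl
On Windows, pip may fail with a Visual C++ build-tools error while compiling Twisted. One fix is to download a pre-compiled Twisted wheel that matches your Python version and architecture from Gohlke's site, install it as shown above, and then install Scrapy again.
Running scrapy with no arguments confirms the installation and lists the available commands: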
scrapy
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Project Development Workflow
Create a project: scrapy startproject <project_name>
Define items in items.py (subclass scrapy.Item and declare scrapy.Field objects).
Write spiders in spiders/ (subclass scrapy.Spider, set name, allowed_domains, start_urls, and implement parse).
Implement item pipelines in pipelines.py to process, clean, validate, or store items.
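To make the pipeline step concrete, the sketch below cleans an item before it is stored. It assumes items carry title and rate fields, as in the Douban example later in this guide; the class name RateToFloatPipeline is hypothetical. Plain dicts stand in for Scrapy items so the snippet runs on its own.

```python
class RateToFloatPipeline:
    """Hypothetical pipeline: tidy the title and convert the rating to a float."""

    def process_item(self, item, spider):
        # Collapse runs of whitespace (including newlines) inside the title.
        item["title"] = " ".join(item["title"].split())
        try:
            item["rate"] = float(item.get("rate"))
        except (TypeError, ValueError):
            # Missing or non-numeric rating: store None instead of failing.
            item["rate"] = None
        # Returning the item passes it on to the next pipeline in the chain.
        return item


pipeline = RateToFloatPipeline()
cleaned = pipeline.process_item({"title": "  Fluent \n Python ", "rate": " 9.4 "}, spider=None)
missing = pipeline.process_item({"title": "No rating yet", "rate": None}, spider=None)
print(cleaned, missing)
```

A pipeline like this only takes effect once it is listed in the ITEM_PIPELINES setting.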
Example project structure:
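first
├─ scrapy.cfg
└─ first
   ├─ items.py
   ├─ middlewares.py
   ├─ pipelines.py
   ├─ settings.py
   ├─ __init__.py
   └─ spiders
      └─ __init__.py

Key settings (in settings.py) include:
BOT_NAME: name of the bot (the project), not of an individual spider.
ROBOTSTXT_OBEY = True: obey the target site's robots.txt.
USER_AGENT: custom User-Agent string.
CONCURRENT_REQUESTS = 16: maximum number of requests processed in parallel.
DOWNLOAD_DELAY = 3: delay, in seconds, between consecutive requests to the same site.
COOKIES_ENABLED = False: disable cookies unless the site needs them.
SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES: map middleware class paths to integer orders; middlewares with lower numbers sit closer to the engine.
ITEM_PIPELINES: maps pipeline class paths to integer orders; items pass through pipelines from the lowest number to the highest.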
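Put together, the settings listed above might look like the following in settings.py. This is a sketch rather than a recommended configuration: the USER_AGENT value and the first.middlewares.FirstDownloaderMiddleware path are assumptions for a project named first.

```python
# settings.py (sketch): values mirror the settings described above.
BOT_NAME = "first"

ROBOTSTXT_OBEY = True
USER_AGENT = "first-bot (+https://example.com/contact)"  # hypothetical value

CONCURRENT_REQUESTS = 16   # maximum parallel requests
DOWNLOAD_DELAY = 3         # seconds between requests to the same site
COOKIES_ENABLED = False

DOWNLOADER_MIDDLEWARES = {
    # Assumed class path; a lower number places the middleware closer to the engine.
    "first.middlewares.FirstDownloaderMiddleware": 543,
    # Mapping a built-in middleware to None disables it.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}
```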
Example: Scraping Douban Book Reviews
Define an item:
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()  # book title
    rate = scrapy.Field()   # rating

Write a spider (basic version):
import scrapy
from scrapy.http.response.html import HtmlResponse

# Assumes the project package is named "first", as in the structure above
from first.items import BookItem


class BookSpider(scrapy.Spider):
    name = 'doubanbook'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/编程?start=0&type=T']

    def parse(self, response: HtmlResponse):
        # Each book on the listing page is an <li class="subject-item">
        subjects = response.xpath('//li[@class="subject-item"]')
        for s in subjects:
            # default='' avoids calling .strip() on None when the node is missing
            title = s.xpath('.//h2/a/text()').get(default='').strip()
            rate = s.xpath('.//span[@class="rating_nums"]/text()').get()
            item = BookItem()
            item['title'] = title
            item['rate'] = rate
            yield item

        # pagination example (next page link extraction)
        next_pages = response.xpath('//div[@class="paginator"]/span[@class="next"]/a/@href').re(r'.*start=\d+.*')
        for url in next_pages:
            yield scrapy.Request(response.urljoin(url))

Run the spider and store results:
scrapy crawl doubanbook -o books.json

Item Pipeline for JSON Output
Enable the pipeline in settings.py:
ITEM_PIPELINES = {
    'first.pipelines.FirstPipeline': 300,
}

Pipeline implementation (writes a JSON array to a file defined in custom_settings):
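import json  # standard-library json replaces simplejson here; no extra install needed


class FirstPipeline(object):
    def __init__(self):
        print('Pipeline initialized')

    def open_spider(self, spider):
        # 'filename' is a custom settings key; fall back to items.json if it is unset
        self.file = open(spider.settings.get('filename', 'items.json'), 'w', encoding='utf-8')
        self.file.write('[\n')
        self.first_item = True

    def process_item(self, item, spider):
        # Write the separator before every item except the first, so the array
        # has no trailing comma and remains valid JSON
        if not self.first_item:
            self.file.write(',\n')
        self.first_item = False
        # ensure_ascii=False keeps Chinese titles readable in the output file
        self.file.write(json.dumps(dict(item), ensure_ascii=False))
        return item

    def close_spider(self, spider):
        self.file.write('\n]')
        self.file.close()
        print('Pipeline closed')

The spider can specify the output file via custom_settings: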
custom_settings = {
    'filename': 'o:/books.json',
}

Pagination and URL Extraction
To crawl subsequent pages, extract the “next” link with XPath and generate new requests:
urls = response.xpath('//div[@class="paginator"]/span[@class="next"]/a/@href').re(r'.*start=\d+.*')
for url in urls:
    yield scrapy.Request(response.urljoin(url))

Together, these components are enough to build a robust Scrapy project for extracting structured data from web pages.
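To see what the pagination code does with a link, the standalone sketch below applies the same regular expression and URL joining using only the standard library. The hrefs list is made-up sample data, and urljoin is only an approximation of response.urljoin, which additionally honours a base tag in the page.

```python
import re
from urllib.parse import urljoin

# URL of the page being parsed (the spider's start URL in this guide).
page_url = "https://book.douban.com/tag/编程?start=0&type=T"

# Hypothetical href values, in the form the XPath query might return them.
hrefs = ["/tag/编程?start=20&type=T", "/tag/编程?type=S"]

# Same filter as the .re() call: keep only links that carry a start=<number> offset.
pattern = re.compile(r".*start=\d+.*")

# Relative links are resolved against the current page URL.
next_urls = [urljoin(page_url, href) for href in hrefs if pattern.match(href)]
print(next_urls)
```

The second href is dropped because it has no start offset, so only one follow-up request would be generated.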
MaGe Linux Operations
