Master Scrapy: A Complete Guide to Building Powerful Python Web Crawlers

Scrapy is a fast, high-level Python framework for web crawling and data extraction, built on the asynchronous Twisted engine and organized into modular components such as spiders, pipelines, and middlewares. This guide covers installation, project setup, spider creation, query syntax, recursive crawling, and item pipelines.

MaGe Linux Operations

Nov 13, 2017

Scrapy is a fast, high‑level Python framework for web crawling and data extraction, widely used for data mining, monitoring, and automated testing.

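Part of its appeal is that it is a full-featured framework that is easy to extend. It ships base spider classes for different kinds of crawls, such as BaseSpider (called scrapy.Spider in current releases) and the sitemap spider, and the release current when this guide was written added support for crawling Web 2.0 sites.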

Scrapy uses the asynchronous Twisted networking library for network communication.

[Figure: Scrapy architecture diagram]

Scrapy mainly consists of the following components:

Engine (Scrapy) : Controls the data flow between all components and triggers events when certain actions occur; it is the core of the framework.

Scheduler : Receives requests from the engine, queues them, and returns the next request, acting as a priority queue that also filters duplicate URLs.

Downloader : Downloads web pages and hands the responses back to the engine, which passes them on to the spiders; it is built on Twisted’s asynchronous model.

Spiders : The workhorses that extract the required information (Items) from specific pages and can also generate new URLs for further crawling.

Pipeline : Processes extracted Items for persistence, validation, and cleaning.

Downloader Middlewares : Intercept and process requests/responses between the engine and downloader.

Spider Middlewares : Intercept and process data between the engine and spiders.

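Scheduler Middlewares : Intercept communication between the engine and scheduler. This component appears in older architecture descriptions; current Scrapy documentation lists only downloader and spider middlewares.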

The typical Scrapy workflow is:

Engine fetches a URL from the scheduler.

Engine wraps the URL into a Request and sends it to the downloader.

Downloader retrieves the resource and returns a Response.

Spider parses the Response.

If an Item is produced, it is sent to the pipeline for further processing.

If new URLs are discovered, they are handed back to the scheduler.
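The workflow above can be sketched with plain Python, without Scrapy itself. This is only an illustration of the data flow: the Scheduler class, the download and parse functions, and the FAKE_SITE dictionary are hypothetical stand-ins, not Scrapy's real classes, and the "downloader" reads from memory instead of the network.

```python
import heapq
from itertools import count

# Hypothetical in-memory "site": URL -> (page title, outgoing links).
FAKE_SITE = {
    "http://example.com/": ("Home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("Page A", ["http://example.com/b"]),
    "http://example.com/b": ("Page B", ["http://example.com/"]),
}


class Scheduler:
    """Priority queue of URLs that drops URLs it has already seen."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker: equal priorities keep insertion order

    def enqueue(self, url, priority=0):
        if url in self._seen:
            return False  # duplicate, filtered out
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._order), url))
        return True

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def download(url):
    """Stand-in for the downloader: returns a dict acting as the Response."""
    title, links = FAKE_SITE[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Stand-in for a spider callback: yields one item, then the links found."""
    yield {"url": response["url"], "title": response["title"]}
    for link in response["links"]:
        yield link


def crawl(start_url):
    """Stand-in for the engine: moves data between the other components."""
    scheduler, collected = Scheduler(), []
    scheduler.enqueue(start_url)
    while True:
        url = scheduler.next_url()          # engine asks the scheduler for work
        if url is None:
            break
        response = download(url)            # downloader returns a Response
        for result in parse(response):      # spider parses the Response
            if isinstance(result, dict):
                collected.append(result)    # items go to the "pipeline"
            else:
                scheduler.enqueue(result)   # new URLs go back to the scheduler
    return collected


items = crawl("http://example.com/")
print([item["title"] for item in items])
```

Because the scheduler remembers every URL it has accepted, the circular links in FAKE_SITE do not cause an endless loop; each page is fetched once.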

Installation

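When this tutorial was written, Scrapy’s Python 3 support was still incomplete, so the tutorial uses Python 2.7. Current Scrapy releases run on Python 3 only and can be installed with pip install scrapy. For the Python 2.7 setup on Windows, the pywin32 package is required (choose the build that matches your 32-bit or 64-bit Python), and additional dependencies may include lxml-3.6.4-cp27-cp27m-win_amd64.whl and VCForPython27.msi.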

Basic Usage

1. Create a project

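Run the following command, replacing <project_name> with the name you choose for the project:

scrapy startproject <project_name>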

2. Directory structure generated

Key files:

scrapy.cfg : Project configuration used by Scrapy commands.

items.py : Defines data storage templates (similar to Django models).

pipelines.py : Handles data processing such as persistence.

settings.py : Configures recursion depth, concurrency, download delays, etc.

spiders/ : Directory for spider scripts, usually named after the target domain.
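As an illustration of what settings.py controls, the excerpt below sets the three kinds of options mentioned above. The setting names are standard Scrapy settings; the values and the project name "tutorial" are assumptions chosen for the example, not recommendations.

```python
# settings.py (excerpt): illustrative values only

BOT_NAME = "tutorial"        # hypothetical project name

DEPTH_LIMIT = 2              # maximum link depth to follow; 0 means no limit
CONCURRENT_REQUESTS = 16     # maximum number of requests in flight at once
DOWNLOAD_DELAY = 0.5         # seconds to wait between requests to the same site
```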

Write a Spider

Create xiaohuar_spider.py inside the spiders folder:

Key points:

Define a class inheriting from scrapy.spiders.Spider.

Set a unique name attribute; omission causes an error.

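Implement a parse method. Scrapy uses parse as the default callback for the responses to the start URLs, so keep this exact name unless you set a callback explicitly.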

Provide a list of start URLs; Scrapy iterates over them and sends Requests to the downloader.

Run the Spider

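Navigate to the project directory and execute the following command, where <spider_name> is the value of the spider's name attribute:

scrapy crawl <spider_name>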

Use scrapy crawl <spider_name> --nolog to suppress logs.

Scrapy Query Syntax

Scrapy selectors accept XPath expressions, which makes extraction straightforward. Common patterns:

All descendant div tags: //div

Direct child div tags: /div

Elements with a specific class: //div[@class='c1']

Elements with class and custom attribute: //div[@class='c1'][@name='alex']

Text content of a tag: //div/span/text()

Attribute value, e.g. the href of every link: //a/@href

Recursive Crawling

To follow links discovered in a page, yield new Request objects from parse using a generator, allowing the spider to recursively fetch additional pages.

The recursion depth can be limited via DEPTH_LIMIT in settings.py.

Regex in Query Syntax

Selectors can incorporate regular expressions, e.g.:

Selector(response=response).xpath('//li[re:test(@class, "item-\d*")]//@href').extract()

This extracts href attributes from within li elements whose class matches the pattern item-\d*, that is, "item-" followed by any number of digits.
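To see which class values such a pattern accepts, the regular expression can be tried on its own. The sketch below does not call Scrapy; it uses Python's re module and assumes that the intended pattern is item-\d* (digits after the hyphen) and that re:test succeeds when the pattern matches anywhere in the attribute value. The sample class values are made up.

```python
import re

LOOSE = re.compile(r"item-\d*")      # the pattern as used in the XPath expression
STRICT = re.compile(r"^item-\d+$")   # a stricter variant: whole value, at least one digit

# Made-up class attribute values.
classes = ["item-1", "item-42", "item-", "menu item-7", "items"]

loose_matches = [c for c in classes if LOOSE.search(c)]
strict_matches = [c for c in classes if STRICT.search(c)]

print(loose_matches)
print(strict_matches)
```

Since \d* also matches zero digits, the loose pattern accepts any value containing "item-"; anchoring the pattern and using \d+ is one way to narrow it.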

Data Formatting

Define Item classes in items.py to structure scraped data, then yield Item objects from parse. Pipelines handle persistence, allowing simultaneous storage to files and databases.

MD5 hashing of URLs can be used to shorten them for caching or database keys.
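A small helper for this, using only the standard library. The function name url_key and the sample URL are made up for the example.

```python
import hashlib


def url_key(url):
    """Return a fixed-length key (32 hexadecimal characters) for a URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


key = url_key("http://example.com/some/very/long/path?page=1&sort=desc")
print(key, len(key))
```

The same URL always produces the same key, so the digest works as a cache file name or a database key of predictable length. MD5 is acceptable here because the key is used for identification only, not for security.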

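Yielding an Item automatically forwards it to every pipeline enabled in the ITEM_PIPELINES setting. Each pipeline is assigned an integer, and pipelines with lower values run first, so you can decide whether file or database storage happens first.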
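A pipeline is a plain class with a process_item method, optionally with open_spider and close_spider hooks. The sketch below writes each item as one JSON line; the class name, the output file name items.jl, and the project package name "tutorial" are assumptions made for the example.

```python
import json


class JsonLinesPipeline:
    """Writes every item to a file, one JSON object per line."""

    def __init__(self, path="items.jl"):  # hypothetical output file name
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # returning the item passes it on to the next pipeline


# settings.py: enable the pipeline; lower numbers run first.
# "tutorial" is a hypothetical project package name.
ITEM_PIPELINES = {
    "tutorial.pipelines.JsonLinesPipeline": 300,
}
```

A second pipeline that writes to a database could be added to ITEM_PIPELINES with a different number; as long as each process_item returns the item, every enabled pipeline receives it.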

In summary, this article provides a detailed analysis and hands‑on examples of the Python web‑crawling framework Scrapy.
