Master Scrapy: A Complete Guide to Building Powerful Python Web Crawlers
This guide covers Scrapy's asynchronous Twisted engine and its modular components (spiders, pipelines, and middlewares), then walks through installation, project setup, spider creation, query syntax, recursive crawling, and item pipelines.
Scrapy is a fast, high‑level Python framework for web crawling and data extraction, widely used for data mining, monitoring, and automated testing.
It is attractive because it is a full‑featured framework that is easy to extend: it provides base classes for several kinds of spiders, such as BaseSpider (called scrapy.Spider in current releases) and sitemap spiders, and the version current when this guide was written added support for crawling Web 2.0 sites.
Scrapy uses the asynchronous Twisted networking library for communication. Its architecture mainly consists of the following components:
Scrapy Engine: Controls the data flow between all components and triggers events when certain actions occur; it is the core of the framework.
Scheduler: Receives requests from the engine, queues them, and hands back the next request when the engine asks for one. It acts as a priority queue and also filters out duplicate URLs.
Downloader: Downloads web pages and returns their content to the spiders; it is built on Twisted's asynchronous model.
Spiders: The workhorses that extract the required information (Items) from specific pages and can also generate new URLs for further crawling.
Item Pipeline: Processes extracted Items for persistence, validation, and cleaning.
Downloader Middlewares: Intercept and process requests and responses passing between the engine and the downloader.
Spider Middlewares: Intercept and process the responses going into the spiders and the items and requests coming out of them.
Scheduler Middlewares: Intercept requests and responses passing between the engine and the scheduler.
The typical Scrapy workflow is:
Engine fetches a URL from the scheduler.
Engine wraps the URL into a Request and sends it to the downloader.
Downloader retrieves the resource and returns a Response.
Spider parses the Response.
If an Item is produced, it is sent to the pipeline for further processing.
If new URLs are discovered, they are handed back to the scheduler.
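The workflow above can be mimicked in a few lines of plain Python. This is only a toy model, not Scrapy's implementation: ToyScheduler, FAKE_WEB, download, parse, and crawl are hypothetical names, and the "web" is an in-memory dictionary so that nothing touches the network.

```python
import heapq


class ToyScheduler:
    """Priority queue of URLs that silently drops URLs it has already seen."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal priorities keep insertion order

    def enqueue(self, url, priority=0):
        if url in self._seen:
            return False  # duplicate filtered out
        self._seen.add(url)
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1
        return True

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


# Hypothetical in-memory "web": URL -> (title, outgoing links).
FAKE_WEB = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/b", "/"]),
    "/b": ("page b", []),
}


def download(url):
    """Stand-in for the downloader: returns a response-like dict."""
    title, links = FAKE_WEB[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Stand-in for a spider: yields one item, then newly discovered URLs."""
    yield {"item": response["title"]}
    for link in response["links"]:
        yield {"request": link}


def crawl(start_url):
    """Stand-in for the engine loop."""
    scheduler = ToyScheduler()
    scheduler.enqueue(start_url)
    pipeline_output = []
    while True:
        url = scheduler.next_url()           # engine fetches a URL
        if url is None:
            break
        response = download(url)             # downloader returns a response
        for result in parse(response):       # spider parses the response
            if "item" in result:
                pipeline_output.append(result["item"])   # item goes to the pipeline
            else:
                scheduler.enqueue(result["request"])     # new URL goes to the scheduler
    return pipeline_output


items = crawl("/")
print(items)
```

Each page is visited once even though "/" and "/b" are linked from more than one page, because the scheduler refuses URLs it has already accepted.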
Installation
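The original tutorial uses Python 2.7 because Scrapy did not fully support Python 3 when it was written. Current Scrapy releases run on Python 3 only and can be installed with pip install scrapy. For the Python 2.7 setup on Windows, the pywin32 package is required (choose the build that matches your 32‑ or 64‑bit Python), and additional dependencies may include lxml‑3.6.4‑cp27‑cp27m‑win_amd64.whl and VCForPython27.msi.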
Basic Usage
1. Create a project
Run the command scrapy startproject <project_name>, where <project_name> is the name you choose for the project.
2. Directory structure generated
Key files:
scrapy.cfg: Project configuration used by the Scrapy commands.
items.py: Defines data storage templates (similar to Django models).
pipelines.py: Handles data processing such as persistence.
settings.py: Configures recursion depth, concurrency, download delays, etc.
spiders/: Directory for spider scripts, usually named after the target domain.
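As a rough illustration of what settings.py controls, the fragment below sets the three options mentioned above. The setting names are Scrapy's own; the values are arbitrary assumptions chosen for the example, not recommendations.

```python
# settings.py (fragment); the values are illustrative assumptions

# Maximum link depth to follow from the start URLs (0 means no limit).
DEPTH_LIMIT = 2

# Maximum number of requests Scrapy performs concurrently.
CONCURRENT_REQUESTS = 8

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 1.0
```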
Write a Spider
Create xiaohuar_spider.py inside the spiders folder and define the spider class in it.
Key points:
Define a class inheriting from scrapy.spiders.Spider.
Set a unique name attribute; omission causes an error.
Implement a parse method. Scrapy calls it by default with each downloaded Response, so the name must match exactly.
Provide a list of start URLs in the start_urls attribute; Scrapy iterates over them and sends the resulting Requests to the downloader.
Run the Spider
Navigate to the project directory and execute scrapy crawl <spider_name>, where <spider_name> is the value of the spider's name attribute.
Use scrapy crawl <spider_name> --nolog to suppress logs.
Scrapy Query Syntax
Scrapy supports XPath‑like selectors for easy extraction:
All descendant div tags: //div
Direct child div tags: /div (this form selects from the document root; use ./div for children of the current node)
Elements with a specific class: //div[@class='c1']
Elements with a class and a custom attribute: //div[@class='c1'][@name='alex']
Text content of a tag: //div/span/text()
Attribute value, e.g., the href of every link: //a/@href
Recursive Crawling
To follow links discovered in a page, yield new Request objects from parse using a generator, allowing the spider to recursively fetch additional pages.
The recursion depth can be limited via DEPTH_LIMIT in settings.py.
Regex in Query Syntax
Selectors can incorporate regular expressions, e.g.:
Selector(response=response).xpath('//li[re:test(@class, "item-\d*")]//@href').extract()
This extracts the href attributes found inside li elements whose class matches the pattern item-\d*, that is, "item-" followed by any number of digits.
Data Formatting
Define Item classes in items.py to structure scraped data, then yield Item objects from parse. Pipelines handle persistence, allowing simultaneous storage to files and databases.
An MD5 hash of a URL yields a short, fixed‑length string that can stand in for the full URL as a cache or database key.
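As a small sketch of that idea, the helper below turns a URL of any length into a 32‑character hexadecimal key using the standard hashlib module. The function name url_key and the sample URL are hypothetical.

```python
import hashlib


def url_key(url):
    """Return the MD5 hex digest of a URL (always 32 characters)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


key = url_key("https://example.com/some/very/long/path?page=1&sort=desc")
print(len(key))
```

The digest is used only as a compact identifier here, not for security; the same URL always maps to the same key.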
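Yielding an Item automatically forwards it to the pipelines enabled in the ITEM_PIPELINES setting. Each pipeline is assigned an integer, and pipelines run in ascending order of that number, so you can decide whether file or database storage happens first.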
In summary, this article provides a detailed analysis and hands‑on examples of the Python web‑crawling framework Scrapy.
MaGe Linux Operations
