Master Scrapy: Build Powerful Python Crawlers Step‑by‑Step

This tutorial walks you through the fundamentals of Scrapy, covering its architecture, project setup, spider creation, item and pipeline definitions, pagination techniques, and multiple ways to store scraped data such as JSON files and MongoDB, all with clear code examples.

Python Crawling & Data Mining

Hello everyone! I'm Linhero. This article follows my previous post on using XPath to crawl free proxy IPs and introduces the Scrapy framework.

Preface

One day while shopping I was distracted by a fruit vendor and ended up buying expensive fruit. To avoid such mistakes I decided to use Scrapy to crawl the Beijing Xinfadi market price data.

Scrapy Overview

Before crawling, let's learn what Scrapy is.

Scrapy is a pure‑Python web‑crawling framework for extracting structured data, built on the asynchronous networking library Twisted. Its architecture is clear, its modules are loosely coupled, and it is highly extensible, so you can fetch data with very little code.

Scrapy Architecture

The classic Scrapy architecture diagram (not reproduced here) shows many components. The following list summarizes each component's role:

Engine – Core engine that passes data and signals between modules (no code needed).

Scheduler – Stores requests from the engine and provides them back when needed (no code needed).

Downloader – Downloads web pages and returns the content to the engine (no code needed).

Spiders – Crawl spiders that process pages, extract data and URLs, and return them to the engine (code required).

Item Pipeline – Processes items for cleaning, validation, and storage (code required).

Downloader Middlewares – Bridge between engine and downloader, handling requests/responses; can customize extensions such as proxies (usually no code needed).

Spider Middlewares – Bridge between engine and spiders, handling input responses and output results (usually no code needed).

The Engine sits at the center, making it the core of the framework.

Only the Spiders and Item Pipeline components typically require you to write code.
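The data flow described above can be sketched as a toy loop. This is only an illustration of how requests, responses, and items move between the components: every name here (FAKE_WEB, downloader, spider_parse, pipeline, engine) is made up for the example, and real Scrapy runs this flow asynchronously on Twisted rather than in a simple while loop.

```python
from collections import deque

# Toy stand-ins for Scrapy's components; none of this is Scrapy's real API.
FAKE_WEB = {
    "page/1": {"quotes": ["a", "b"], "next": "page/2"},
    "page/2": {"quotes": ["c"], "next": None},
}

def downloader(url):
    """Pretend to fetch a page and return its content as the response."""
    return FAKE_WEB[url]

def spider_parse(response):
    """Yield extracted items (dicts) and follow-up requests (URL strings)."""
    for quote in response["quotes"]:
        yield {"text": quote}
    if response["next"]:
        yield response["next"]

def pipeline(item, store):
    """Final stage: store the item."""
    store.append(item)

def engine(start_urls):
    scheduler = deque(start_urls)   # queue of pending requests
    seen = set(start_urls)          # duplicate requests are filtered out
    store = []
    while scheduler:
        url = scheduler.popleft()               # engine takes the next request
        response = downloader(url)              # downloader fetches the page
        for result in spider_parse(response):   # spider processes the response
            if isinstance(result, dict):
                pipeline(result, store)         # items go to the item pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)        # new requests go back to the scheduler
    return store

items = engine(["page/1"])
print(items)
```

Note that only spider_parse and pipeline contain logic specific to the site being crawled, which mirrors why those are the two parts you normally write yourself.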

Creating a Scrapy Project

After understanding the framework, create a new Scrapy project with the following command:

scrapy startproject <project_name>

For example, creating a project named test1 yields the following directory structure:

spiders – folder for spider files.

items.py – defines the data structures to be scraped.

middlewares.py – project middleware definitions.

pipelines.py – item pipeline definitions.

settings.py – project settings.

scrapy.cfg – Scrapy deployment configuration.

Spider Creation

Creating a Spider

Enter the project directory and run:

scrapy genspider <spider_name> <allowed_domain>

Using quotes.toscrape.com as an example, the generated firstspider.py looks like:

import scrapy

class FirstspiderSpider(scrapy.Spider):
    name = 'firstSpider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

Key points:

FirstspiderSpider – custom spider class inheriting from scrapy.Spider.

name – uniquely identifies the spider; run it with scrapy crawl <name>.

allowed_domains – restricts crawling to the listed domains.

start_urls – the initial URLs to request.

parse() – the default callback that processes responses and extracts data. Scrapy calls it by name for the start_urls requests, so keep the name unless you set a different callback on your requests.
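To make the effect of allowed_domains concrete, here is a rough approximation of the check in plain Python. The is_allowed function is a hypothetical helper written for this illustration, not part of Scrapy; Scrapy applies its own offsite filtering internally, which likewise accepts the listed domains and their subdomains.

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

allowed = ["quotes.toscrape.com"]
print(is_allowed("http://quotes.toscrape.com/page/2/", allowed))  # True
print(is_allowed("https://example.com/", allowed))                # False
```

Requests to URLs that fail this kind of check are dropped instead of being downloaded.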

Extracting Data in parse()

Example extraction code:

def parse(self, response):
    # Each quote sits in its own <div> under this container
    quotes = response.xpath('/html/body/div[1]/div[2]/div[1]/div')
    for quote in quotes:
        item = {}
        # Strip the curly quotation marks that wrap the quote text
        item['text'] = quote.xpath('./span[1]/text()').extract_first().replace('“', '').replace('”', '')
        item['author'] = quote.xpath('./span[2]/small/text()').extract_first()
        print(item)
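One caveat with chaining .replace() directly onto extract_first(): when the XPath matches nothing, extract_first() returns None and the chained call raises AttributeError. A small helper can guard against that. The name clean_quote is hypothetical, and this is only a sketch of one way to handle the missing-value case.

```python
def clean_quote(text):
    """Strip curly quotation marks from a quote; pass None through unchanged."""
    if text is None:
        return None
    return text.replace("“", "").replace("”", "").strip()

print(clean_quote("“Hello, world.”"))  # Hello, world.
print(clean_quote(None))               # None
```

In the spider, you would wrap the extract_first() result in clean_quote(...) instead of chaining the replace calls.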

Run the spider with:

scrapy crawl firstSpider

To reduce log noise, add the following to settings.py:

LOG_LEVEL = "WARNING"

You can also set a custom User-Agent in settings.py with the USER_AGENT setting.
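Put together, the relevant part of settings.py might look like the fragment below. LOG_LEVEL and USER_AGENT are standard Scrapy settings; the User-Agent string itself is a made-up placeholder, so substitute one that identifies your own crawler.

```python
# settings.py (fragment)

# Only show warnings and errors in the crawl log
LOG_LEVEL = "WARNING"

# Hypothetical User-Agent string; replace it with one that identifies your crawler
USER_AGENT = "test1-tutorial-bot (+https://example.com/contact)"
```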

Defining Items in items.py

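Declaring the fields in an Item class guards against typos: assigning to a field that was not declared raises a KeyError instead of silently creating a new key.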

import scrapy
class Test1Item(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()

Import the item in the spider and replace item = {} with an instance of the class:

from test1.items import Test1Item   # at the top of the spider file

item = Test1Item()                  # inside the loop, in place of item = {}

Using different item classes helps distinguish data from various sources (e.g., JD, Taobao, Pinduoduo) when processing in pipelines.

Item Pipeline Overview

Item Pipelines handle tasks such as cleaning HTML, validating data, removing duplicates, and storing results in databases.

from itemadapter import ItemAdapter

class Test1Pipeline:
    def process_item(self, item, spider):
        return item
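As a sketch of the duplicate-removal task mentioned above, a pipeline can remember what it has already seen and drop repeats by raising DropItem, which is Scrapy's exception for discarding an item. The class name DuplicatesPipeline and the choice of (text, author) as the uniqueness key are assumptions made for this example; the fallback DropItem class exists only so the snippet also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass

class DuplicatesPipeline:  # hypothetical name
    def __init__(self):
        self.seen = set()  # keys of the items processed so far

    def process_item(self, item, spider):
        key = (item["text"], item["author"])
        if key in self.seen:
            # Raising DropItem stops the item from reaching later pipelines
            raise DropItem(f"Duplicate quote: {item['text']!r}")
        self.seen.add(key)
        return item  # pass the item on to the next pipeline
```

Returning the item is what hands it to the next pipeline, so every process_item should either return an item or raise DropItem.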

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'test1.pipelines.Test1Pipeline': 300,
}

Pipeline priority is determined by the numeric value (lower means higher priority). Multiple pipelines can be defined and ordered as needed.
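The ordering rule can be checked with plain Python. The second entry, ValidationPipeline, is a hypothetical pipeline added only to show the ordering; the snippet sorts the dictionary the same way the rule above describes, with lower values running first.

```python
ITEM_PIPELINES = {
    "test1.pipelines.Test1Pipeline": 300,
    "test1.pipelines.ValidationPipeline": 100,  # hypothetical second pipeline
}

# Items pass through the pipelines in ascending order of their values
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(execution_order)
```

Here every item would go through ValidationPipeline before Test1Pipeline, even though it is listed second in the dictionary.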

Passing Data to Pipelines

Replace print(item) in the spider with:

yield item

Using yield turns parse() into a generator, so Scrapy can hand each item to the pipelines as soon as it is produced instead of collecting everything in memory first.
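The lazy behaviour of a generator is easy to see outside Scrapy. In this sketch, parse is a stand-in function (not a real spider method) and the rows are made-up sample data.

```python
def parse(rows):
    """Stand-in for a spider's parse(): yields one item per row."""
    for text, author in rows:
        yield {"text": text, "author": author}

rows = [("quote one", "Author A"), ("quote two", "Author B")]

items = parse(rows)   # nothing runs yet: calling a generator function is lazy
first = next(items)   # runs the body only up to the first yield
print(first)
```

Each call to next() resumes the function until the following yield, which is how a consumer can pull items one at a time.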

Implementing Pagination

Two common methods:

Override start_requests() to generate page URLs.

Generate new requests inside parse().

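# Method 1: override start_requests() to generate the page URLs up front
def start_requests(self):
    for i in range(1, 3):
        url = f'https://quotes.toscrape.com/page/{i}/'
        yield scrapy.Request(url=url, callback=self.parse)

# Method 2: generate follow-up requests inside parse()
def parse(self, response):
    # ... extract items from the current page first ...
    for i in range(2, 3):
        url = f'https://quotes.toscrape.com/page/{i}/'
        yield scrapy.Request(url=url, callback=self.parse)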

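Both build requests with scrapy.Request(), which accepts parameters such as url, callback, headers, cookies, meta (data passed along to the callback), and dont_filter (skip the duplicate-request filter for this request).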
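A variation on generating requests inside parse() is to follow the page's own "next" link rather than hard-coding a page range. The link is usually relative, so it has to be resolved against the current page URL first. The sketch below shows that resolution step with the standard library; next_page_url is a hypothetical helper, and the href value '/page/2/' is an assumption about the site's markup. Inside a spider, response.urljoin(href) performs the same resolution.

```python
from urllib.parse import urljoin

def next_page_url(current_url, next_href):
    """Resolve a relative 'next' link against the current page URL."""
    if not next_href:
        return None  # last page: nothing to follow
    return urljoin(current_url, next_href)

print(next_page_url("https://quotes.toscrape.com/page/1/", "/page/2/"))
# https://quotes.toscrape.com/page/2/
print(next_page_url("https://quotes.toscrape.com/page/10/", None))
# None
```

When the helper returns a URL, parse() would yield a new scrapy.Request for it; when it returns None, the crawl simply ends.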

Saving Scraped Data

Save to files directly from the command line:

scrapy crawl <spider_name> -o output.json   # JSON file
scrapy crawl <spider_name> -o output.jl      # JSON Lines
scrapy crawl <spider_name> -o output.csv    # CSV file
scrapy crawl <spider_name> -o output.xml    # XML file
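The difference between the .json and .jl outputs is easiest to see side by side. This sketch reproduces the general shape of the two formats with the standard json module; the sample items are made up, and the output is not guaranteed to match Scrapy's exporters character for character.

```python
import json

items = [
    {"text": "quote one", "author": "Author A"},  # made-up sample items
    {"text": "quote two", "author": "Author B"},
]

# .json: a single array holding every item
as_json = json.dumps(items)

# .jl: one standalone JSON object per line
as_json_lines = "\n".join(json.dumps(item) for item in items)

print(as_json)
print(as_json_lines)
```

Because every line of a JSON Lines file is a complete object, the file can be read back or appended to one record at a time, which suits large crawls.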

To store data in MongoDB, implement a pipeline:

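from itemadapter import ItemAdapter
from pymongo import MongoClient

# Connect to a local MongoDB instance; database "test1", collection "firstspider"
client = MongoClient()
collection = client["test1"]["firstspider"]

class Test1Pipeline:
    def process_item(self, item, spider):
        # collection.insert() was removed in PyMongo 4; insert_one() needs a plain dict
        collection.insert_one(ItemAdapter(item).asdict())
        return item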
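A common refinement is to open the database connection when the spider starts and close it when the spider finishes, using the open_spider and close_spider hooks that Scrapy calls on pipelines. The following is a minimal sketch under several assumptions: the class name MongoPipeline is hypothetical, MongoDB is assumed to be running on the default localhost port, and the optional collection argument exists only so the class can be exercised without a database.

```python
class MongoPipeline:  # hypothetical name
    def __init__(self, collection=None):
        # A collection can be injected for testing; otherwise open_spider creates one
        self.client = None
        self.collection = collection

    def open_spider(self, spider):
        if self.collection is None:
            # Imported here so the class can be loaded without pymongo installed
            from pymongo import MongoClient
            self.client = MongoClient()  # assumes MongoDB on localhost:27017
            self.collection = self.client["test1"]["firstspider"]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()  # release the connection when the crawl ends

    def process_item(self, item, spider):
        # insert_one adds an _id key to the document, so pass a plain dict copy
        self.collection.insert_one(dict(item))
        return item
```

Tying the connection to the spider's lifetime avoids opening it at import time and makes sure it is closed once the crawl is over.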

Conclusion

This article covered the essential components of the Scrapy framework, including its asynchronous architecture, project setup, spider creation, item definition, pipelines, pagination, and data storage options. With this knowledge you can start building your own crawlers and extend them to scrape various websites.

Written by

Python Crawling & Data Mining
