Master Scrapy: From Basics to Advanced Spider Development

This comprehensive guide introduces Scrapy's architecture, explains its core components and data flow, teaches XPath fundamentals, walks through installation, project creation, spider coding, item and pipeline definitions, middleware customization, pagination handling, and essential settings for effective Python web crawling.

21CTO

1. Scrapy Overview

Scrapy is an event‑driven web‑crawling framework built on Twisted and written in pure Python. It consists of five core components—Engine, Scheduler, Downloader, Spiders, Item Pipelines—and two middleware hooks that coordinate the crawling process.

Engine: central controller that orchestrates events.

Scheduler: queues requests, filters out duplicates, and decides which request is crawled next.

Downloader: fetches web pages.

Spiders: generate requests and parse responses.

Item Pipelines: process extracted items (e.g., store data).

Middlewares: hook points between Engine, Spiders and Downloader.

To build a spider you only need to implement Spiders (what to crawl and how to parse) and Item Pipelines (how to handle the parsed data); the framework handles the rest.

1.2 Scrapy Data Flow

The data flow between components proceeds as follows.

Spiders send a request to the Engine. The Engine puts the request into the Scheduler. The Scheduler selects a request and passes it to the Downloader. The Downloader retrieves the page and returns it to the Engine. The Engine forwards the response to Spiders for parsing. Spiders yield Item objects. Item Pipelines receive the items and perform persistence or other processing.

Key points: the Scheduler decides the order in which queued requests are crawled, the Downloader enforces the concurrency limits, and the Engine, built on Twisted, uses callbacks to keep the components loosely coupled.
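The loop described above can be mimicked in a few lines of plain Python. This is only a toy model of the data flow, not how Scrapy is implemented: it runs synchronously, and every name in it (FAKE_PAGES, download, parse, run_engine) is made up for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (items on the page, links to follow).
FAKE_PAGES = {
    "page1": (["a.jpg", "b.jpg"], ["page2"]),
    "page2": (["c.jpg"], []),
}


def download(url):
    """Downloader: fetch a page (here, a dictionary lookup)."""
    return FAKE_PAGES[url]


def parse(response):
    """Spider: yield items and follow-up requests from a response."""
    items, links = response
    for item in items:
        yield ("item", item)
    for link in links:
        yield ("request", link)


def run_engine(start_urls):
    """Engine: move requests and responses between the other components."""
    scheduler = deque(start_urls)  # Scheduler: FIFO queue of pending requests
    seen = set(start_urls)         # duplicate filter
    stored = []                    # stands in for an Item Pipeline's storage
    while scheduler:
        url = scheduler.popleft()            # Scheduler hands over the next request
        response = download(url)             # Downloader returns the response
        for kind, value in parse(response):  # Spider parses the response
            if kind == "item":
                stored.append(value)         # Item Pipeline receives the item
            elif value not in seen:
                seen.add(value)
                scheduler.append(value)      # new request goes back to the Scheduler
    return stored


print(run_engine(["page1"]))  # ['a.jpg', 'b.jpg', 'c.jpg']
```

In Scrapy itself the same hand-offs happen asynchronously, which is why many downloads can be in flight while spiders and pipelines keep working.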

2. Fundamentals: XPath

XPath is the primary syntax for extracting data from HTML/XML in Scrapy. Basic expressions include:

a/b: b is a direct child of a.

a//b: b is a descendant of a at any depth.

//div[@class='container']: div elements whose class attribute equals 'container'.

//a[contains(@id,'abc')]: a elements whose id attribute contains the substring 'abc'.

Example:

response.xpath('//div[@class="taglist"]/ul//li//a//img/@data-original').getall()

3. Installation

Install Scrapy and its dependencies with:

pip install scrapy

Key dependencies: lxml, parsel, w3lib, twisted, cryptography, pyOpenSSL.

4. Creating a Project

Generate a new project:

scrapy startproject sexy

The generated project typically contains a spiders directory plus items.py, pipelines.py, and settings.py.

5. Building a Simple Spider

5.1 Spider implementation

A spider that downloads images from a site:

import os

import requests
import scrapy

def download_from_url(url):
    # Blocking download via requests; returns the body bytes, or None on a non-200 status.
    resp = requests.get(url, timeout=10)
    if resp.status_code == requests.codes.ok:
        return resp.content
    print(f'{url}-{resp.status_code}')
    return None

class SexySpider(scrapy.Spider):
    name = 'sexy'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/tag/list.html']
    save_path = '/home/sexy/dingziku'

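    def parse(self, response):
        # Make sure the target directory exists before writing files.
        os.makedirs(self.save_path, exist_ok=True)
        img_list = response.xpath('//div[@class="taglist"]/ul//li//a//img/@data-original').getall()
        for img_url in img_list:
            file_name = img_url.split('/')[-1]
            content = download_from_url(img_url)
            if content:
                with open(os.path.join(self.save_path, file_name), 'wb') as fw:
                    fw.write(content)
        # "下一页" is the link text for "next page" on the target site.
        next_page = response.xpath('//div[@class="page both"]/ul/a[text()="下一页"]/@href').get()
        if next_page:
            # Resolve the relative href against the current page and crawl it with the same callback.
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)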
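Deriving the file name with img_url.split('/')[-1] works for plain URLs, but it keeps any query string in the name and returns an empty string for URLs that end in a slash. A slightly more defensive sketch using only the standard library follows; the helper name file_name_from_url and the fallback name are assumptions, not part of the original spider.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def file_name_from_url(url, default="unnamed"):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    path = urlsplit(url).path                 # drops "?query" and "#fragment"
    name = posixpath.basename(unquote(path))  # last segment, percent-decoded
    return name or default                    # fallback for URLs ending in "/"


print(file_name_from_url("http://example.com/img/a.jpg?size=large"))  # a.jpg
print(file_name_from_url("http://example.com/img/"))                  # unnamed
```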

5.2 Items and Pipelines

Define an item:

import scrapy
class SexyItem(scrapy.Item):
    img_url = scrapy.Field()

Pipeline that saves images:

import os

import requests

def download_from_url(url):
    ...  # same helper as in section 5.1

class SexyPipeline:
    def __init__(self):
        self.save_path = '/tmp'

    def process_item(self, item, spider):
        if spider.name == 'sexy':
            img_url = item['img_url']
            file_name = img_url.split('/')[-1]
            content = download_from_url(img_url)
            if content:
                with open(os.path.join(self.save_path, file_name), 'wb') as fw:
                    fw.write(content)
        return item

Enable in settings.py (the number sets the order in which pipelines run; lower values run first):

ITEM_PIPELINES = {'sexy.pipelines.SexyPipeline': 300}

5.3 Automatic pagination

Extract the next‑page URL and yield a new request as shown in the spider code above.
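response.urljoin resolves the extracted href against the URL of the current page, following the same rules as the standard library's urllib.parse.urljoin. Those rules can be explored without running a crawl; the hrefs below are made-up examples.

```python
from urllib.parse import urljoin

page_url = 'http://example.com/tag/list.html'

# Relative href: resolved against the directory of the current page.
relative = urljoin(page_url, 'list_2.html')
# Root-relative href: replaces the whole path.
root_relative = urljoin(page_url, '/tag/list_3.html')
# Absolute href: returned unchanged.
absolute = urljoin(page_url, 'http://example.com/other/list.html')

print(relative)       # http://example.com/tag/list_2.html
print(root_relative)  # http://example.com/tag/list_3.html
print(absolute)       # http://example.com/other/list.html
```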

5.4 Middleware example

Random User‑Agent middleware:

import random

# The old scrapy.contrib package no longer exists; the middleware lives here in current Scrapy.
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
agents = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'
]

class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # setdefault only fills in the header when the request does not already have one.
        request.headers.setdefault('User-Agent', random.choice(agents))

Activate in settings.py (the path below assumes the class is saved in sexy/middlewares/customUserAgent.py):

DOWNLOADER_MIDDLEWARES = {'sexy.middlewares.customUserAgent.RandomUserAgent': 20}

5.5 Common settings

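ROBOTSTXT_OBEY = False  # do not consult robots.txt before requesting pages

CONCURRENT_REQUESTS = 16  # maximum number of requests the downloader performs concurrently

DOWNLOAD_DELAY = 0.5  # seconds to wait between consecutive requests to the same site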

Full settings reference: https://doc.scrapy.org/en/latest/topics/settings.html

6. Conclusion

After reading this guide you should be able to create a functional Scrapy spider, configure items, pipelines, and middlewares, and adjust common settings for reliable crawling.
