Master Scrapy: From Basics to Advanced Spider Development
This comprehensive guide introduces Scrapy's architecture, explains its core components and data flow, teaches XPath fundamentals, walks through installation, project creation, spider coding, item and pipeline definitions, middleware customization, pagination handling, and essential settings for effective Python web crawling.
1. Scrapy Overview
Scrapy is an event‑driven web‑crawling framework built on Twisted and written in pure Python. It consists of five core components—Engine, Scheduler, Downloader, Spiders, Item Pipelines—and two middleware hooks that coordinate the crawling process.
Engine: central controller that orchestrates events.
Scheduler: queues requests and decides which one is dispatched next.
Downloader: fetches web pages.
Spiders: generate requests and parse responses.
Item Pipelines: process extracted items (e.g., store data).
Middlewares: hook points between Engine, Spiders and Downloader.
To build a spider you only need to implement Spiders (what to crawl and how to parse) and Item Pipelines (how to handle the parsed data); the framework handles the rest.
1.2 Scrapy Data Flow
Data moves between the components in the following order:
Spiders send a request to the Engine.
The Engine puts the request into the Scheduler.
The Scheduler selects a request and passes it to the Downloader.
The Downloader retrieves the page and returns it to the Engine.
The Engine forwards the response to Spiders for parsing.
Spiders yield Item objects.
Item Pipelines receive the items and perform persistence or other processing.
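Key points: the Scheduler decides the order in which queued requests are dispatched, while concurrency limits such as CONCURRENT_REQUESTS are enforced at the Downloader; the Engine, built on Twisted, uses callbacks to keep components loosely coupled.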
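The hand-offs between components can be modelled in a few lines of plain Python. The sketch below is a toy, synchronous model and not how Scrapy is implemented (the real Engine is asynchronous and built on Twisted). The names `crawl`, `fake_download`, `fake_parse`, `upper_pipeline`, and the `PAGES` data are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical two-page "site" so the example runs without network access.
PAGES = {
    "page1": {"items": ["a", "b"], "next": "page2"},
    "page2": {"items": ["c"], "next": None},
}


def fake_download(url):
    """Stand-in for the Downloader: return the 'response' for a URL."""
    return PAGES[url]


def fake_parse(url, body):
    """Stand-in for a Spider callback: yield items and follow-up requests."""
    for name in body["items"]:
        yield ("item", {"name": name, "source": url})
    if body["next"]:
        yield ("request", body["next"])


def upper_pipeline(item):
    """Stand-in for an Item Pipeline stage: transform and pass the item on."""
    return {**item, "name": item["name"].upper()}


def crawl(start_urls, download, parse, pipelines):
    """Toy engine loop that mirrors the hand-offs between components."""
    scheduler = deque(start_urls)  # Scheduler: pending requests, FIFO here
    seen = set(start_urls)         # simple duplicate filter
    stored = []
    while scheduler:
        url = scheduler.popleft()        # Engine takes the next request
        body = download(url)             # Downloader fetches the page
        for kind, value in parse(url, body):  # Spider parses the response
            if kind == "request":
                if value not in seen:    # new request goes back to the queue
                    seen.add(value)
                    scheduler.append(value)
            else:
                for stage in pipelines:  # item flows through the pipelines
                    value = stage(value)
                stored.append(value)
    return stored


results = crawl(["page1"], fake_download, fake_parse, [upper_pipeline])
print(results)
```

Swapping the FIFO `deque` for a priority queue would change the crawl order without touching the spider or pipeline code, which is the loose coupling the component split is meant to provide.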
2. Fundamentals: XPath
XPath is the primary syntax for extracting data from HTML/XML in Scrapy. Basic expressions include:
a/b: child relationship.
a//b: descendant relationship.
//div[@class='container']: select elements with a specific attribute value.
//a[contains(@id,'abc')]: select elements whose attribute contains a substring.
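To see what each expression shape selects, here is a small sketch that runs on the standard library alone. It uses `xml.etree.ElementTree`, which supports only a subset of XPath; Scrapy itself evaluates full XPath 1.0 through parsel and lxml. Because ElementTree has no `contains()`, that case is emulated with a list comprehension. The sample markup and variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
HTML = """
<html><body>
  <div class="container">
    <ul>
      <li><a id="abc-1" href="/one">One</a></li>
      <li><a id="xyz-2" href="/two">Two</a></li>
    </ul>
  </div>
  <div class="footer"><a id="abc-3" href="/three">Three</a></div>
</body></html>
""".strip()

root = ET.fromstring(HTML)  # root is the <html> element

# a/b : direct children only (div elements that are children of body)
child_divs = root.findall("./body/div")

# a//b : any descendant (every <a> anywhere below the root)
all_links = root.findall(".//a")

# //div[@class='container'] : filter on an exact attribute value
container = root.find(".//div[@class='container']")
container_links = container.findall(".//a")

# //a[contains(@id,'abc')] : emulated here, since ElementTree lacks contains()
abc_links = [a for a in root.iter("a") if "abc" in a.get("id", "")]

print(len(child_divs))
print([a.get("href") for a in all_links])
print([a.get("href") for a in container_links])
print([a.get("href") for a in abc_links])
```

Inside a spider, the same selections would normally be written as a single `response.xpath(...)` call, with `contains()` available directly.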
Example:
response.xpath('//div[@class="taglist"]/ul//li//a//img/@data-original').getall()
3. Installation
Install Scrapy and its dependencies with:
pip install scrapy
Key dependencies: lxml, parsel, w3lib, twisted, cryptography, pyOpenSSL.
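After installing, it can be useful to confirm which of these packages are actually present in the active environment. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and a value of `None` simply means the package was not found.

```python
from importlib import metadata

# Distribution names taken from the dependency list in this section.
PACKAGES = ["scrapy", "lxml", "parsel", "w3lib", "twisted", "cryptography", "pyOpenSSL"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for package, version in installed_versions(PACKAGES).items():
    print(f"{package}: {version or 'not installed'}")
```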
4. Creating a Project
Generate a new project:
scrapy startproject sexy
Typical directory layout includes spiders, items.py, pipelines.py, and settings.py.
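As a rough sanity check, the generated layout can be verified with a few lines of Python. The sketch below assumes the file set that recent Scrapy versions create (including scrapy.cfg and middlewares.py, which the list above does not mention); the helper `missing_project_files` is a hypothetical name, and the demo runs against a throwaway directory rather than a real project.

```python
import tempfile
from pathlib import Path

# Files assumed to be created by `scrapy startproject <name>`.
EXPECTED_FILES = [
    "scrapy.cfg",
    "{name}/__init__.py",
    "{name}/items.py",
    "{name}/middlewares.py",
    "{name}/pipelines.py",
    "{name}/settings.py",
    "{name}/spiders/__init__.py",
]


def missing_project_files(project_dir, name):
    """Return the expected project files that do not exist under project_dir."""
    root = Path(project_dir)
    expected = [template.format(name=name) for template in EXPECTED_FILES]
    return [rel for rel in expected if not (root / rel).is_file()]


# Demo on a temporary directory that contains only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sexy" / "spiders").mkdir(parents=True)
    (root / "scrapy.cfg").touch()
    (root / "sexy" / "settings.py").touch()
    demo_missing = missing_project_files(root, "sexy")

print(demo_missing)
```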
5. Building a Simple Spider
5.1 Spider implementation
A spider that downloads images from a site:
import os

import requests
import scrapy


def download_from_url(url):
    # Blocking helper: return the raw bytes of one image, or None on failure.
    resp = requests.get(url, stream=True)
    if resp.status_code == requests.codes.ok:
        return resp.content
    print(f'{url}-{resp.status_code}')
    return None


class SexySpider(scrapy.Spider):
    name = 'sexy'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/tag/list.html']
    save_path = '/home/sexy/dingziku'  # must already exist

    def parse(self, response):
        # Collect every image URL on the listing page.
        img_list = response.xpath('//div[@class="taglist"]/ul//li//a//img/@data-original').getall()
        for img_url in img_list:
            file_name = img_url.split('/')[-1]
            content = download_from_url(img_url)
            if content:
                with open(os.path.join(self.save_path, file_name), 'wb') as fw:
                    fw.write(content)
        # Follow the "next page" link; the site labels it 下一页 ("next page").
        next_page = response.xpath('//div[@class="page both"]/ul/a[text()="下一页"]/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
5.2 Items and Pipelines
Define an item:
import scrapy


class SexyItem(scrapy.Item):
    img_url = scrapy.Field()
Pipeline that saves images:
import os

import requests


def download_from_url(url):
    ...  # body elided in the source


class SexyPipeline:
    def __init__(self):
        self.save_path = '/tmp'

    def process_item(self, item, spider):
        # Only handle items produced by the 'sexy' spider.
        if spider.name == 'sexy':
            img_url = item['img_url']
            file_name = img_url.split('/')[-1]
            content = download_from_url(img_url)
            if content:
                with open(os.path.join(self.save_path, file_name), 'wb') as fw:
                    fw.write(content)
        # Always return the item so later pipelines can process it.
        return item
Enable in settings.py:
ITEM_PIPELINES = {'sexy.pipelines.SexyPipeline': 300}
5.3 Automatic pagination
Extract the next‑page URL and yield a new request as shown in the spider code above.
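The `response.urljoin(next_page)` call matters because next-page links are often relative. It resolves them against the current page URL in much the same way as the standard library's `urljoin`, so the resolution step can be sketched without Scrapy. The helper name `next_page_url` and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve a next-page href against the current page URL.

    Returns None when the page has no next link, which is the signal
    for the spider to stop paginating.
    """
    if not href:
        return None
    return urljoin(page_url, href)


page = "http://example.com/tag/list.html"
print(next_page_url(page, "list_2.html"))       # relative to the current directory
print(next_page_url(page, "/tag/list_3.html"))  # relative to the site root
print(next_page_url(page, None))                # last page: nothing to follow
```

In the spider, the `if next_page:` guard plays the role of the `None` check here: when the XPath finds no link, no new request is yielded and the crawl ends.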
5.4 Middleware example
Random User‑Agent middleware:
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random

agents = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'
]


class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # Pick a random User-Agent unless the request already sets one.
        request.headers.setdefault('User-Agent', random.choice(agents))
Activate in settings.py:
DOWNLOADER_MIDDLEWARES = {'sexy.middlewares.customUserAgent.RandomUserAgent': 20}
5.5 Common settings
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5
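A few related settings are often tuned together with the three above. The fragment below is a sketch with illustrative values rather than recommendations; all names are standard Scrapy settings, but suitable numbers depend on the target site.

```python
# settings.py (fragment): illustrative values only
CONCURRENT_REQUESTS = 16            # global cap on requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per individual domain
DOWNLOAD_DELAY = 0.5                # seconds between requests to the same site
DOWNLOAD_TIMEOUT = 30               # seconds before a download is abandoned
RETRY_TIMES = 2                     # extra attempts after a failed request

# AutoThrottle adjusts the delay dynamically based on observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0      # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0       # upper bound for the adjusted delay
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as the lower bound for the delay that the extension chooses.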
Full settings reference: https://doc.scrapy.org/en/latest/topics/settings.html
6. Conclusion
After reading this guide you should be able to create a functional Scrapy spider, configure items, pipelines, and middlewares, and adjust common settings for reliable crawling.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
