Comprehensive Scrapy Tutorial: Architecture, XPath Basics, Installation, Project Setup, and Advanced Features
This article provides a detailed walkthrough of Scrapy, covering its event‑driven architecture, component interactions, XPath parsing fundamentals, installation steps, project creation, sample spider code, item pipelines, middleware customization, and essential configuration settings for effective web crawling in Python.
Scrapy is an event‑driven web crawling framework built on the Twisted library and written entirely in Python, offering a modular architecture that simplifies large‑scale data extraction.
The core components include the Engine (the central controller), Scheduler (task queue), Downloader (fetches web pages), Spiders (defines requests and parses responses), Item Pipelines (processes extracted items), and Middlewares (hooks for request/response processing).
Data flows through these components in a cycle: the Spider yields requests to the Engine; the Engine enqueues them in the Scheduler; the Engine then pulls the next request from the Scheduler and hands it to the Downloader; the Downloader fetches the content and returns the response to the Engine, which passes it back to the Spider for parsing; items yielded by the Spider are sent to the Item Pipelines for storage or further processing, while any new requests re-enter the cycle.
XPath is the primary syntax used for extracting data from HTML/XML; basic operators such as /, //, and the attribute selector @ enable precise node selection, e.g., //div[@class="taglist"]/ul//li//a//img/@data-original.
To install Scrapy, run pip install scrapy. The framework depends on packages such as lxml, parsel, w3lib, and Twisted, plus the security libraries cryptography and pyOpenSSL.
Creating a new project is done with scrapy startproject myproject, which generates a directory structure containing scrapy.cfg, the project package, items.py, middlewares.py, pipelines.py, settings.py, and a spiders folder.
A simple spider example:
<code>import os
import time

import requests
import scrapy


def download_from_url(url):
    """Fetch a URL with requests and return its bytes, or None on failure."""
    response = requests.get(url, stream=True)
    if response.status_code == requests.codes.ok:
        return response.content
    print(f"{url}-{response.status_code}")
    return None


class SexySpider(scrapy.Spider):
    name = 'sexy'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/tag/index.html']
    save_path = '/home/sexy/dingziku'

    def parse(self, response):
        # Collect every image URL on the page via its data-original attribute.
        img_list = response.xpath(
            '//div[@class="taglist"]/ul//li//a//img/@data-original').getall()
        time.sleep(1)  # crude politeness pause; DOWNLOAD_DELAY is the idiomatic alternative
        for img_url in img_list:
            file_name = img_url.split('/')[-1]
            content = download_from_url(img_url)
            if content:
                with open(os.path.join(self.save_path, file_name), 'wb') as fw:
                    fw.write(content)
        # Follow the pagination link whose text is "下一页" ("next page").
        next_page = response.xpath(
            '//div[@class="page both"]/ul/a[text()="下一页"]/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
</code>Items are defined in items.py using scrapy.Field(), e.g., class SexyItem(scrapy.Item): img_url = scrapy.Field(). Pipelines in pipelines.py receive these items and can download files or store data, as shown in the SexyPipeline class.
Custom middlewares can modify requests or responses; an example RandomUserAgent middleware selects a random User‑Agent string and sets it in the request headers.
Important settings include enabling pipelines and middlewares in settings.py with priority values, disabling ROBOTSTXT_OBEY , adjusting CONCURRENT_REQUESTS , and setting DOWNLOAD_DELAY to avoid overloading target sites.
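The settings mentioned above could be wired up in settings.py roughly as follows; the setting names are Scrapy's own, while the dotted pipeline and middleware paths and the numeric values are assumptions for a project named myproject.

```python
# Illustrative settings.py fragment for a hypothetical "myproject" project.
ROBOTSTXT_OBEY = False       # the article suggests disabling robots.txt checks
CONCURRENT_REQUESTS = 8      # cap the number of parallel requests
DOWNLOAD_DELAY = 1           # seconds to wait between requests to the same site

# Lower priority numbers run earlier in the pipeline/middleware chain.
ITEM_PIPELINES = {
    'myproject.pipelines.SexyPipeline': 300,
}
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 543,
}
```

Whether disabling ROBOTSTXT_OBEY is appropriate depends on the target site; combining a nonzero DOWNLOAD_DELAY with a modest CONCURRENT_REQUESTS is the polite default.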
In summary, Scrapy’s modular design lets developers focus on writing spiders, items, pipelines, and middlewares while the framework handles scheduling, downloading, and concurrency, making it a powerful tool for Python‑based web crawling.