Scrapy Tutorial: Installation, Project Structure, Basic Usage, and Real‑World Example
This article provides a comprehensive, step‑by‑step guide to the Scrapy web‑crawling framework, covering its core components, installation methods, project layout, spider creation, data extraction techniques, pagination handling, pipeline configuration, and how to run the crawler to collect and store data.
Scrapy is a fast, high‑level Python framework for web crawling and data extraction, allowing developers to build spiders with minimal code.
The framework consists of several components: Engine, Scheduler, Downloader, Spider, Item, Pipeline, Downloader Middlewares, Spider Middlewares, and Scheduler Middlewares.
Installation can be performed via pip:
$ pip install scrapy

or by downloading the package first:
$ pip download scrapy -d ./
# Using a domestic mirror
$ pip download -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy -d ./

After downloading, install the wheel:
$ pip install Scrapy-1.5.0-py2.py3-none-any.whl

Creating a project is done with:
scrapy startproject mySpider

Generate a spider:
scrapy genspider demo "demo.cn"

The typical workflow has four steps: creating a project, generating a spider, extracting data (e.g., using XPath or CSS selectors), and saving the data via pipelines.
Running a spider can be done from the command line:
scrapy crawl qb  # qb is the spider name

or programmatically in PyCharm:
from scrapy import cmdline
cmdline.execute("scrapy crawl qb".split())

The project directory contains configuration files such as scrapy.cfg, the Python module folder mySpider/, items.py, pipelines.py, settings.py, and the spiders/ directory where spider code resides.
Example items.py definition:
import scrapy
class MyspiderItem(scrapy.Item):
    # Fields matching the CSV columns written by the pipeline below
    imgLink = scrapy.Field()
    title = scrapy.Field()
    types = scrapy.Field()
    vistor = scrapy.Field()
    comment = scrapy.Field()
    likes = scrapy.Field()

Example pipelines.py for CSV output:
from itemadapter import ItemAdapter
import csv
class MyspiderPipeline:
    def __init__(self):
        self.f = open('Zcool.csv', 'w', encoding='utf-8', newline='')
        self.writer = csv.DictWriter(
            self.f,
            fieldnames=['imgLink', 'title', 'types', 'vistor', 'comment', 'likes'])
        self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        self.f.close()

A freshly generated spider scaffold looks like the following (this one was generated for douban.com; the ZCOOL spider fills in the same structure):
import scrapy
class DbSpider(scrapy.Spider):
    name = 'db'
    allowed_domains = ['douban.com']
    start_urls = ['http://douban.com/']

    def parse(self, response):
        pass  # extraction logic goes here

Data extraction uses selectors, e.g., response.xpath() or response.css(), with methods like extract() and extract_first(), and their modern equivalents getall() and get().
Pagination can be handled by following the "next" link:
next_href = response.xpath("//a[@class='laypage_next']/@href").extract_first()
if next_href:
    next_url = response.urljoin(next_href)
    yield scrapy.Request(next_url)  # callback defaults to self.parse

Alternatively, construct URLs manually using a page counter.
Running the crawler via a helper script (start.py) simplifies execution:
from scrapy import cmdline
cmdline.execute('scrapy crawl zc'.split())

After execution, the scraped data is saved to Zcool.csv, confirming successful collection.
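One detail worth stating explicitly: the CSV pipeline only receives items if it is enabled in settings.py. A sketch, assuming the project is named mySpider as above:

```python
# settings.py -- enable the CSV pipeline.
# The integer (0-1000) sets the order; lower numbers run first.
ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
}
```

For quick experiments, Scrapy's built-in feed exports can skip the pipeline entirely, e.g. scrapy crawl zc -o items.json.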
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.