Scrapy Tutorial: Installation, Components, Project Setup, Code Implementation, and Data Storage
This article provides a comprehensive step‑by‑step guide to installing Scrapy, understanding its core components and processing flow, creating a weather‑data crawling project, writing items, settings, middlewares, spiders, running the crawler, exporting results, and storing the scraped data into MongoDB.
Scrapy is a fast, high‑level Python framework for web crawling and data extraction. The tutorial begins by installing Scrapy and its dependencies (including Twisted) with `pip install scrapy`.
It then explains Scrapy's architecture, listing its main components: Engine, Scheduler, Downloader, Spider, Item Pipeline, Downloader Middleware, Spider Middleware, and Scheduler Middleware, and describes the typical data flow among them.
The project analysis targets the AQI history data site (https://www.aqistudy.cn/historydata/). A new Scrapy project is created with `scrapy startproject weather_spider`, and a spider named weather is generated with `scrapy genspider weather www.aqistudy.cn/historydata`.
In `items.py` a WeatherSpiderItem is defined with fields for city, date, AQI, level, PM2.5, PM10, SO2, CO, NO2, and O3_8h. The `settings.py` file is updated with a list of user‑agent strings stored in MY_USER_AGENT, and DOWNLOADER_MIDDLEWARES is configured to activate RandomUserAgentMiddleware at priority 900.
A RandomUserAgentMiddleware class selects a random user‑agent for each request. To handle pages rendered by JavaScript, a WeatherSpiderDownloaderMiddleware is implemented; when a request carries `meta['selenium'] = True`, it launches a headless Chrome instance via Selenium, retrieves the rendered HTML, and returns it wrapped in a `scrapy.http.HtmlResponse`.
The spider code includes three parsing methods:
- `parse`: extracts city URLs and names from the start page and yields requests handled by `parse_month`.
- `parse_month`: extracts the first five month URLs for a city, forwards the city name and a selenium flag via `meta`, and yields requests handled by `parse_day_data`.
- `parse_day_data`: iterates over the table rows, populates a WeatherSpiderItem with daily AQI data, and yields the item.
From the project root, the spider registration can be verified with `scrapy list`. The crawler can export data in JSON, JSON Lines, CSV, or XML formats, e.g., `scrapy crawl weather -o spider.json`. To ensure correct encoding of non‑ASCII output, `FEED_EXPORT_ENCODING = 'utf-8'` is added to `settings.py`.
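Pulling the settings mentioned so far together, `settings.py` might contain a fragment like the following; the user‑agent strings and the priority for the Selenium middleware are assumptions (the article only specifies 900 for RandomUserAgentMiddleware):

```python
# settings.py (fragment)
FEED_EXPORT_ENCODING = 'utf-8'

MY_USER_AGENT = [
    # pool of User-Agent strings; entries here are illustrative
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

DOWNLOADER_MIDDLEWARES = {
    'weather_spider.middlewares.RandomUserAgentMiddleware': 900,
    # priority for the Selenium middleware is an assumed value
    'weather_spider.middlewares.WeatherSpiderDownloaderMiddleware': 543,
}
```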
For persistent storage, a MongoDB pipeline is defined in `pipelines.py`. The pipeline connects to MongoDB using the MONGO_URI and MONGO_DB settings, inserts each item into a collection named after the item class, and closes the connection when the spider finishes. The pipeline is activated via `ITEM_PIPELINES = {'weather_spider.pipelines.MongoPipeline': 300}`.
The article concludes that the presented Scrapy setup can be extended to crawl all cities' weather data from the target site, demonstrating a complete end‑to‑end workflow from installation to data persistence.