Scrapy Tutorial: Installation, Components, Project Setup, Code Implementation, and Data Storage

This article is a step‑by‑step guide to installing Scrapy, understanding its core components and processing flow, and building a weather‑data crawling project: defining items, settings, middlewares, and the spider, then running the crawler, exporting the results, and storing the scraped data in MongoDB.

Full-Stack Internet Architecture

Scrapy is a fast, high‑level Python framework for web crawling and data extraction. The tutorial begins with installing the required dependencies, including Twisted and Scrapy itself, using pip install scrapy.

It then explains Scrapy's architecture, listing its main components: Engine, Scheduler, Downloader, Spider, Item Pipeline, Downloader Middleware, Spider Middleware, and Scheduler Middleware, and describes the typical data flow among them.

The project analysis targets the AQI history data site (https://www.aqistudy.cn/historydata/). A new Scrapy project is created with scrapy startproject weather_spider, and a spider named weather is generated using scrapy genspider weather www.aqistudy.cn/historydata.

In items.py, a WeatherSpiderItem is defined with fields for city, date, AQI, level, PM2.5, PM10, SO2, CO, NO2, and O3_8h. The settings.py file is updated with a list of user‑agent strings stored in MY_USER_AGENT and with the downloader middleware registration (RandomUserAgentMiddleware at priority 900).
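The settings changes could look roughly like the fragment below. This is a sketch under assumptions: the two user‑agent strings are placeholders rather than the article's list, and the module path weather_spider.middlewares is inferred from the project name.

```python
# settings.py (fragment)

# Pool of user-agent strings to rotate through. These two values are
# placeholders, not the list used in the original article.
MY_USER_AGENT = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

DOWNLOADER_MIDDLEWARES = {
    # Module path assumed from the project name weather_spider.
    "weather_spider.middlewares.RandomUserAgentMiddleware": 900,
}
```

Priority 900 places the middleware after Scrapy's built‑in UserAgentMiddleware (priority 500), so a header assigned here replaces the default one.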

A RandomUserAgentMiddleware class selects a random user‑agent for each request. To handle pages rendered by JavaScript, a WeatherSpiderDownloaderMiddleware is implemented; when a request contains meta['selenium'] = True, it launches a headless Chrome instance via Selenium, retrieves the rendered HTML, and returns a scrapy.http.HtmlResponse.
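A minimal sketch of the two middlewares follows. It is not the article's exact code: the constructor argument, the lazy imports, and the choice to start and quit a browser for every request are assumptions made to keep the example short. It relies on the third‑party packages scrapy and selenium plus a local Chrome installation.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware from project settings.
        return cls(crawler.settings.getlist("MY_USER_AGENT"))

    def process_request(self, request, spider):
        if self.user_agents:
            # Plain assignment so the value overrides any default header.
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request


class WeatherSpiderDownloaderMiddleware:
    """Renders requests flagged with meta['selenium'] in headless Chrome."""

    def process_request(self, request, spider):
        if not request.meta.get("selenium"):
            return None  # not flagged: use Scrapy's normal downloader

        # Imported lazily (an assumption of this sketch) so the module loads
        # even where these third-party packages are not installed.
        from scrapy.http import HtmlResponse
        from selenium import webdriver

        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(request.url)
            html = driver.page_source  # HTML after JavaScript has run
        finally:
            driver.quit()

        # Returning a response here short-circuits the normal download.
        return HtmlResponse(
            url=request.url, body=html, encoding="utf-8", request=request
        )
```

Starting a browser per request is slow; a longer‑running crawl would probably reuse one driver instance, which this sketch leaves out.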

The spider code includes three parsing methods. The parse method extracts city URLs and names from the start page and yields requests to parse_month. The parse_month method extracts the first five month URLs for a city, passes along the city name and a selenium flag, and yields requests to parse_day_data. The parse_day_data method iterates over the table rows, populates a WeatherSpiderItem with daily AQI data, and yields the item.
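The item‑population step of parse_day_data can be illustrated without Scrapy selectors. In the sketch below, the helper names, the column order, and the presence of a header row are assumptions; only the field list comes from the article. A plain dict stands in for WeatherSpiderItem, which Scrapy also accepts as an item.

```python
# Field names follow the item fields listed in the article; the column order
# of the table is an assumption of this sketch.
FIELDS = ("date", "aqi", "level", "pm2_5", "pm10", "so2", "co", "no2", "o3_8h")


def row_to_item(city, cells):
    """Map one table row (a list of cell strings) to a dict of item fields."""
    values = [cell.strip() for cell in cells]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} cells, got {len(values)}")
    item = {"city": city}
    item.update(zip(FIELDS, values))
    return item


def rows_to_items(city, rows):
    """Yield one item per data row, skipping the first (header) row."""
    for cells in rows[1:]:
        yield row_to_item(city, cells)
```

Inside the real spider, the cell strings would come from selectors applied to the rendered response, and the city name from the request's meta dictionary.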

From the project root, scrapy list confirms that the weather spider is registered. The crawler can export data in JSON, JSON Lines, CSV, or XML formats, e.g., scrapy crawl weather -o spider.json. To keep non‑ASCII text such as Chinese city names readable in the exported file, FEED_EXPORT_ENCODING = 'utf-8' is added to settings.py.
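The encoding setting matters because, when FEED_EXPORT_ENCODING is unset, Scrapy's JSON output writes non‑ASCII characters as \uXXXX escapes. The standard library's json module shows the same difference; the snippet below is only an analogy using a made‑up record, not Scrapy's exporter code.

```python
import json

record = {"city": "北京"}  # hypothetical item with a Chinese city name

# Default behaviour: non-ASCII characters are escaped.
escaped = json.dumps(record)

# Comparable to exporting with FEED_EXPORT_ENCODING = 'utf-8'.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)   # {"city": "\u5317\u4eac"}
print(readable)  # {"city": "北京"}
```

Both strings decode to the same data; the setting only changes how readable the exported file is.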

For persistent storage, a MongoDB pipeline is defined in pipelines.py. The pipeline connects to MongoDB using the MONGO_URI and MONGO_DB settings, inserts each item into a collection named after the item class, and closes the connection when the spider finishes. The pipeline is activated via ITEM_PIPELINES = {'weather_spider.pipelines.MongoPipeline': 300}.
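A pipeline matching that description might look like the sketch below. It follows the general shape of the MongoDB example in the Scrapy documentation rather than the article's exact code, and it assumes the third‑party pymongo package; the lazy import is a choice of this sketch.

```python
class MongoPipeline:
    """Item pipeline that stores each scraped item in MongoDB."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from the project settings.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DB"),
        )

    def open_spider(self, spider):
        # Imported here (an assumption of this sketch) so the module can be
        # loaded without pymongo installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Collection is named after the item class, e.g. WeatherSpiderItem.
        collection = type(item).__name__
        self.db[collection].insert_one(dict(item))
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()
```

MONGO_URI and MONGO_DB would be set in settings.py, for example to a local server URI and a database name of your choosing.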

The article concludes that the presented Scrapy setup can be extended to crawl all cities' weather data from the target site, demonstrating a complete end‑to‑end workflow from installation to data persistence.
