Design and Implementation of a Scrapy‑Based Web Crawling Platform

This article explains how to design a flexible web‑crawling platform using Scrapy, covering rule maintenance, job scheduling, asynchronous and real‑time crawlers, project setup, code structure, settings, local execution, deployment with scrapyd, API usage, and examples of Flask‑based real‑time services.

Architecture Digest

The article begins by outlining the essential components of a crawling platform: rule maintenance, job scheduler, support for both asynchronous (batch) and real‑time crawlers, and data output to files or Kafka for downstream processing.
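
The components listed above can be pictured with a small sketch. It is not taken from the article: CrawlRule, RuleRegistry, dispatch, and every field name are hypothetical, and the in-memory dictionary merely stands in for whatever store a real rule-maintenance component would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CrawlRule:
    """Hypothetical rule record; field names are illustrative only."""
    name: str              # unique rule identifier
    start_url: str         # page to fetch
    mode: str              # "async" (batch) or "realtime"
    output: str = "file"   # "file" or "kafka"


class RuleRegistry:
    """In-memory stand-in for the rule maintenance component."""

    def __init__(self) -> None:
        self._rules: Dict[str, CrawlRule] = {}

    def save(self, rule: CrawlRule) -> None:
        if rule.mode not in ("async", "realtime"):
            raise ValueError(f"unknown mode: {rule.mode}")
        self._rules[rule.name] = rule

    def get(self, name: str) -> CrawlRule:
        return self._rules[name]


def dispatch(rule: CrawlRule,
             run_async: Callable[[CrawlRule], str],
             run_realtime: Callable[[CrawlRule], str]) -> str:
    """Route a rule to the batch crawler or to the real-time crawler."""
    if rule.mode == "async":
        return run_async(rule)
    return run_realtime(rule)
```

In such a design the job scheduler would call dispatch for each due rule, passing one callable that schedules a Scrapy job and another that calls the real-time service.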

It then introduces Scrapy as the primary framework for asynchronous crawling, showing how to install it with pip install scrapy and create a new project using scrapy startproject zj_scrapy. After navigating into the project directory (cd zj_scrapy), a spider is generated with scrapy genspider sjqq "sj.qq.com".

The generated items.py defines the data fields, e.g.:

# -*- coding: utf-8 -*-
import scrapy


class ZjScrapyItem(scrapy.Item):
    # Name of the app scraped from the category page.
    name = scrapy.Field()

The spider implementation (spiders/sjqq.py) extracts app names from the target page and yields ZjScrapyItem objects:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import HtmlResponse
from zj_scrapy.items import ZjScrapyItem


class SjqqSpider(scrapy.Spider):
    name = 'sjqq'
    allowed_domains = ['sj.qq.com']
    start_urls = ['https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114']

    def parse(self, response: HtmlResponse):
        # Each <li> in the app list holds one app entry.
        name_list = response.xpath('/html/body/div[3]/div[2]/ul/li')
        for each in name_list:
            # .get() returns None instead of raising when nothing matches.
            name = each.xpath('./div/div/a[1]/text()').get()
            if name is None:
                continue
            item = ZjScrapyItem()
            item['name'] = name
            yield item

A simple pipeline (pipelines.py) prints each item and can access custom spider arguments:

# -*- coding: utf-8 -*-
class ZjScrapyPipeline(object):
    def process_item(self, item, spider):
        print("+++++++++++++++++++", item['name'])
        # cc exists only when the spider is started with -a cc=...,
        # so fall back to None instead of raising AttributeError.
        print("-------------------", getattr(spider, 'cc', None))
        return item
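
Because the platform is meant to write data to files or Kafka rather than print it, a file-writing pipeline may be a useful next step. The following is a minimal sketch, not the article's code: the class name JsonLinesPipeline and the default path items.jl are hypothetical, while open_spider, close_spider, and process_item are the hook names Scrapy calls on a pipeline. It assumes each item can be converted with dict(item).

```python
import json


class JsonLinesPipeline:
    """Write every item as one JSON object per line (hypothetical example)."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII app names readable in the file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

To take effect, such a class would have to be registered in ITEM_PIPELINES next to ZjScrapyPipeline. A Kafka sink would follow the same three hooks, with a producer in place of the file handle.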

Key entries in settings.py are demonstrated: the item pipelines (the number sets execution order, lower values run first), the feed export encoding, the maximum number of concurrent requests, and the default request headers:

ITEM_PIPELINES = {
    'zj_scrapy.pipelines.ZjScrapyPipeline': 300,
}
FEED_EXPORT_ENCODING = 'utf-8'
CONCURRENT_REQUESTS = 32
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

To run the spider locally, use scrapy crawl sjqq -o items.csv, which writes the scraped items to items.csv. Custom arguments are passed with -a (e.g., -a cc=scrapttest); Scrapy sets each one as an attribute on the spider, so the pipeline can read it as spider.cc.

For production deployment, the article shows how to install scrapyd and scrapyd-client, start the scrapyd service (default port 6800), and deploy the project using scrapyd-deploy. It also provides curl commands to schedule a spider, check daemon status, list projects, versions, spiders, jobs, and to cancel or delete jobs and projects.
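
The same scrapyd calls can be issued from Python instead of curl. The sketch below is an illustration under assumptions: the host localhost, the helper names build_request and call, and the project name zj_scrapy used in the usage note are not confirmed by the article. The endpoint names are those of scrapyd's JSON API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed host; 6800 is scrapyd's default port.
SCRAPYD_URL = "http://localhost:6800"

# Endpoints that change state are sent as POST, the others as GET.
POST_ENDPOINTS = {"schedule.json", "cancel.json",
                  "delversion.json", "delproject.json"}


def build_request(endpoint, **params):
    """Return (method, url, body) for one scrapyd JSON API call."""
    query = urlencode(params)
    url = f"{SCRAPYD_URL}/{endpoint}"
    if endpoint in POST_ENDPOINTS:
        return "POST", url, query.encode("utf-8")
    if query:
        url += "?" + query
    return "GET", url, None


def call(endpoint, **params):
    """Send the request and decode the JSON reply (needs a running scrapyd)."""
    method, url, body = build_request(endpoint, **params)
    with urlopen(Request(url, data=body, method=method), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a daemon running, call("schedule.json", project="zj_scrapy", spider="sjqq") would schedule the spider and call("listjobs.json", project="zj_scrapy") would list its jobs, assuming the project was deployed under that name.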

The internal architecture of Scrapy is illustrated, describing the roles of Spiders, Engine, Scheduler, Downloader, Item Pipeline, Downloader Middlewares, and Spider Middlewares.
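
To make one of these roles concrete, here is a hedged sketch of a downloader middleware, the hook that sits between the Engine and the Downloader. The class name CrawlerTagMiddleware and the X-Crawler header are invented for illustration; process_request and process_response are the method names Scrapy looks for on a downloader middleware.

```python
class CrawlerTagMiddleware:
    """Hypothetical downloader middleware that tags outgoing requests."""

    def __init__(self, tag="zj_scrapy"):
        self.tag = tag

    def process_request(self, request, spider):
        # Runs for each request on its way from the Engine to the Downloader.
        # Only fill in the header when the spider did not set it already.
        request.headers.setdefault("X-Crawler", self.tag)
        # Returning None tells Scrapy to continue handling this request.
        return None

    def process_response(self, request, response, spider):
        # Runs for each response on its way back to the Engine.
        return response
```

Such a class would be enabled through the DOWNLOADER_MIDDLEWARES setting, in the same dictionary-with-priority form that ITEM_PIPELINES uses.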

To handle real‑time crawling, the article switches to a synchronous approach using requests and BeautifulSoup, presenting a simple class that fetches app names from the same site.

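# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup


class SyncCrawlSjqq(object):
    def parser(self, url):
        # A timeout keeps a slow page from blocking a real-time caller.
        req = requests.get(url, timeout=10)
        req.raise_for_status()
        soup = BeautifulSoup(req.text, "lxml")
        app_list = soup.find(class_='app-list clearfix')
        if app_list is None:
            # The list is missing, e.g. because the page layout changed.
            return []
        names = []
        for li in app_list('li'):
            link = li.find('a', class_="name ofh")
            if link is not None:
                names.append(link.text)
        return names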

A Flask wrapper turns this logic into an HTTP service:

# -*- coding: utf-8 -*-
import json

import requests
from bs4 import BeautifulSoup
from flask import Flask, request, Response

app = Flask(__name__)


class SyncCrawlSjqq(object):
    def parser(self, url):
        req = requests.get(url, timeout=10)
        req.raise_for_status()
        soup = BeautifulSoup(req.text, "lxml")
        app_list = soup.find(class_='app-list clearfix')
        if app_list is None:
            return []
        names = []
        for li in app_list('li'):
            link = li.find('a', class_="name ofh")
            if link is not None:
                names.append(link.text)
        return names


@app.route('/getSyncCrawlSjqqResult', methods=['GET'])
def getSyncCrawlSjqqResult():
    url = request.args.get("url")
    if not url:
        # Reject calls that omit the required query parameter.
        return Response(json.dumps({"error": "missing url parameter"}),
                        status=400, mimetype="application/json")
    syncCrawlSjqq = SyncCrawlSjqq()
    return Response(json.dumps(syncCrawlSjqq.parser(url)),
                    mimetype="application/json")


if __name__ == '__main__':
    app.run(port=3001, host='0.0.0.0', threaded=True)
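
One detail matters when calling this service: the target URL contains its own ? and & characters, so it has to be percent-encoded before it is passed as the url parameter. A small sketch follows; the helper name build_service_url and the localhost address are assumptions, and only the port and route come from the listing above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed local instance of the Flask service.
SERVICE = "http://localhost:3001/getSyncCrawlSjqqResult"


def build_service_url(target_url, service=SERVICE):
    """Encode target_url so that it arrives as a single 'url' parameter."""
    return service + "?" + urlencode({"url": target_url})


target = "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"

# Encoded: the service receives the complete target URL.
encoded = parse_qs(urlsplit(build_service_url(target)).query)

# Naive concatenation: the target's '&' splits it into two parameters,
# so 'url' is cut off before categoryId.
naive = parse_qs(urlsplit(SERVICE + "?url=" + target).query)
```

Here encoded holds the full target under the key url, while naive holds a truncated url plus a stray categoryId entry.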

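Finally, a multithreaded version using concurrent.futures.ThreadPoolExecutor demonstrates how to accelerate the synchronous crawler. Despite the class name, it runs in threads rather than processes, and the speed-up only appears once several URLs are submitted:

# -*- coding: utf-8 -*-
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
import time

import requests
from bs4 import BeautifulSoup


class SyncCrawlSjqqMultiProcessing(object):
    def parser(self, url):
        req = requests.get(url, timeout=10)
        req.raise_for_status()
        soup = BeautifulSoup(req.text, "lxml")
        app_list = soup.find(class_='app-list clearfix')
        if app_list is None:
            return []
        names = []
        for li in app_list('li'):
            link = li.find('a', class_="name ofh")
            if link is not None:
                names.append(link.text)
        return names


if __name__ == '__main__':
    url = "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"
    # Add more URLs here; each one is fetched in its own worker thread.
    urls = [url]
    syncCrawl = SyncCrawlSjqqMultiProcessing()
    t1 = time.time()
    with ThreadPoolExecutor(max_workers=20) as executor:
        # Submit the callable and its argument so that parser() runs in the
        # pool. submit(print, syncCrawl.parser(url)) would run parser() in the
        # main thread and hand only the print call to the pool.
        future_tasks = [executor.submit(syncCrawl.parser, u) for u in urls]
        wait(future_tasks, return_when=ALL_COMPLETED)
    for future in future_tasks:
        print(future.result())
    t2 = time.time()
    print('Regular method, total elapsed time: %s' % (t2 - t1))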
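
For a batch of pages, the thread-pool pattern can be wrapped in a helper that also keeps one failed page from aborting the rest. This is a sketch rather than the article's code: crawl_all is a hypothetical name, and fetch stands for any callable that takes a URL, such as the parser method of the class above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(urls, fetch, max_workers=20):
    """Run fetch(url) for every URL in a thread pool.

    Returns (results, errors), two dicts keyed by URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the URL it was created for.
        futures = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure and keep processing the other pages.
                errors[url] = exc
    return results, errors
```

Threads suit this workload because each worker spends most of its time waiting on the network, so the total time for a batch approaches that of the slowest page rather than the sum of all pages.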

Overall, the article provides a comprehensive guide to building, configuring, running, and deploying both asynchronous Scrapy spiders and synchronous real‑time crawlers, complete with code examples and deployment instructions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
