Master Scrapy: Build, Deploy, and Scale a Python Web Crawler Platform

This guide walks through designing a full‑featured web‑crawler platform, covering rule maintenance, job scheduling, async and real‑time crawling with Scrapy, project setup, item pipelines, settings, local execution, custom parameters, server deployment via Scrapyd, API usage, and fast real‑time crawling with Requests, BeautifulSoup, Flask, and multithreading.

21CTO

Aug 16, 2019

This article details the design of a crawler platform that supports multiple crawling modes. The platform needs components such as rule maintenance and a job scheduler, and it must handle both asynchronous (batch) and real‑time crawlers. Crawled data can be exported to CSV or JSON files, sent to Kafka, processed further with big‑data tools such as Spark or Flink, or stored in databases.

The platform architecture is illustrated in the following diagrams:

Async Crawling with Scrapy

Install Scrapy:

pip install scrapy

Create a Scrapy project to crawl finance‑related apps from the Tencent App Store:

scrapy startproject zj_scrapy

Generate a spider:

scrapy genspider sjqq "sj.qq.com"

The generated project contains several key files:

items.py – defines the data fields, e.g., a name field.

spiders/ – contains the spider class inheriting from scrapy.Spider with name, allowed_domains, start_urls, and a parse method that extracts app names and yields Item objects.

pipelines.py – processes each Item, for example printing the name and a custom spider attribute.

settings.py – configures the pipeline, character encoding, concurrency, and request headers:

ITEM_PIPELINES = {'zj_scrapy.pipelines.ZjScrapyPipeline': 300}
FEED_EXPORT_ENCODING = 'utf-8'
CONCURRENT_REQUESTS = 32
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

Run the spider locally and export results to a CSV file:

scrapy crawl sjqq -o items.csv

Custom parameters can be passed with -a, for example:

scrapy crawl sjqq -o items.csv -a cc=scrapttest

The value is then available in the pipeline as spider.cc.
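A pipeline that reads the -a parameter might look like the sketch below. ZjScrapyPipeline matches the class named in ITEM_PIPELINES above; the getattr fallback and the SimpleNamespace stand‑in for the spider are choices made here for illustration, since Scrapy sets cc on the spider only when -a cc=... is given.

```python
from types import SimpleNamespace


class ZjScrapyPipeline:
    def process_item(self, item, spider):
        # spider.cc exists only when the crawl was started with -a cc=<value>,
        # so fall back to None instead of raising AttributeError.
        cc = getattr(spider, "cc", None)
        print(item["name"], cc)
        # Return the item so later pipelines and feed exports still receive it.
        return item


if __name__ == "__main__":
    # Stand-in for a real spider, used only to exercise the pipeline offline.
    fake_spider = SimpleNamespace(name="sjqq", cc="scrapttest")
    ZjScrapyPipeline().process_item({"name": "Demo App"}, fake_spider)
```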

Deploying Crawlers with Scrapyd

Install the server components:

pip install scrapyd
pip install scrapyd-client

Start Scrapyd (default port 6800) and access the UI at http://localhost:6800/. Deploy the project using scrapyd-deploy:

scrapyd-deploy <target> -p <project> --version <version>

Key Scrapyd APIs:

schedule.json – submit a spider job, for example:

curl http://localhost:6800/schedule.json -d project=zj_scrapy -d spider=sjqq

daemonstatus.json – check server status.

addversion.json – upload a new project version.

cancel.json – cancel a running job.

listprojects.json, listversions.json, listspiders.json, listjobs.json – query projects, versions, spiders, and jobs.

delversion.json and delproject.json – delete versions or entire projects.
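The same endpoints can be called from Python. The following is a minimal sketch using only the standard library; the helper names (build_schedule_request, build_listjobs_request, call) are hypothetical, and the base URL assumes Scrapyd is running locally on its default port. Extra keyword arguments to schedule.json are forwarded to the spider as arguments, much like -a on the command line.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SCRAPYD_URL = "http://localhost:6800"  # assumed: local Scrapyd on the default port


def build_schedule_request(project, spider, base_url=SCRAPYD_URL, **spider_args):
    # schedule.json expects a form-encoded POST; extra fields are passed
    # through to the spider as arguments.
    fields = {"project": project, "spider": spider, **spider_args}
    data = urlencode(fields).encode("ascii")
    return Request(f"{base_url}/schedule.json", data=data, method="POST")


def build_listjobs_request(project, base_url=SCRAPYD_URL):
    # listjobs.json is a GET endpoint that takes the project as a query parameter.
    query = urlencode({"project": project})
    return Request(f"{base_url}/listjobs.json?{query}")


def call(request, timeout=10):
    # Send the request and decode Scrapyd's JSON reply into a dict.
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With Scrapyd running and the project deployed, call(build_schedule_request("zj_scrapy", "sjqq")) returns the decoded JSON reply, which includes a job ID when scheduling succeeds. Keeping request construction separate from sending makes the helpers easy to check without a server.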

Real‑Time Crawling with Requests & BeautifulSoup

When results must be returned to the caller immediately, combine requests and BeautifulSoup:

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import time

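class SyncCrawlSjqq(object):
    def parser(self, url):
        # Fetch the page; the timeout keeps a stalled server from blocking the caller.
        req = requests.get(url, timeout=10)
        soup = BeautifulSoup(req.text, "lxml")
        # Every <li> inside the element whose class is "app-list clearfix".
        name_list = soup.find(class_='app-list clearfix').find_all('li')
        names = []
        for name in name_list:
            # The app name is the text of the <a class="name ofh"> link.
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names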

if __name__ == '__main__':
    sync = SyncCrawlSjqq()
    t1 = time.time()
    url = "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"
    print(sync.parser(url))
    t2 = time.time()
    print('Time elapsed: %s' % (t2 - t1))
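One possible refinement, sketched below, is to separate parsing from fetching so the parsing logic can be tried on a saved or hand‑written HTML string without network access. The function names parse_app_names and fetch_app_names are hypothetical, the CSS selector is a close but not exact equivalent of the find() calls above, and html.parser is used here only to avoid the lxml dependency.

```python
import requests
from bs4 import BeautifulSoup


def parse_app_names(html):
    # html.parser ships with Python; the article's examples use lxml instead.
    soup = BeautifulSoup(html, "html.parser")
    # select() returns an empty list when the app list is missing,
    # so this function yields [] rather than raising in that case.
    links = soup.select(".app-list li a.name")
    return [link.get_text(strip=True) for link in links]


def fetch_app_names(url, timeout=10):
    # Network part: fail fast on timeouts and on HTTP 4xx/5xx responses.
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return parse_app_names(response.text)


if __name__ == "__main__":
    sample = '<ul class="app-list clearfix"><li><a class="name ofh">Demo App</a></li></ul>'
    print(parse_app_names(sample))
```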

Expose this logic as an HTTP service using Flask:

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
from flask import Flask, request, Response
import json

app = Flask(__name__)

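class SyncCrawlSjqq(object):
    def parser(self, url):
        # Fetch the page; the timeout keeps one slow site from tying up a worker.
        req = requests.get(url, timeout=10)
        soup = BeautifulSoup(req.text, "lxml")
        name_list = soup.find(class_='app-list clearfix').find_all('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names

@app.route('/getSyncCrawlSjqqResult', methods=['GET'])
def get_result():
    url = request.args.get("url")
    if not url:
        # Without this check a missing parameter would reach requests.get(None).
        return Response(json.dumps({"error": "missing url parameter"}), status=400, mimetype="application/json")
    crawler = SyncCrawlSjqq()
    return Response(json.dumps(crawler.parser(url)), mimetype="application/json")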

if __name__ == '__main__':
    app.run(port=3001, host='0.0.0.0', threaded=True)
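Because the target page URL itself contains ? and &, it has to be percent‑encoded when passed as the url query parameter; otherwise Flask would treat categoryId as a separate parameter and receive a truncated url. A minimal sketch of building the request URL with the standard library follows. The helper name build_service_url and the localhost address are assumptions; the port and route come from the Flask code above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SERVICE_BASE = "http://localhost:3001"  # assumed host; port taken from app.run


def build_service_url(target_url, base=SERVICE_BASE):
    # urlencode percent-encodes ':', '/', '?', '=' and '&' inside target_url.
    query = urlencode({"url": target_url})
    return f"{base}/getSyncCrawlSjqqResult?{query}"


if __name__ == "__main__":
    target = "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"
    service_url = build_service_url(target)
    print(service_url)
    # Round trip: decoding the query string recovers the full target URL.
    print(parse_qs(urlsplit(service_url).query)["url"][0] == target)
```

The resulting URL can be opened in a browser or fetched with any HTTP client while the Flask service is running.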

Speeding Up Crawls with Multithreading

Use concurrent.futures.ThreadPoolExecutor to run multiple requests in parallel (e.g., 20 workers):

# -*- coding: utf-8 -*-
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
import requests
from bs4 import BeautifulSoup
import time

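# Despite the class name, this example uses threads, not processes.
class SyncCrawlSjqqMultiProcessing(object):
    def parser(self, url):
        # Fetch one category page and return the app names it lists.
        req = requests.get(url, timeout=10)
        soup = BeautifulSoup(req.text, "lxml")
        name_list = soup.find(class_='app-list clearfix').find_all('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names

if __name__ == '__main__':
    # The pool only helps when several pages are fetched at once;
    # add more category URLs to this list to crawl them in parallel.
    urls = ["https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"]
    crawler = SyncCrawlSjqqMultiProcessing()
    t1 = time.time()
    with ThreadPoolExecutor(max_workers=20) as executor:
        # One task per URL; each runs parser() in a worker thread.
        future_tasks = [executor.submit(crawler.parser, u) for u in urls]
        wait(future_tasks, return_when=ALL_COMPLETED)
    for future in future_tasks:
        print(future.result())
    t2 = time.time()
    print('Time elapsed: %s' % (t2 - t1))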

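Multithreading reduces total crawl time only when several pages are fetched concurrently: the work is I/O‑bound, so the time threads spend waiting on the network overlaps. With a single URL submitted to the pool, the elapsed time is essentially the same as in the single‑threaded version.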
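To see the effect without touching the network, the sketch below replaces the HTTP request with a 0.1‑second sleep and compares a sequential loop against a thread pool over ten hypothetical pages. The names fake_fetch, crawl_sequential, crawl_threaded, and timed, along with the page count and the delay, are all assumptions made for illustration; real speedups depend on network latency and on how the target site responds to concurrent requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(page):
    # Stand-in for requests.get(): sleeping releases the GIL, as network I/O does.
    time.sleep(0.1)
    return f"page-{page}"


def crawl_sequential(pages):
    # One page after another; total time is roughly len(pages) * 0.1 s.
    return [fake_fetch(p) for p in pages]


def crawl_threaded(pages, max_workers=20):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() runs fake_fetch concurrently and yields results in input order.
        return list(executor.map(fake_fetch, pages))


def timed(func, *args):
    # Return the function's result together with the elapsed wall-clock seconds.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    pages = list(range(10))
    _, sequential_seconds = timed(crawl_sequential, pages)
    _, threaded_seconds = timed(crawl_threaded, pages)
    print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

Using executor.map instead of submit plus wait keeps results in input order, which is convenient when each result must be matched back to its URL.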

The article concludes with a complete Scrapy internal architecture diagram:
