Master Scrapy: Build, Deploy, and Scale a Python Web Crawler Platform
This guide walks through designing a full‑featured web‑crawler platform, covering rule maintenance, job scheduling, async and real‑time crawling with Scrapy, project setup, item pipelines, settings, local execution, custom parameters, server deployment via Scrapyd, API usage, and fast real‑time crawling with Requests, BeautifulSoup, Flask, and multithreading.
This article details the design of a crawler platform that supports multiple crawling modes. It needs components for rule maintenance and job scheduling, and must handle both asynchronous (batch) and real‑time crawlers. Crawled data can be exported to CSV or JSON files, sent to Kafka, processed further with big‑data tools such as Spark or Flink, or stored in databases.
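As a rough illustration of the file-export step, the sketch below writes a batch of crawled items to CSV and to JSON Lines using only the standard library. It assumes each item is a dict with a single name field, matching the item described in the Scrapy section; the function names and sample values are hypothetical, and Kafka or database sinks are not shown.

```python
import csv
import io
import json


def export_csv(items, fileobj, fieldnames=("name",)):
    """Write items (dicts) as CSV rows, preceded by a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=list(fieldnames), lineterminator="\n")
    writer.writeheader()
    for item in items:
        writer.writerow(item)


def export_jsonl(items, fileobj):
    """Write one JSON object per line; ensure_ascii=False keeps non-ASCII names readable."""
    for item in items:
        fileobj.write(json.dumps(item, ensure_ascii=False) + "\n")


# Hypothetical sample items standing in for real crawl output.
sample_items = [{"name": "App A"}, {"name": "App B"}]

csv_buffer = io.StringIO()
export_csv(sample_items, csv_buffer)

jsonl_buffer = io.StringIO()
export_jsonl(sample_items, jsonl_buffer)
```

In-memory buffers are used here so the sketch runs without touching disk; passing an open file object instead would write real files.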
The platform architecture is illustrated in the following diagrams:
Async Crawling with Scrapy
Install Scrapy:
pip install scrapy
Create a Scrapy project to crawl finance‑related apps from the Tencent App Store:
scrapy startproject zj_scrapy
Generate a spider:
scrapy genspider sjqq "sj.qq.com"
The generated project contains several key files:
items.py – defines the data fields, e.g., a name field.
spiders/ – contains the spider class inheriting from scrapy.Spider with name, allowed_domains, start_urls, and a parse method that extracts app names and yields Item objects.
pipelines.py – processes each Item, for example printing the name and a custom spider attribute.
settings.py – configures the pipeline, character encoding, concurrency, and request headers:
ITEM_PIPELINES = {'zj_scrapy.pipelines.ZjScrapyPipeline': 300}
FEED_EXPORT_ENCODING = 'utf-8'
CONCURRENT_REQUESTS = 32
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
Run the spider locally and export results to a CSV file:
scrapy crawl sjqq -o items.csv
Custom parameters can be passed with -a, e.g., scrapy crawl sjqq -o items.csv -a cc=scrapttest, and accessed in the pipeline via spider.cc.
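A minimal sketch of how a pipeline might read the custom parameter follows. Scrapy sets each -a argument as an attribute on the spider instance, so the pipeline can read spider.cc. The class name matches the ITEM_PIPELINES entry above; the print format, the stand-in spider object, and the sample item are assumptions for local experimentation and are not the original project's code.

```python
from types import SimpleNamespace


class ZjScrapyPipeline:
    """Item pipeline sketch: Scrapy calls process_item() once per yielded item."""

    def process_item(self, item, spider):
        # Arguments passed with -a become attributes on the spider instance;
        # getattr() avoids an AttributeError when -a cc=... was not supplied.
        cc = getattr(spider, "cc", None)
        print("name=%s cc=%s" % (item["name"], cc))
        # Returning the item hands it on to the next pipeline or feed exporter.
        return item


# Stand-ins for local experimentation only (assumptions, not Scrapy objects).
fake_spider = SimpleNamespace(name="sjqq", cc="scrapttest")
fake_item = {"name": "Example App"}
result = ZjScrapyPipeline().process_item(fake_item, fake_spider)
```

Using getattr with a default keeps the pipeline usable when the crawl is started without the extra argument.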
Deploying Crawlers with Scrapyd
Install the server components:
pip install scrapyd
pip install scrapyd-client
Start Scrapyd (default port 6800) and access the UI at http://localhost:6800/. Deploy the project using scrapyd-deploy:
scrapyd-deploy <target> -p <project> --version <version>
Key Scrapyd APIs:
schedule.json – submit a spider job (e.g., curl http://localhost:6800/schedule.json -d project=zj_scrapy -d spider=sjqq).
daemonstatus.json – check server status.
addversion.json – upload a new project version.
cancel.json – cancel a running job.
listprojects.json, listversions.json, listspiders.json, listjobs.json – query projects, versions, spiders, and jobs.
delversion.json and delproject.json – delete versions or entire projects.
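The schedule.json call can also be made from Python. The sketch below is one possible way to do it with the standard library, mirroring the curl example; the helper names build_schedule_request and schedule_job are hypothetical, and the base URL assumes a Scrapyd instance on the default local port.

```python
import json
import urllib.parse
import urllib.request


def build_schedule_request(base_url, project, spider, **spider_args):
    """Build (but do not send) a POST request for Scrapyd's schedule.json."""
    fields = {"project": project, "spider": spider}
    # Extra form fields are forwarded by Scrapyd to the spider as arguments,
    # which is how a value such as cc can be supplied remotely.
    fields.update(spider_args)
    data = urllib.parse.urlencode(fields).encode("utf-8")
    url = base_url.rstrip("/") + "/schedule.json"
    return urllib.request.Request(url, data=data, method="POST")


def schedule_job(base_url, project, spider, **spider_args):
    """Send the request and return the decoded JSON reply (needs a running Scrapyd)."""
    req = build_schedule_request(base_url, project, spider, **spider_args)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network, so it can be inspected safely.
req = build_schedule_request("http://localhost:6800", "zj_scrapy", "sjqq", cc="scrapttest")
```

Calling schedule_job with the same arguments would submit the job for real. Note that Scrapyd interprets a few field names itself (for example setting and jobid), so those should not be used as spider argument names.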
Real‑Time Crawling with Requests & BeautifulSoup
For immediate data returns, combine requests and BeautifulSoup:
# -*- coding: utf-8 -*-
import time

import requests
from bs4 import BeautifulSoup


class SyncCrawlSjqq(object):
    def parser(self, url):
        # Fetch the page and parse it with the lxml backend.
        req = requests.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        # Calling a Tag is shorthand for find_all(): collect every <li> in the app list.
        name_list = soup.find(class_='app-list clearfix')('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names


if __name__ == '__main__':
    sync = SyncCrawlSjqq()
    t1 = time.time()
    url = "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"
    print(sync.parser(url))
    t2 = time.time()
    print('Time elapsed: %s' % (t2 - t1))
Expose this logic as an HTTP service using Flask:
# -*- coding: utf-8 -*-
import json

import requests
from bs4 import BeautifulSoup
from flask import Flask, request, Response

app = Flask(__name__)


class SyncCrawlSjqq(object):
    def parser(self, url):
        req = requests.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        name_list = soup.find(class_='app-list clearfix')('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names


@app.route('/getSyncCrawlSjqqResult', methods=['GET'])
def get_result():
    # The page to crawl is passed in the "url" query parameter.
    crawler = SyncCrawlSjqq()
    return Response(json.dumps(crawler.parser(request.args.get("url"))), mimetype="application/json")


if __name__ == '__main__':
    # threaded=True lets the development server handle requests concurrently.
    app.run(port=3001, host='0.0.0.0', threaded=True)
Speeding Up Crawls with Multithreading
Use concurrent.futures.ThreadPoolExecutor to run multiple requests in parallel (e.g., 20 workers):
# -*- coding: utf-8 -*-
import time
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

import requests
from bs4 import BeautifulSoup


class SyncCrawlSjqqMultiProcessing(object):
    # Despite the class name, the work runs in threads, not processes.
    def parser(self, url):
        req = requests.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        name_list = soup.find(class_='app-list clearfix')('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names


if __name__ == '__main__':
    url = "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"
    # Only one URL is listed here; add more so the pool has work to run in parallel.
    urls = [url]
    crawler = SyncCrawlSjqqMultiProcessing()
    t1 = time.time()
    # The context manager shuts the pool down once all tasks have finished.
    with ThreadPoolExecutor(max_workers=20) as executor:
        future_tasks = [executor.submit(crawler.parser, u) for u in urls]
        wait(future_tasks, return_when=ALL_COMPLETED)
    t2 = time.time()
    print('Time elapsed: %s' % (t2 - t1))
Multithreading reduces total crawling time compared with a single‑threaded approach when several pages are fetched, because the threads spend most of their time waiting on network I/O. With a single URL, as in this script, there is nothing to run in parallel.
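The script above submits a single URL, so the effect of the thread pool is easier to see with several tasks. The sketch below is an illustration only: it replaces the network request with a stand-in function that sleeps, and the names fake_fetch, crawl_all, and the example.com URLs are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.2):
    """Stand-in for a network request: sleeps to simulate I/O wait."""
    time.sleep(delay)
    return "parsed:" + url


def crawl_all(urls, fetch=fake_fetch, max_workers=20):
    """Run fetch over every URL in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))


# Hypothetical URLs used only for the demonstration.
demo_urls = ["https://example.com/page/%d" % i for i in range(5)]

start = time.perf_counter()
results = crawl_all(demo_urls)
elapsed = time.perf_counter() - start
print("Fetched %d pages in %.2f s" % (len(results), elapsed))
```

Because the five simulated requests wait concurrently, the elapsed time should be close to one delay rather than five. Passing a real parser method (such as the one in the class above) as fetch would apply the same pattern to actual pages.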
The article concludes with a complete Scrapy internal architecture diagram:
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.