Design and Implementation of a Scrapy‑Based Web Crawling Platform
This article explains how to design a flexible web‑crawling platform using Scrapy, covering rule maintenance, job scheduling, asynchronous and real‑time crawlers, project setup, code structure, settings, local execution, deployment with scrapyd, API usage, and examples of Flask‑based real‑time services.
The article begins by outlining the essential components of a crawling platform: rule maintenance, job scheduler, support for both asynchronous (batch) and real‑time crawlers, and data output to files or Kafka for downstream processing.
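To make the component list concrete, here is a minimal sketch of how a maintained rule might be represented and routed to the right kind of crawler. The CrawlRule fields, the mode and sink values, and the route function are all hypothetical names chosen for illustration; the article does not specify a rule schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlRule:
    """Hypothetical rule record; every field name here is an assumption."""
    name: str
    start_url: str
    mode: str = "batch"   # "batch" (asynchronous) or "realtime"
    sink: str = "file"    # "file" or "kafka"


def route(rule: CrawlRule) -> str:
    """Return which kind of crawler should handle the rule."""
    if rule.mode == "batch":
        return "scheduled Scrapy job"
    if rule.mode == "realtime":
        return "synchronous HTTP service"
    raise ValueError(f"unknown mode: {rule.mode!r}")


rule = CrawlRule(
    name="sjqq",
    start_url="https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114",
)
print(route(rule))
```

A job scheduler would then iterate over the stored rules and hand each one to the crawler type that route selects.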
It then introduces Scrapy as the primary framework for asynchronous crawling, showing how to install it with pip install scrapy and create a new project using scrapy startproject zj_scrapy. After navigating into the project directory (cd zj_scrapy), a spider is generated with scrapy genspider sjqq "sj.qq.com".
The generated items.py defines the data fields, e.g.:
# -*- coding: utf-8 -*-
import scrapy


class ZjScrapyItem(scrapy.Item):
    # The only field collected in this example: the app name.
    name = scrapy.Field()

The spider implementation (spiders/sjqq.py) extracts app names from the target page and yields ZjScrapyItem objects:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import HtmlResponse

from zj_scrapy.items import ZjScrapyItem


class SjqqSpider(scrapy.Spider):
    name = 'sjqq'
    allowed_domains = ['sj.qq.com']
    start_urls = ['https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114']

    def parse(self, response: HtmlResponse):
        # Each <li> under the app list holds one app entry.
        name_list = response.xpath('/html/body/div[3]/div[2]/ul/li')
        for each in name_list:
            # .get() returns None instead of raising when nothing matches.
            name = each.xpath('./div/div/a[1]/text()').get()
            if name is None:
                continue
            item = ZjScrapyItem()
            item['name'] = name
            yield item

A simple pipeline (pipelines.py) prints each item and can access custom spider arguments:
# -*- coding: utf-8 -*-
class ZjScrapyPipeline(object):
    def process_item(self, item, spider):
        print("+++++++++++++++++++", item['name'])
        # cc exists only when the crawl is started with -a cc=...;
        # getattr avoids an AttributeError when the argument is omitted.
        print("-------------------", getattr(spider, 'cc', None))
        return item

Key settings are demonstrated, such as configuring item pipelines, feed export encoding, concurrent requests, and default request headers:
ITEM_PIPELINES = {
    'zj_scrapy.pipelines.ZjScrapyPipeline': 300,
}
FEED_EXPORT_ENCODING = 'utf-8'
CONCURRENT_REQUESTS = 32
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

Running the spider locally is done with scrapy crawl sjqq -o items.csv. Custom arguments can be passed (e.g., -a cc=scrapttest) and accessed in the pipeline via spider.cc.
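Since the platform is meant to write results to files for downstream processing, the printing pipeline could be swapped for one that persists items. The following is a minimal sketch, not the article's code: the class name JsonLinesPipeline and the default path items.jl are assumptions, while open_spider, close_spider, and process_item are the hook names Scrapy calls on a pipeline.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path="items.jl"):
        self.path = path  # assumed output location
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        record = dict(item)
        # cc is present only when the crawl was started with -a cc=...
        record["cc"] = getattr(spider, "cc", None)
        self.file.write(json.dumps(record, ensure_ascii=False) + "\n")
        return item
```

To try it, the class would be placed in pipelines.py and registered in ITEM_PIPELINES alongside, or instead of, ZjScrapyPipeline.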
For production deployment, the article shows how to install scrapyd and scrapyd-client, start the scrapyd service (default port 6800), and deploy the project using scrapyd-deploy. It also provides curl commands to schedule a spider, check daemon status, list projects, versions, spiders, jobs, and to cancel or delete jobs and projects.
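The same scrapyd endpoints that the curl commands hit can also be called from Python. The sketch below builds requests for schedule.json and listjobs.json using only the standard library. It assumes scrapyd is reachable at localhost on its default port 6800 and that the project was deployed under the name zj_scrapy; the helper function names are illustrative, not part of scrapyd.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SCRAPYD = "http://localhost:6800"  # assumed host; 6800 is the default port


def schedule_request(project, spider, **spider_args):
    """Build the POST request for scrapyd's schedule.json endpoint.

    Extra keyword arguments are forwarded as spider arguments,
    the same role that -a plays for scrapy crawl.
    """
    fields = {"project": project, "spider": spider, **spider_args}
    body = urlencode(fields).encode("utf-8")
    return Request(f"{SCRAPYD}/schedule.json", data=body, method="POST")


def list_jobs_request(project):
    """Build the GET request for scrapyd's listjobs.json endpoint."""
    query = urlencode({"project": project})
    return Request(f"{SCRAPYD}/listjobs.json?{query}")


def send(request):
    """Send a prepared request and decode scrapyd's JSON reply."""
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running scrapyd and a deployed project, send(schedule_request("zj_scrapy", "sjqq", cc="scrapttest")) would queue a job, and send(list_jobs_request("zj_scrapy")) would report its state.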
The internal architecture of Scrapy is illustrated, describing the roles of Spiders, Engine, Scheduler, Downloader, Item Pipeline, Downloader Middlewares, and Spider Middlewares.
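As one illustration of where a Downloader Middleware sits in that flow, here is a small sketch of a middleware that fills in a User-Agent header when a request lacks one. The class name and the agent string are assumptions made for this example; process_request is the hook Scrapy calls for each request travelling from the Engine to the Downloader. A plain stand-in object replaces the real Request so the hook can be exercised without starting a crawl.

```python
from types import SimpleNamespace


class DefaultUserAgentMiddleware:
    """Hypothetical downloader middleware; name and agent string are assumptions."""

    def __init__(self, user_agent="zj_scrapy-example"):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Only set the header if the request does not already carry one.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# Stand-in for a Scrapy Request, used only to exercise the hook here.
fake_request = SimpleNamespace(headers={})
DefaultUserAgentMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers)
```

In a real project such a class would be enabled through the DOWNLOADER_MIDDLEWARES setting, in the same way ITEM_PIPELINES enables pipelines.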
To handle real‑time crawling, the article switches to a synchronous approach using requests and BeautifulSoup, presenting a simple class that fetches app names from the same site.
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup


class SyncCrawlSjqq(object):
    def parser(self, url):
        # A timeout keeps a slow page from blocking the caller indefinitely.
        req = requests.get(url, timeout=10)
        soup = BeautifulSoup(req.text, "lxml")
        # Every <li> inside the app-list container is one app entry.
        name_list = soup.find(class_='app-list clearfix').find_all('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names

A Flask wrapper turns this logic into an HTTP service:
# -*- coding: utf-8 -*-
import json

import requests
from bs4 import BeautifulSoup
from flask import Flask, request, Response

app = Flask(__name__)


class SyncCrawlSjqq(object):
    def parser(self, url):
        # A timeout keeps a slow page from blocking the request thread.
        req = requests.get(url, timeout=10)
        soup = BeautifulSoup(req.text, "lxml")
        name_list = soup.find(class_='app-list clearfix').find_all('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names


@app.route('/getSyncCrawlSjqqResult', methods=['GET'])
def getSyncCrawlSjqqResult():
    url = request.args.get("url")
    if not url:
        # Reject calls that omit the url query parameter.
        return Response(json.dumps({"error": "missing url parameter"}),
                        status=400, mimetype="application/json")
    syncCrawlSjqq = SyncCrawlSjqq()
    return Response(json.dumps(syncCrawlSjqq.parser(url)), mimetype="application/json")


if __name__ == '__main__':
    app.run(port=3001, host='0.0.0.0', threaded=True)

Finally, a multithreaded version using concurrent.futures.ThreadPoolExecutor demonstrates how to accelerate the synchronous crawler:
# -*- coding: utf-8 -*-
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
import time

import requests
from bs4 import BeautifulSoup


class SyncCrawlSjqqMultiProcessing(object):
    def parser(self, url):
        req = requests.get(url, timeout=10)
        soup = BeautifulSoup(req.text, "lxml")
        name_list = soup.find(class_='app-list clearfix').find_all('li')
        names = []
        for name in name_list:
            app_name = name.find('a', class_="name ofh").text
            names.append(app_name)
        return names


if __name__ == '__main__':
    url = "https://sj.qq.com/myapp/category.htm?orgame=1&categoryId=114"
    syncCrawl = SyncCrawlSjqqMultiProcessing()
    t1 = time.time()
    with ThreadPoolExecutor(max_workers=20) as executor:
        # Submit the parser itself so the fetch runs on a worker thread.
        # Writing submit(print, syncCrawl.parser(url)) would run the fetch
        # in the main thread and hand only the print call to the pool.
        future_tasks = [executor.submit(syncCrawl.parser, url)]
        wait(future_tasks, return_when=ALL_COMPLETED)
    for future in future_tasks:
        print(future.result())
    t2 = time.time()
    print('Total elapsed time: %s' % (t2 - t1))

Overall, the article walks through building, configuring, running, and deploying both asynchronous Scrapy spiders and synchronous real‑time crawlers, with code examples and deployment instructions.
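A note on the multithreaded example: with a single URL there is nothing to run in parallel, so the pool is most useful when several pages are fetched at once. The sketch below shows that fan-out under stated assumptions: crawl_all is a hypothetical helper, fake_fetch stands in for the parser method and sleeps instead of doing network I/O, and the extra categoryId values are invented for illustration (only 114 appears in the article).

```python
from concurrent.futures import ThreadPoolExecutor
import time


def crawl_all(urls, fetch, max_workers=20):
    """Run fetch(url) for every URL on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as urls.
        return list(executor.map(fetch, urls))


def fake_fetch(url):
    """Stand-in for the parser method; sleeps to simulate network latency."""
    time.sleep(0.1)
    return [f"app from {url}"]


# Hypothetical category pages used only to have several URLs to fetch.
urls = [
    f"https://sj.qq.com/myapp/category.htm?orgame=1&categoryId={i}"
    for i in (114, 115, 116, 117)
]
start = time.time()
results = crawl_all(urls, fake_fetch)
print(len(results), "pages in", round(time.time() - start, 2), "seconds")
```

Passing SyncCrawlSjqqMultiProcessing().parser as fetch would apply the same pattern to real pages, since the threads spend most of their time waiting on network responses.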
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
