Run All Scrapy Spiders Together and Fix Video Download Errors
This guide shows how to create a custom Scrapy command to launch every spider at once, separate each spider's settings for better modularity, and resolve video download problems by adjusting request headers and handling file saving correctly.
1. Launch All Spiders with a Custom Command
Define a new Scrapy command, crawlall, that iterates over the registered spiders, schedules each one, and then starts the crawler process once so that all of them run concurrently in the same process. The command lives in crawlall.py, inside a package that COMMANDS_MODULE in settings.py points to.
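For Scrapy to discover the command, COMMANDS_MODULE has to name an importable package that contains crawlall.py; the command name is taken from the file name. A minimal sketch of the settings.py entry, assuming a project package called myproject with a commands subpackage (both names are hypothetical, not from the original):

```python
# settings.py (fragment)
# Assumed layout (hypothetical names):
#   myproject/
#       settings.py
#       commands/
#           __init__.py      # makes the directory an importable package
#           crawlall.py      # defines the Command class
COMMANDS_MODULE = "myproject.commands"
```

The command module itself follows.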
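
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # spider_loader is the current name for the deprecated `spiders` attribute
        spider_list = self.crawler_process.spider_loader.list()
        for name in spider_list:
            # Schedule the spider; nothing runs until start() is called
            self.crawler_process.crawl(name, **opts.__dict__)
        # Start the crawler process once; all scheduled spiders run together
        self.crawler_process.start()

Execute the command with a small wrapper script: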

from scrapy.cmdline import execute

execute('scrapy crawlall'.split())

2. Separate Settings for Each Spider
Move spider-specific configuration into a custom_settings class attribute on each spider, where it overrides the project-wide settings.py for that spider only. The example sets request headers, the Redis-based scheduler and dupefilter from scrapy_redis, a download delay, and the Redis connection URL.

custom_settings = {
    'DEFAULT_REQUEST_HEADERS': {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh,zh-CN;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Host": "www.baikemy.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
    },
    'DOWNLOADER_MIDDLEWARES': {},
    'SCHEDULER': "scrapy_redis.scheduler.Scheduler",
    'DUPEFILTER_CLASS': "scrapy_redis.dupefilter.RFPDupeFilter",
    'REDIS_URL': "redis://@192.168.2.196:6379",
    'SCHEDULER_QUEUE_CLASS': "scrapy_redis.queue.SpiderPriorityQueue",
    'DOWNLOAD_DELAY': 0.3,
}

3. Fix Video Download That Results in Unplayable Files
The fix has two parts: send cache-bypass headers (Pragma: no-cache and Cache-Control: no-cache) with the video request, and write the raw response body to the .mp4 file in binary mode.
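When a saved file will not play, it can help to confirm that the response body is video data at all, rather than an error page, before writing it. The sketch below is one way to wrap the save step. It is not the article's code: the helper names (looks_like_mp4, safe_name, save_video), the .part temporary suffix, and the file-name sanitising are all assumptions added for illustration. The check relies on the fact that MP4 files normally begin with a box of type ftyp at byte offset 4; a valid file that starts with a different box would be rejected, so treat it as a heuristic.

```python
import os
import re


def looks_like_mp4(data: bytes) -> bool:
    """Heuristic: MP4 files normally start with a box whose type is 'ftyp'."""
    return len(data) >= 8 and data[4:8] == b"ftyp"


def safe_name(name: str) -> str:
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()


def save_video(body: bytes, base_dir: str, parts: list, title: str) -> str:
    """Write body to base_dir/<parts...>/<title>.mp4 and return the path."""
    if not looks_like_mp4(body):
        raise ValueError("response body does not look like MP4 data")
    target_dir = os.path.join(base_dir, *(safe_name(p) for p in parts))
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, safe_name(title) + ".mp4")
    # Write to a temporary name first, then rename, so an interrupted
    # write does not leave a truncated .mp4 behind.
    tmp_path = path + ".part"
    with open(tmp_path, "wb") as f:
        f.write(body)
    os.replace(tmp_path, path)
    return path
```

In a spider callback this could be called as save_video(response.body, video_dir, [meta['first_level'], meta['second_level'], meta['disease_name']], meta['title']). The article's own request headers and save logic follow.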

headers = {
    "Accept": "*/*",
    "Accept-Encoding": "identity;q=1, *;q=0",
    "Accept-Language": "zh,zh-CN;q=0.9",
    "Connection": "keep-alive",
    "Cache-Control": "no-cache",
    "Host": "v.baikemy.com",
    "Pragma": "no-cache",
    "Range": "bytes=0-",
    "Referer": meta["video_source"],
    "Sec-Fetch-Mode": "no-cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
}
# Requires `import os` and `import logging` at the top of the spider module
video_dir = r'e:\baikemy\video'
# Relative location: first_level\second_level\disease_name\title.mp4
meta['video_location'] = os.path.join(
    meta['first_level'], meta['second_level'],
    meta['disease_name'], meta['title'] + '.mp4')
video_filepath = os.path.join(video_dir, meta['video_location'])
if os.path.isfile(video_filepath):
    logging.info('[video] already exists')
else:
    # Create the target directory tree if it is missing
    os.makedirs(os.path.dirname(video_filepath), exist_ok=True)
    data = response.body
    # Binary mode ('wb') writes the response bytes unchanged
    with open(video_filepath, 'wb') as f:
        logging.info('[video][downloading]: ' + meta['title'])
        f.write(data)
    logging.info('[video][download complete]: ' + meta['title'] + '\n')

With these steps, a single command starts every spider, each spider keeps its own configuration, and downloaded videos are saved as playable files.
