Design and Evolution of a Distributed Scheduling System for Real‑time Alerts in the Beidou Monitoring Platform
This article details the background, design choices, and architectural evolution of a distributed scheduling system, from a simple Redlock‑based implementation for real‑time alerts to a Bull‑powered task queue that supports more complex scenarios, load balancing, persistence, and reliable execution across multiple Node.js servers.
The Beidou front‑end monitoring system consists of data collection (SDK), processing (Java), storage (Druid), analysis (Node.js) and presentation (React). To enable real‑time alerts, a scheduling component was needed on the Node.js layer.
Background : After the first phase, the platform could collect many metrics but lacked diverse data‑driven applications, prompting the addition of real‑time alerting.
Schedule 1.0 – Simple Scenario : Implemented a Redlock‑based distributed lock. Producers generate alert metrics; consumers acquire the lock, compute thresholds, and send notifications. This ensured a single execution per interval across the cluster, suitable for the limited early requirements.
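The "one execution per interval" behaviour can be sketched with a lock key derived from the interval window. This is a minimal sketch, not the platform's code: `TtlLockStore` is a hypothetical in-memory stand-in for a Redis key set with `SET key owner NX PX ttl`, whereas real Redlock acquires the lock on a majority of independent Redis nodes. The names `intervalLockKey`, `runAlertCheck`, and the `alert-check` task name are assumptions made for illustration.

```typescript
// In-memory stand-in for a Redis key with a TTL (hypothetical; real Redlock
// acquires the lock on a majority of independent Redis nodes).
class TtlLockStore {
  private locks = new Map<string, { owner: string; expiresAt: number }>();

  // Mirrors `SET key owner NX PX ttlMs`: succeeds only if the key is
  // absent or has already expired.
  tryAcquire(key: string, owner: string, ttlMs: number, now: number): boolean {
    const held = this.locks.get(key);
    if (held !== undefined && held.expiresAt > now) return false;
    this.locks.set(key, { owner, expiresAt: now + ttlMs });
    return true;
  }
}

// One lock key per (task, interval window), so each window is claimed once.
function intervalLockKey(task: string, now: number, intervalMs: number): string {
  const windowStart = Math.floor(now / intervalMs) * intervalMs;
  return `lock:${task}:${windowStart}`;
}

// Every node calls this on its own timer; only the lock winner runs the check.
function runAlertCheck(
  store: TtlLockStore,
  node: string,
  now: number,
  intervalMs: number,
  check: () => void,
): boolean {
  const key = intervalLockKey("alert-check", now, intervalMs);
  if (!store.tryAcquire(key, node, intervalMs, now)) return false;
  check(); // compute thresholds and send notifications here
  return true;
}

// Three nodes fire in the same one-minute window; only the first one runs.
const store = new TtlLockStore();
const executedBy: string[] = [];
for (const node of ["node-a", "node-b", "node-c"]) {
  runAlertCheck(store, node, 60_000, 60_000, () => {
    executedBy.push(node);
  });
}
```

Note that whichever node wins the lock does all the work for that window, and nothing spreads tasks across nodes.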
Problems with 1.0 : As the system grew, several issues emerged: tasks were unevenly distributed across servers, complexity rose as the number of tasks increased, and there was no ordering, persistence, or retry mechanism.
Schedule 2.0 – Architecture Upgrade : Introduced a task‑queue layer using Bull (Redis‑backed) to provide priority, concurrency control, delayed jobs, rate limiting, pause/resume, repeatable jobs, atomic operations, persistence, and UI support. The new design separates producers (push jobs) and consumers (process jobs), enabling ordered, reliable, and scalable execution.
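To make two of the listed features concrete, the following sketch models how priority and delayed jobs interact when the queue picks the next job. It is an in-memory illustration only and does not use Bull or Redis. The class `PriorityDelayQueue`, its methods, and the job names are hypothetical. The convention that a lower number means higher priority follows Bull's documented behaviour, where 1 is the highest priority.

```typescript
interface ScheduledJob {
  name: string;
  priority: number; // lower number = more urgent (1 is the highest priority)
  runAt: number;    // earliest time in ms the job may start; models a delay
  seq: number;      // insertion order, used to break priority ties FIFO
}

class PriorityDelayQueue {
  private jobs: ScheduledJob[] = [];
  private nextSeq = 0;

  add(name: string, opts: { priority: number; delayMs?: number }, now: number): void {
    this.jobs.push({
      name,
      priority: opts.priority,
      runAt: now + (opts.delayMs ?? 0),
      seq: this.nextSeq++,
    });
  }

  // Returns the most urgent job whose delay has elapsed, or undefined.
  next(now: number): string | undefined {
    const ready = this.jobs.filter((j) => j.runAt <= now);
    if (ready.length === 0) return undefined;
    ready.sort((a, b) => a.priority - b.priority || a.seq - b.seq);
    const chosen = ready[0];
    this.jobs = this.jobs.filter((j) => j !== chosen);
    return chosen.name;
  }
}

// Hypothetical jobs, all added at time 0.
const pdq = new PriorityDelayQueue();
pdq.add("weekly-report", { priority: 5 }, 0);
pdq.add("alert-check", { priority: 1 }, 0);
pdq.add("cache-refresh", { priority: 1, delayMs: 5_000 }, 0);

const pickedOrder = [pdq.next(0), pdq.next(0), pdq.next(0), pdq.next(5_000)];
```

In this run the alert job overtakes the earlier weekly report, and the delayed cache refresh only becomes eligible once its delay has passed.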
Producer Design : Uses node‑schedule to trigger jobs, generates a consistent JobId, and adds the job to Bull, relying on Redis lists for durability.
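One way to read "consistent JobId" is that every server derives the same ID for the same scheduled run, so duplicate submissions collapse into one job. The sketch below assumes that reading. The ID scheme in `buildJobId` (task name plus the fire time floored to its interval) is an assumption, not the article's actual format, and `DedupingQueue` is a hypothetical in-memory stand-in for the queue. It mirrors Bull's documented behaviour that adding a job with an already existing custom `jobId` does not add a second job.

```typescript
// Derive the same ID on every server for the same scheduled run by flooring
// the fire time to the start of its interval (an assumed scheme).
function buildJobId(taskName: string, firedAtMs: number, intervalMs: number): string {
  const slot = Math.floor(firedAtMs / intervalMs) * intervalMs;
  return `${taskName}:${slot}`;
}

interface QueuedJob {
  id: string;
  payload: unknown;
}

// In-memory stand-in for the queue: a second add with an existing ID is ignored.
class DedupingQueue {
  private seen = new Set<string>();
  readonly waiting: QueuedJob[] = [];

  add(id: string, payload: unknown): boolean {
    if (this.seen.has(id)) return false;
    this.seen.add(id);
    this.waiting.push({ id, payload });
    return true;
  }
}

// Three producers fire for the same minute with slightly different clocks.
const dedupQueue = new DedupingQueue();
const addResults = [60_003, 60_120, 60_950].map((firedAt) =>
  dedupQueue.add(buildJobId("alert-check", firedAt, 60_000), { firedAt }),
);
```

Only the first add is accepted, so the cluster enqueues one job per tick no matter how many servers run the timer.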
Consumer Design : Listens with BRPOPLPUSH, processes jobs concurrently, handles completion, failure, and final batch events, ensuring atomic state updates in Redis.
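The reliability of the consumer comes from the pattern behind `BRPOPLPUSH`: a job is moved from the waiting list to an active list in one atomic step, so it is never held only in a worker's memory. The sketch below models that pattern with plain arrays. It is a simplified illustration, not Bull's implementation: `ReliableQueue`, `drain`, and the retry policy are assumptions, `take` does not block, and jobs run one at a time instead of concurrently.

```typescript
type JobId = string;

class ReliableQueue {
  wait: JobId[] = [];      // producers push to the head, like LPUSH
  active: JobId[] = [];    // jobs currently owned by a worker
  completed: JobId[] = [];
  failed: JobId[] = [];
  private attempts = new Map<JobId, number>();

  push(id: JobId): void {
    this.wait.unshift(id);
  }

  // Non-blocking analogue of `BRPOPLPUSH wait active`: pop the tail of
  // `wait` and push it onto the head of `active` as a single step.
  take(): JobId | undefined {
    const id = this.wait.pop();
    if (id !== undefined) this.active.unshift(id);
    return id;
  }

  // Remove the job from `active`, then record success, retry, or failure.
  finish(id: JobId, ok: boolean, maxAttempts: number): void {
    this.active = this.active.filter((x) => x !== id);
    if (ok) {
      this.completed.push(id);
      return;
    }
    const used = (this.attempts.get(id) ?? 0) + 1;
    this.attempts.set(id, used);
    if (used < maxAttempts) this.wait.unshift(id); // retry after the others
    else this.failed.push(id);
  }
}

// Process jobs until the waiting list is empty.
function drain(queue: ReliableQueue, handler: (id: JobId) => boolean, maxAttempts: number): void {
  let id: JobId | undefined;
  while ((id = queue.take()) !== undefined) {
    let ok = false;
    try {
      ok = handler(id);
    } catch {
      ok = false; // a thrown error counts as a failed attempt
    }
    queue.finish(id, ok, maxAttempts);
  }
}

// Hypothetical jobs: "b" fails once and then succeeds, "c" always fails.
const reliableQueue = new ReliableQueue();
["a", "b", "c"].forEach((id) => reliableQueue.push(id));
const calls = new Map<JobId, number>();
drain(
  reliableQueue,
  (id) => {
    const n = (calls.get(id) ?? 0) + 1;
    calls.set(id, n);
    if (id === "c") return false;
    if (id === "b") return n > 1;
    return true;
  },
  2,
);
```

If a worker crashed between `take` and `finish`, the job would still sit in `active`, which is what allows a recovery process to find and requeue it.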
Result : The upgraded system now supports real‑time alerts, sampling analysis, data caching, weekly reports, and other scheduled tasks with high reliability and scalability, while remaining extensible for future challenges.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.