Design and Implementation of ZZSCHEDULE: A Distributed Task Scheduling Platform Based on XXL-JOB
This article presents the background, core features, system architecture, internal mechanisms, and practical extensions of ZZSCHEDULE, a distributed task scheduling platform built on the open‑source XXL‑JOB framework, detailing its design goals, HA strategies, task dependency handling, and deployment experiences at Zhuanzhuan.
Before ZZSCHEDULE, the in‑house task scheduler zzjob suffered from heavy clients, no service aggregation, no timeout settings, no way to abort running tasks, and no support for retries, task dependencies, logging, or permission control. To address these issues, the architecture team developed ZZSCHEDULE based on the open‑source XXL‑JOB platform, aiming for rapid development, easy learning, lightweight footprint, and extensibility.
ZZSCHEDULE currently runs in three environments (test, sandbox, production) and offers the following features:
Simple web UI for CRUD operations, ready in minutes.
Dynamic task state changes, start/stop, and immediate termination.
HA for both the central scheduler and the executors, each through clustered deployment.
Automatic executor registration and discovery.
Elastic scaling of executors.
Rich routing strategies (first, last, round‑robin, random, consistent hash, least‑used, failover, busy‑transfer, etc.).
Failover handling, blocking policies, timeout control, retry mechanisms, and shard‑broadcast tasks.
Real‑time progress monitoring, rolling logs, task dependencies, multiple trigger modes (Cron, dependency, manual), consistency via DB locks, custom parameters, and isolated thread pools for slow tasks.
Full‑asynchronous execution pipeline.
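To make the routing strategies in the list above more concrete, here is a minimal Java sketch of four of them (first, last, round‑robin, failover). The class name RouteSketch, the method names, and the caller‑supplied health check are assumptions made for illustration; they are not the ZZSCHEDULE or XXL‑JOB API.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

/** Illustrative routing helpers; class and method names are hypothetical. */
final class RouteSketch {
    // Counter shared across dispatches so round-robin rotates through executors.
    private final AtomicInteger counter = new AtomicInteger();

    /** "First" strategy: always pick the first registered address. */
    static String first(List<String> addresses) {
        return addresses.get(0);
    }

    /** "Last" strategy: always pick the last registered address. */
    static String last(List<String> addresses) {
        return addresses.get(addresses.size() - 1);
    }

    /** Round-robin: rotate through the list; floorMod keeps the index valid after int overflow. */
    String roundRobin(List<String> addresses) {
        int index = Math.floorMod(counter.getAndIncrement(), addresses.size());
        return addresses.get(index);
    }

    /** Failover: pick the first address that passes the caller-supplied health check. */
    static Optional<String> failover(List<String> addresses, Predicate<String> isAlive) {
        return addresses.stream().filter(isAlive).findFirst();
    }
}
```

In this sketch an empty Optional from failover means no executor passed the check, which the caller would treat as a failed dispatch.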
Additional extensions for Zhuanzhuan include:
Permission control integrated with Zhuanzhuan SSO, with service‑level owners managing tasks.
Failure alerts sent to enterprise WeChat.
One‑click synchronization of task configurations across test, sandbox, and production environments.
Ability to specify a list of executor instances for a task.
A Spring‑Boot starter that auto‑detects environment, registers the client, and injects the necessary beans, enabling developers to use only @XxlJob and @JobHandler annotations.
The system architecture follows the classic XXL‑JOB design: a web‑based scheduling center (the "center") manages job metadata, acquires MySQL locks, and dispatches tasks to executors via TCP (or RESTful API in newer versions). Executors run inside business processes, receive dispatches, execute the job logic, and report results back.
Key internal mechanisms:
Executor registration and discovery: Executors can register proactively (heartbeat) or be discovered passively by the scheduler.
Scheduling and execution flow: The scheduler loads jobs for the next five seconds, places them into a time‑wheel, and a dedicated thread scans the wheel each second, submitting jobs to fast or slow thread pools based on execution time statistics.
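The time‑wheel step can be sketched as follows. This is a simplified illustration, not the ZZSCHEDULE source: the 60‑slot ring keyed by second‑of‑minute, the class name TimeWheelSketch, and the slow‑pool threshold parameter are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative time wheel; slot count and names are assumptions. */
final class TimeWheelSketch {
    private static final int SLOTS = 60; // assumed: one slot per second of a minute
    private final Map<Integer, List<Integer>> ring = new HashMap<>();

    /** Called by the loader thread: file a job id under the slot of its trigger time. */
    synchronized void add(long triggerEpochSecond, int jobId) {
        int slot = (int) Math.floorMod(triggerEpochSecond, (long) SLOTS);
        ring.computeIfAbsent(slot, k -> new ArrayList<>()).add(jobId);
    }

    /** Called once per second by the scanning thread: remove and return the jobs now due. */
    synchronized List<Integer> tick(long nowEpochSecond) {
        int slot = (int) Math.floorMod(nowEpochSecond, (long) SLOTS);
        List<Integer> due = ring.remove(slot);
        return due == null ? new ArrayList<>() : due;
    }

    /**
     * Hypothetical pool choice: route a job to the slow pool once it has been
     * slow more than 'threshold' times in the recent statistics window.
     */
    static boolean useSlowPool(int recentSlowRuns, int threshold) {
        return recentSlowRuns > threshold;
    }
}
```

Because the scheduler only loads jobs due within the next five seconds, two jobs in the same slot of a 60‑slot ring are due in the same second, so the slot index alone is enough to decide what fires on each tick.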
Multi‑scheduler HA: Multiple scheduler instances use a MySQL row‑level lock (schedule_lock) to ensure only one instance loads and updates jobs at a time. Example pseudo‑code:
    begin;
    select * from lock where lock_name = 'schedule_lock' for update;
    select * from job where next_time < now() + 5s;
    -- put the selected jobs into the time wheel
    update job set next_time = xxx where id in (...);
    commit;
    -- sleep 5s
Task dependency: After a job succeeds, the scheduler triggers dependent child jobs.
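A dependency trigger of this kind can be sketched with a parent‑to‑children map. The class DependencySketch and its methods are hypothetical names used only for illustration; how ZZSCHEDULE actually stores dependencies is not described here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative dependency lookup; names and storage are assumptions. */
final class DependencySketch {
    private final Map<Integer, List<Integer>> childrenByParent = new HashMap<>();

    /** Declare that childJobId should run after parentJobId succeeds. */
    void addChild(int parentJobId, int childJobId) {
        childrenByParent.computeIfAbsent(parentJobId, k -> new ArrayList<>()).add(childJobId);
    }

    /** Return the child jobs to trigger; a failed parent triggers nothing. */
    List<Integer> childrenToTrigger(int finishedJobId, boolean succeeded) {
        if (!succeeded) {
            return Collections.<Integer>emptyList();
        }
        List<Integer> children = childrenByParent.get(finishedJobId);
        return children == null
                ? Collections.<Integer>emptyList()
                : Collections.unmodifiableList(children);
    }
}
```

The scheduler would call childrenToTrigger when an executor reports a result and dispatch each returned job id as a dependency‑triggered run.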
Lifecycle management: Tasks support timeout (via FutureTask), interruption (Thread.interrupt()), and retries, with a defined state flow.
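The timeout, interruption, and retry handling can be illustrated with the standard java.util.concurrent classes. The sketch below is an assumption about how these pieces might fit together, not the ZZSCHEDULE implementation: the class TimeoutSketch, the Result enum, and the choice to retry on any non‑success result are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative timeout and retry wrapper; names and states are assumptions. */
final class TimeoutSketch {
    enum Result { SUCCESS, FAIL, TIMEOUT }

    /** Run the job on its own thread and wait at most timeoutMillis for it. */
    static Result runWithTimeout(Callable<Void> job, long timeoutMillis) {
        FutureTask<Void> task = new FutureTask<>(job);
        Thread worker = new Thread(task, "job-worker");
        worker.setDaemon(true);
        worker.start();
        try {
            task.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Result.SUCCESS;
        } catch (TimeoutException e) {
            // cancel(true) calls Thread.interrupt() on the worker; the job
            // only stops if its code reacts to the interruption.
            task.cancel(true);
            return Result.TIMEOUT;
        } catch (ExecutionException e) {
            return Result.FAIL; // the job threw an exception
        } catch (InterruptedException e) {
            task.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return Result.FAIL;
        }
    }

    /** Run once, then retry up to maxRetries times while the result is not SUCCESS. */
    static Result runWithRetry(Callable<Void> job, long timeoutMillis, int maxRetries) {
        Result result = runWithTimeout(job, timeoutMillis);
        for (int attempt = 0; attempt < maxRetries && result != Result.SUCCESS; attempt++) {
            result = runWithTimeout(job, timeoutMillis);
        }
        return result;
    }
}
```

Interruption in Java is cooperative: a job blocked in Thread.sleep or another interruptible call stops promptly, but a job that spins in a loop without checking Thread.currentThread().isInterrupted() keeps running after the timeout is reported.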
Execution logs: Each dispatch generates a log file on the executor host (e.g., /opt/log/zzschedule/{service}/{yyyy‑MM‑dd}/{logId}.txt) and can be viewed in real time via the UI.
In practice, ZZSCHEDULE has been running since February 2020, serving 74 services, 276 executor instances, and over 325 jobs, handling more than 19 million dispatches (≈120 k per day). Lessons learned include verifying Cron expressions before starting jobs, understanding the limits of Thread.interrupt() for termination, and avoiding excessive logging in the scheduler.
Overall, ZZSCHEDULE demonstrates that a well‑designed, lightweight distributed scheduler can provide high availability, extensibility, and operational simplicity for large‑scale microservice environments.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.