Design and Implementation of Elastic-Job: A Distributed Job Scheduling Framework
Elastic-Job is a decentralized distributed job scheduling framework written in Java. It addresses the limitations of existing solutions with Zookeeper-based distributed coordination, parallel execution of sharded tasks, elastic scaling, centralized management, and customizable workflow tasks, and it is designed around non‑functional goals such as stability, idempotency, and fault tolerance. Future plans include multi‑language support and enhanced monitoring.
Author Introduction
Liang Zhang, an architect at Dangdang.com and member of the Dangdang technology committee, focuses on architecture design, messaging middleware, and distributed systems. He leads the development and promotion of ddframe, the Dangdang application framework, whose distributed job component elastic‑job has been open‑sourced.
Why Jobs Are Needed and Existing Problems
1. Why Use Jobs?
Jobs (scheduled tasks) and messaging middleware can often substitute for each other, but they suit different scenarios. Jobs are time‑driven, process data in batches, tolerate non‑real‑time results, and usually run inside a single system. Messages are event‑driven, handled one item at a time, closer to real time, and used to decouple systems.
2. Common Issues with Existing Job Systems
a) Quartz: focuses on scheduling, lacks data‑driven workflow and distributed parallel scheduling.
b) TBSchedule: outdated code, uses Timer (which has reliability issues), limited job types, and poor documentation.
c) Crontab: lacks distributed and centralized management.
Overall, current job systems miss distributed parallel scheduling, elastic scaling, centralized management, and customizable workflow capabilities.
Solution Approach
The two viable options are modifying an existing open‑source product or rebuilding on top of mature open‑source components. The chosen approach is to use existing open‑source building blocks and re‑package them into a new product named elastic‑job.
Elastic‑job adopts a decentralized design with Zookeeper as the registration center. It starts from the idea behind Quartz's database‑based high availability and adds elastic scaling and data sharding on top of it.
1. Core Features
a) Distributed: replaces Quartz’s database‑based distribution with Zookeeper for registration.
b) Parallel Scheduling: tasks are split into independent shards executed in parallel across servers.
c) Elastic Scaling: when servers join or leave, re‑sharding occurs before the next execution.
d) Centralized Management: Zookeeper coordinates job status, assignment, and monitoring.
e) Customizable Workflow Tasks: supports simple tasks, data‑flow tasks (high‑throughput or ordered), with configurable threading and ordering similar to Kafka partitions.
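The parallel scheduling and elastic scaling items above can be illustrated with a small sketch. It is not Elastic‑Job's actual sharding strategy: the class name AverageShardingSketch, the method assign, and the sample IPs are assumptions made for illustration. The sketch splits shard items evenly across the live servers and recomputes the split when the server list changes, which is what happens before the next execution when a server joins or leaves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the Elastic-Job API.
class AverageShardingSketch {

    // Assigns shard items 0..shardCount-1 to servers as evenly as possible.
    // Servers are sorted by IP so that every node computes the same result.
    static Map<String, List<Integer>> assign(Iterable<String> serverIps, int shardCount) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String ip : serverIps) {
            sorted.add(ip);
        }
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        if (sorted.isEmpty()) {
            return result;
        }
        int base = shardCount / sorted.size();
        int remainder = shardCount % sorted.size();
        int nextItem = 0;
        int index = 0;
        for (String ip : sorted) {
            // The first `remainder` servers take one extra shard item.
            int size = base + (index < remainder ? 1 : 0);
            List<Integer> items = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                items.add(nextItem++);
            }
            result.put(ip, items);
            index++;
        }
        return result;
    }

    public static void main(String[] args) {
        // Three servers share ten shard items.
        System.out.println(assign(Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"), 10));
        // prints {10.0.0.1=[0, 1, 2, 3], 10.0.0.2=[4, 5, 6], 10.0.0.3=[7, 8, 9]}

        // One server leaves: re-sharding redistributes all ten items.
        System.out.println(assign(Arrays.asList("10.0.0.1", "10.0.0.2"), 10));
        // prints {10.0.0.1=[0, 1, 2, 3, 4], 10.0.0.2=[5, 6, 7, 8, 9]}
    }
}
```

Because the result depends only on the sorted server list and the shard count, every server that runs the calculation gets the same answer without a central scheduler.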
2. Additional Features
a) Failover Transfer: idle servers can pick up orphan shards during execution.
b) Spring Namespace Support: optional integration with Spring.
c) Operations Platform: a web console for job management.
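The failover item above can be pictured as a race in which idle servers try to claim an orphan shard and exactly one wins. The sketch below is an in‑memory stand‑in, not Elastic‑Job's implementation: FailoverSketch, tryClaim, ownerOf, and release are hypothetical names, and a ConcurrentHashMap plays the role that node creation in the registration center would play in a real deployment.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the registration center; all names are hypothetical.
class FailoverSketch {

    // Orphan shard item -> IP of the idle server that claimed it.
    private final Map<Integer, String> failoverClaims = new ConcurrentHashMap<>();

    // Returns true only for the first server that claims the shard.
    // putIfAbsent is atomic, mirroring the rule that only one client
    // can successfully create a given Zookeeper node.
    boolean tryClaim(int orphanShard, String idleServerIp) {
        return failoverClaims.putIfAbsent(orphanShard, idleServerIp) == null;
    }

    // Returns the current claimant, or null if the shard is unclaimed.
    String ownerOf(int shard) {
        return failoverClaims.get(shard);
    }

    // Clears the claim once the failover run finishes; only the owner can release.
    void release(int shard, String serverIp) {
        failoverClaims.remove(shard, serverIp);
    }

    public static void main(String[] args) {
        FailoverSketch registry = new FailoverSketch();
        System.out.println(registry.tryClaim(2, "10.0.0.2")); // prints true
        System.out.println(registry.tryClaim(2, "10.0.0.3")); // prints false
        System.out.println(registry.ownerOf(2));              // prints 10.0.0.2
    }
}
```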
3. Non‑Functional Requirements
a) Stability: sharding remains stable unless server IP or job name changes.
b) High Performance: batch processing uses automatic splitting and multithreading.
c) Flexibility: features can be toggled to balance performance (e.g., Zookeeper write load).
d) Idempotency: ensures a shard runs on only one server at a time.
e) Fault Tolerance: job stops immediately if connection to Zookeeper is lost.
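The idempotency and fault tolerance items above can be combined into one guard that is checked before a shard runs. This is a single‑process sketch under stated assumptions: ExecutionGuardSketch and its methods are hypothetical, and the running markers live in local memory here, whereas a distributed deployment would have to keep them in the registration center so that all servers see them.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical guard for illustration; not part of the Elastic-Job API.
class ExecutionGuardSketch {

    // Shard items that are currently executing.
    private final Set<Integer> runningShards = ConcurrentHashMap.newKeySet();

    // Whether the registration center is reachable.
    private final AtomicBoolean connected = new AtomicBoolean(true);

    void onConnectionLost() {
        connected.set(false);
    }

    void onReconnected() {
        connected.set(true);
    }

    // Runs the task only if the registry is reachable and the shard is not
    // already running. Returns true if the task was executed.
    boolean runIfAllowed(int shard, Runnable task) {
        if (!connected.get()) {
            return false; // fault tolerance: do not run without coordination
        }
        if (!runningShards.add(shard)) {
            return false; // idempotency: this shard is already executing
        }
        try {
            task.run();
            return true;
        } finally {
            runningShards.remove(shard);
        }
    }

    public static void main(String[] args) {
        ExecutionGuardSketch guard = new ExecutionGuardSketch();
        System.out.println(guard.runIfAllowed(0, () -> {})); // prints true
        guard.onConnectionLost();
        System.out.println(guard.runIfAllowed(0, () -> {})); // prints false
        guard.onReconnected();
        System.out.println(guard.runIfAllowed(0, () -> {})); // prints true
    }
}
```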
Implementation and Design Philosophy
Elastic‑job consists of modules for registration center, data sharding, distributed coordination, timed task handling, and customizable workflow tasks.
a) Decentralization: No central scheduler; each server is peer‑to‑peer, coordinated via the registration center. A master node handles tasks like sharding and cleanup but does not schedule jobs.
b) Registration Center: Currently Zookeeper, storing job configuration, server info, and runtime state. Future plans include supporting etcd or custom implementations.
c) Data Sharding: Maps real data to logical shards, stored in the registration center; servers fetch their assigned shards based on IP.
d) Distributed Coordination: Handles dynamic scaling, master election, and re‑sharding using Zookeeper’s temporary nodes and watchers.
e) Timed Task Processing: Uses Quartz for cron‑based triggers, with safeguards against duplicate execution and missed triggers, each task running in its own thread pool.
f) Customizable Workflow Tasks: Supports simple tasks, fetch/process data flows, and plans for message, file, and workflow tasks via plugins.
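The fetch/process data flow mentioned in the last item can be sketched as a loop that keeps fetching batches for a shard until nothing is left, with each assigned shard running on its own thread. The sketch below is a minimal illustration, not the framework's interface: DataFlowSketch, DataFlowJob, execute, and demo are assumed names, and the in‑memory queues stand in for a real data source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a fetch/process job; not the Elastic-Job API.
class DataFlowSketch {

    // Assumed job contract: fetch a batch for one shard, then process it.
    interface DataFlowJob<T> {
        List<T> fetchData(int shardItem);
        void processData(int shardItem, List<T> batch);
    }

    // Runs each assigned shard on its own thread. A shard keeps fetching
    // and processing until fetchData returns an empty batch.
    static <T> void execute(DataFlowJob<T> job, List<Integer> assignedShards) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, assignedShards.size()));
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int shard : assignedShards) {
                futures.add(pool.submit(() -> {
                    List<T> batch;
                    while (!(batch = job.fetchData(shard)).isEmpty()) {
                        job.processData(shard, batch);
                    }
                }));
            }
            for (Future<?> future : futures) {
                try {
                    future.get(); // wait for the shard to drain
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
    }

    // Demo: two shards backed by in-memory queues, fetched two items at a time.
    static List<String> demo() {
        Map<Integer, Queue<String>> source = new ConcurrentHashMap<>();
        source.put(0, new ConcurrentLinkedQueue<>(Arrays.asList("a", "b", "c")));
        source.put(1, new ConcurrentLinkedQueue<>(Arrays.asList("d")));
        List<String> processed = Collections.synchronizedList(new ArrayList<>());
        execute(new DataFlowJob<String>() {
            @Override
            public List<String> fetchData(int shardItem) {
                List<String> batch = new ArrayList<>();
                String item;
                while (batch.size() < 2 && (item = source.get(shardItem).poll()) != null) {
                    batch.add(item);
                }
                return batch;
            }

            @Override
            public void processData(int shardItem, List<String> batch) {
                processed.addAll(batch);
            }
        }, Arrays.asList(0, 1));
        List<String> sorted = new ArrayList<>(processed);
        Collections.sort(sorted); // shards finish in any order, so sort for display
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a, b, c, d]
    }
}
```

Giving each shard its own thread favors throughput. A job that needs ordered processing would instead keep all items of one shard on a single thread, which this sketch already does within a shard.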
2. Deployment and Usage
Deploy the elastic‑job JAR/WAR and connect it to a shared Zookeeper registration center.
3. Open‑Source Development Philosophy
The project emphasizes clean, readable code, high reuse, minimal duplication, thorough testing (95%+ coverage), modular abstraction, and clear documentation. Quality is prioritized over speed and cost.
Future Outlook
Plans include multi‑language support, richer monitoring (JMX, external systems via Flume), additional registration centers, task workflows, improved failover, more job types (file, MQ), and diverse sharding strategies.
Appendix: Origin of Elastic‑Job
Elastic‑job (formerly dd‑job) originated from Dangdang’s Java application framework ddframe. It was open‑sourced to engage the community, while some core modules remain proprietary.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.