EasyScheduler: An Open‑Source Big Data Workflow Scheduling System – Architecture and Design Overview
This article introduces EasyScheduler, an open‑source big data workflow scheduling system, explaining its core terminology, decentralized architecture, distributed lock implementation, thread‑shortage handling, fault‑tolerance mechanisms, task‑retry and priority designs, as well as its logging solution using Logback and gRPC.
Terminology
EasyScheduler uses a Directed Acyclic Graph (DAG) to assemble tasks, where each node represents a task instance. Key concepts include workflow definition (visual DAG), workflow instance, task instance, task types (SHELL, SQL, SUB_PROCESS, PROCEDURE, MR, SPARK, PYTHON, DEPENDENT), scheduling modes (cron‑based and manual), priority, email alerts, failure strategies, and data back‑fill.
System Architecture
MasterServer
Implements a decentralized design, handling DAG splitting, task submission, and health monitoring of other MasterServer and WorkerServer nodes. It registers a temporary node in ZooKeeper and contains components such as Distributed Quartz, MasterSchedulerThread (scans the command table), MasterExecThread (DAG splitting), and MasterTaskExecThread (task persistence).
WorkerServer
Executes tasks and provides log services. It also registers with ZooKeeper and maintains a heartbeat. Core components include FetchTaskThread (pulls tasks from the Task Queue) and LoggerServer (RPC service for log retrieval).
ZooKeeper
Provides cluster management, fault tolerance, event listening, and distributed locks for Master/Worker nodes. The Task Queue is also implemented on top of ZooKeeper.
Other Components
Alert: stores, queries, and notifies two types of alerts (email and SNMP – not yet implemented).
API: RESTful interfaces for creating, querying, modifying, publishing, and controlling workflows.
UI: Front‑end visual interface for workflow operations.
Design Philosophy
The system contrasts centralized (Master/Slave) and decentralized architectures. Centralized designs suffer from single‑point failures and load imbalance, while decentralized designs eliminate a unique manager, relying on ZooKeeper for leader election and distributed locks, similar to ZooKeeper or etcd.
Distributed Lock Practice
EasyScheduler uses ZooKeeper‑based distributed locks to ensure that only one Master executes the scheduler and only one Worker submits tasks at any time. The lock acquisition flow is illustrated with diagrams in the original article.
Thread‑Shortage and Circular Wait Problem
When a DAG contains many nested sub‑processes, threads may become exhausted, causing a circular wait where parent and child flows wait for each other. Three mitigation strategies are discussed, with the chosen solution being to introduce a special command type that pauses the main flow, freeing threads for pending sub‑flows.
Fault‑Tolerance Design
Fault tolerance covers service crash recovery (Master and Worker) and task retry. ZooKeeper watchers detect node removals and trigger recovery workflows. Master fault‑tolerance re‑submits tasks after verifying their status in the Task Queue; Worker fault‑tolerance follows a similar pattern. Task retry is configurable per business node, while logical nodes do not support retry.
Task Priority Design
Priorities are defined at the workflow instance level, workflow definition level, and task instance level, each with five levels (HIGHEST to LOWEST). Priority strings are stored in ZooKeeper and compared lexicographically to determine execution order.
Log Access via Logback and gRPC
Because UI and Workers may run on different machines, logs are accessed remotely via gRPC. A custom Logback FileAppender creates per‑task log files, and a TaskLogFilter selects log events whose thread name starts with "TaskLogInfo-". The relevant Java code is shown below.
/**
* task log appender
*/
public class TaskLogAppender extends FileAppender<ILoggingEvent> {
@Override
protected void append(ILoggingEvent event) {
if (currentlyActiveFile == null) {
currentlyActiveFile = getFile();
}
String threadName = event.getThreadName();
String[] threadNameArr = threadName.split("-");
String logId = threadNameArr[1]; // processDefineId_processInstanceId_taskInstanceId
// ... additional logic ...
super.subAppend(event);
}
} /**
* task log filter
*/
public class TaskLogFilter extends Filter<ILoggingEvent> {
@Override
public FilterReply decide(ILoggingEvent event) {
if (event.getThreadName().startsWith("TaskLogInfo-")) {
return FilterReply.ACCEPT;
}
return FilterReply.DENY;
}
}Conclusion
EasyScheduler provides a decentralized, ZooKeeper‑coordinated big data workflow scheduling solution with robust fault tolerance, flexible priority handling, and remote log access, making it suitable for large‑scale data processing pipelines.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
