Big Data 14 min read

EasyScheduler: An Open‑Source Big Data Workflow Scheduling System – Architecture and Design Overview

This article introduces EasyScheduler, an open‑source big data workflow scheduling system, explaining its core terminology, decentralized architecture, distributed lock implementation, thread‑shortage handling, fault‑tolerance mechanisms, task‑retry and priority designs, as well as its logging solution using Logback and gRPC.

Architecture Digest
Architecture Digest
Architecture Digest
EasyScheduler: An Open‑Source Big Data Workflow Scheduling System – Architecture and Design Overview

Terminology

EasyScheduler uses a Directed Acyclic Graph (DAG) to assemble tasks, where each node represents a task instance. Key concepts include workflow definition (visual DAG), workflow instance, task instance, task types (SHELL, SQL, SUB_PROCESS, PROCEDURE, MR, SPARK, PYTHON, DEPENDENT), scheduling modes (cron‑based and manual), priority, email alerts, failure strategies, and data back‑fill.

System Architecture

MasterServer

Implements a decentralized design, handling DAG splitting, task submission, and health monitoring of other MasterServer and WorkerServer nodes. It registers a temporary node in ZooKeeper and contains components such as Distributed Quartz, MasterSchedulerThread (scans the command table), MasterExecThread (DAG splitting), and MasterTaskExecThread (task persistence).

WorkerServer

Executes tasks and provides log services. It also registers with ZooKeeper and maintains a heartbeat. Core components include FetchTaskThread (pulls tasks from the Task Queue) and LoggerServer (RPC service for log retrieval).

ZooKeeper

Provides cluster management, fault tolerance, event listening, and distributed locks for Master/Worker nodes. The Task Queue is also implemented on top of ZooKeeper.

Other Components

Alert: stores, queries, and notifies two types of alerts (email and SNMP – not yet implemented).

API: RESTful interfaces for creating, querying, modifying, publishing, and controlling workflows.

UI: Front‑end visual interface for workflow operations.

Design Philosophy

The system contrasts centralized (Master/Slave) and decentralized architectures. Centralized designs suffer from single‑point failures and load imbalance, while decentralized designs eliminate a unique manager, relying on ZooKeeper for leader election and distributed locks, similar to ZooKeeper or etcd.

Distributed Lock Practice

EasyScheduler uses ZooKeeper‑based distributed locks to ensure that only one Master executes the scheduler and only one Worker submits tasks at any time. The lock acquisition flow is illustrated with diagrams in the original article.

Thread‑Shortage and Circular Wait Problem

When a DAG contains many nested sub‑processes, threads may become exhausted, causing a circular wait where parent and child flows wait for each other. Three mitigation strategies are discussed, with the chosen solution being to introduce a special command type that pauses the main flow, freeing threads for pending sub‑flows.

Fault‑Tolerance Design

Fault tolerance covers service crash recovery (Master and Worker) and task retry. ZooKeeper watchers detect node removals and trigger recovery workflows. Master fault‑tolerance re‑submits tasks after verifying their status in the Task Queue; Worker fault‑tolerance follows a similar pattern. Task retry is configurable per business node, while logical nodes do not support retry.

Task Priority Design

Priorities are defined at the workflow instance level, workflow definition level, and task instance level, each with five levels (HIGHEST to LOWEST). Priority strings are stored in ZooKeeper and compared lexicographically to determine execution order.

Log Access via Logback and gRPC

Because UI and Workers may run on different machines, logs are accessed remotely via gRPC. A custom Logback FileAppender creates per‑task log files, and a TaskLogFilter selects log events whose thread name starts with "TaskLogInfo-". The relevant Java code is shown below.

/**
 * task log appender
 */
public class TaskLogAppender extends FileAppender<ILoggingEvent> {
    @Override
    protected void append(ILoggingEvent event) {
        if (currentlyActiveFile == null) {
            currentlyActiveFile = getFile();
        }
        String threadName = event.getThreadName();
        String[] threadNameArr = threadName.split("-");
        String logId = threadNameArr[1]; // processDefineId_processInstanceId_taskInstanceId
        // ... additional logic ...
        super.subAppend(event);
    }
}
/**
 * task log filter
 */
public class TaskLogFilter extends Filter<ILoggingEvent> {
    @Override
    public FilterReply decide(ILoggingEvent event) {
        if (event.getThreadName().startsWith("TaskLogInfo-")) {
            return FilterReply.ACCEPT;
        }
        return FilterReply.DENY;
    }
}

Conclusion

EasyScheduler provides a decentralized, ZooKeeper‑coordinated big data workflow scheduling solution with robust fault tolerance, flexible priority handling, and remote log access, making it suitable for large‑scale data processing pipelines.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

DAGworkflowSchedulerloggingfault tolerance
Architecture Digest
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.