Design and Architecture of JD's Buffalo Distributed Workflow Scheduling System
This article examines JD's self‑developed Buffalo distributed workflow scheduling system for big‑data ETL, detailing its two‑layer entity model, instance‑based scheduling, high‑availability three‑layer architecture, performance optimizations, cold‑hot data separation, and open APIs to support massive, complex data pipelines.
In big‑data processing, workflow task scheduling plays a critical role, requiring flexible orchestration, diverse scheduling policies, and high stability and efficiency. This article explores JD's self‑developed distributed workflow scheduling system, focusing on its key features and technical architecture.
Buffalo is JD's proprietary distributed DAG job scheduling system that provides offline job orchestration, debugging, monitoring, and DAG scheduling for data engineers, algorithm engineers, and analysts. Its goal is to deliver an industry‑leading, stable, efficient, user‑friendly ETL scheduling platform with comprehensive monitoring, containerized resources, and open capabilities.
The core challenges addressed include: (1) Complex business logic leading to intricate dependency graphs, where tasks may have hundreds or thousands of upstream/downstream links forming deep DAGs; (2) Massive scale and stringent stability/performance demands, with hundreds of thousands of tasks, millions of dependencies, and daily million‑level scheduling frequencies; (3) Rich data‑processing scenarios requiring support for various task types, execution modes, trigger rules, data passing, and back‑fill capabilities.
To meet these challenges, the system focuses on three aspects: usability, stability, and high performance.
Two‑layer entity model: The system adopts a dual‑layer model consisting of Action (the smallest execution unit containing script, parameters, environment, etc.) and Task (one or more actions with trigger rules forming a DAG; tasks can depend on each other, creating an outer DAG). This model offers stronger orchestration ability and flexibility compared with a single‑layer design.
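The two-layer model can be sketched in a few dataclasses. This is a minimal illustration of the Action/Task split described above; the class and field names are my own and not Buffalo's actual schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Action:
    """Smallest execution unit: a script plus its runtime context."""
    name: str
    script: str
    params: Dict[str, str] = field(default_factory=dict)
    env: Dict[str, str] = field(default_factory=dict)

@dataclass
class Task:
    """One or more actions wired into an inner DAG; tasks themselves form an outer DAG."""
    name: str
    actions: Dict[str, Action] = field(default_factory=dict)
    # inner DAG: action name -> names of its upstream actions
    action_deps: Dict[str, List[str]] = field(default_factory=dict)
    # outer DAG: names of upstream tasks this task waits on
    upstream_tasks: List[str] = field(default_factory=list)

# A task whose "load" action runs only after "extract" completes
etl = Task(
    name="daily_etl",
    actions={
        "extract": Action("extract", "extract.sh"),
        "load": Action("load", "load.py"),
    },
    action_deps={"load": ["extract"]},
)
```

The split keeps orchestration concerns (the Task's DAGs and trigger rules) separate from execution concerns (the Action's script and environment), which is the flexibility advantage the article attributes to the dual-layer design.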
Instance‑based scheduling: Task definitions are stateless; when a task reaches its execution cycle, a corresponding task instance is generated. Instances are executable, stateful objects. Benefits include stable periodic execution (no missing cycles) and clear, predictable dependency relationships.
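The definition-versus-instance split can be sketched as follows. Assuming a simple daily cycle for illustration (Buffalo's actual trigger rules are richer), the stateless definition is expanded into one stateful instance per cycle, so no cycle can be silently skipped:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterator

@dataclass(frozen=True)
class TaskDef:
    """Stateless task definition: no execution state lives here."""
    name: str
    cycle_days: int = 1  # illustrative: one run per N days

@dataclass
class TaskInstance:
    """Executable, stateful object generated for one cycle."""
    task_name: str
    data_date: date
    state: str = "WAITING"  # e.g. WAITING -> RUNNING -> SUCCESS/FAILED

def generate_instances(task: TaskDef, start: date, end: date) -> Iterator[TaskInstance]:
    """Emit exactly one instance per cycle in [start, end]."""
    d = start
    while d <= end:
        yield TaskInstance(task.name, d)
        d += timedelta(days=task.cycle_days)

insts = list(generate_instances(TaskDef("daily_etl"), date(2024, 1, 1), date(2024, 1, 3)))
# three cycles -> three instances, each carrying its own state
```

Because each instance pins a concrete data date, downstream dependencies can be resolved against a specific cycle rather than against a mutable task definition.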
Classification and hierarchical scheduling: The platform provides task classification and level‑based scheduling, ensuring that critical business tasks receive priority under resource constraints, with level information propagated to underlying clusters for guaranteed stability.
High‑availability architecture: Buffalo is divided into three layers, each built with high availability in mind. The Manager layer handles stateless task creation, management, and operations, and scales horizontally. The Scheduler layer is the core scheduling engine, running in a multi‑active plus standby architecture: tasks are sharded across nodes, state messages are handled idempotently, and resource scheduling uses a master‑standby mode. The execution layer launches and runs tasks on both physical machines (workers/TaskNodes) and Kubernetes‑based containers, each path with its own fault‑tolerance features.
High performance: Horizontal scaling is achieved via a multi‑active scheduler that hashes tasks to distribute load across many services. An event‑driven model replaces traditional polling, reducing unnecessary computation. Memory scheduling and a master‑standby resource scheduler avoid distributed locks and external storage, boosting throughput. Cold‑hot data separation handles the massive growth of task instance data (over a million new instances daily), with strategies for data archiving, index tables, primary‑key based partition locating, and mechanisms to restore cold data to hot tables when operations are required.
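The hash-based sharding that spreads tasks across multi-active scheduler nodes can be sketched like this. The function name and the use of MD5 are illustrative assumptions; the point is only that a stable hash pins each task to one scheduler, so no distributed lock is needed to decide ownership:

```python
import hashlib

def shard_for(task_id: str, num_schedulers: int) -> int:
    """Map a task to a scheduler node deterministically.

    A stable hash of the task id means every node can compute,
    without coordination, which scheduler owns which task.
    """
    digest = hashlib.md5(task_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_schedulers

# The same task always lands on the same node for a fixed cluster size
owner = shard_for("daily_etl", num_schedulers=3)
```

One trade-off of plain modulo hashing is that resizing the scheduler cluster reshuffles most assignments; consistent hashing is a common refinement when that matters.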
Open capabilities: The system offers open HTTP APIs for task configuration, instance operations, status queries, and log retrieval, as well as asynchronous event notifications via JDQ to keep downstream systems synchronized.
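A status-query call against such an HTTP API might be built as below. The base URL, endpoint path, and query parameters here are hypothetical placeholders, not Buffalo's published API; the sketch only constructs the request without sending it:

```python
from urllib import request

BASE = "https://buffalo.example.com/api"  # hypothetical base URL

def build_status_request(task_name: str, data_date: str) -> request.Request:
    """Build (but do not send) an instance-status query.

    Endpoint and parameter names are illustrative assumptions.
    """
    url = f"{BASE}/instances/status?task={task_name}&date={data_date}"
    return request.Request(url, headers={"Accept": "application/json"})

req = build_status_request("daily_etl", "2024-01-01")
# req would be dispatched with urllib.request.urlopen(req) against a real endpoint
```

For state changes that must reach downstream systems promptly, the asynchronous JDQ event notifications mentioned above avoid having consumers poll such status endpoints.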
Future plans: Buffalo will continue to iterate with enhancements such as containerization, plugin extensibility, richer open capabilities, and fine‑grained resource management, inviting user feedback to build a more stable, efficient, and user‑friendly scheduling platform.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.