How JD.com’s Buffalo Scheduler Achieves High‑Performance, Scalable DAG Orchestration

Buffalo, JD.com’s in‑house distributed DAG scheduler, handles massive task volumes and complex dependencies through a dual‑layer entity model, instance‑based execution, tiered scheduling, a high‑availability architecture, event‑driven processing, in‑memory scheduling, and cold‑hot data separation. Together these deliver scalable, low‑latency ETL orchestration.

JD Cloud Developers

Jul 24, 2024

1. Introduction

Buffalo Scheduler is a distributed DAG job scheduling system developed by JD.com, providing offline job orchestration, debugging, monitoring, and resource containerization for data engineers, algorithm engineers, and analysts.

The core challenges are complex business dependencies that form massive DAGs, high task volume with strict stability and performance requirements, and diverse data processing scenarios that demand rich scheduling capabilities.

2. Core Technical Solutions

1. Entity and Orchestration Model

a) Dual‑layer Entity Model

The model defines two core concepts:

Action (step): The smallest execution unit containing script, parameters, environment, etc.

Task: A DAG composed of one or more actions plus trigger rules; tasks can depend on each other, forming an outer DAG for dual‑layer scheduling.

Compared with a single‑layer model, the two layers give stronger and more flexible orchestration: steps are ordered inside a task, and tasks are ordered against each other.
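The two layers can be sketched as two nested topological sorts. This is a minimal illustration, not Buffalo's implementation: the `Action` and `Task` field names, the `execution_plan` helper, and the example task names are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Action:
    """Smallest execution unit: a script plus its settings (hypothetical fields)."""
    name: str
    script: str
    depends_on: list[str] = field(default_factory=list)  # actions in the same task


@dataclass
class Task:
    """A DAG of actions; tasks depend on each other, forming the outer DAG."""
    name: str
    actions: list[Action]
    depends_on: list[str] = field(default_factory=list)  # upstream task names

    def action_order(self) -> list[str]:
        # Inner layer: topological order of the actions inside this task.
        graph = {a.name: set(a.depends_on) for a in self.actions}
        return list(TopologicalSorter(graph).static_order())


def execution_plan(tasks: list[Task]) -> list[tuple[str, str]]:
    """Flatten both layers into (task, action) pairs in a valid run order."""
    by_name = {t.name: t for t in tasks}
    outer = {t.name: set(t.depends_on) for t in tasks}
    plan = []
    # Outer layer: topological order of the tasks themselves.
    for task_name in TopologicalSorter(outer).static_order():
        for action_name in by_name[task_name].action_order():
            plan.append((task_name, action_name))
    return plan


extract = Task("extract", [Action("pull", "pull.sh")])
report = Task(
    "report",
    [
        Action("render", "render.py", depends_on=["aggregate"]),
        Action("aggregate", "agg.sql"),
    ],
    depends_on=["extract"],
)
plan = execution_plan([report, extract])
print(plan)
```

A real scheduler would run independent tasks and actions in parallel rather than flattening them into one sequence; the flat plan only makes the two levels of ordering visible.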

b) Instance‑Based Scheduling

Task definitions are stateless configurations. When a task reaches its run cycle, an instance is created (instantiation). Each instance is a snapshot that can be executed and holds state.

Advantages:

Stable cycles: Every cycle generates an instance, avoiding missing executions.

Clear dependencies: Instance dependencies are explicit, enabling quick traceability and repair.
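Instantiation can be illustrated with a stateless definition and a per-cycle snapshot. The sketch below is hypothetical: the class names, the fields, and the rule that an instance depends on its upstream task's instance for the same cycle are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta


@dataclass(frozen=True)
class TaskDefinition:
    """Stateless configuration (hypothetical fields)."""
    name: str
    script: str
    upstream: tuple[str, ...] = ()


@dataclass
class TaskInstance:
    """Snapshot of a definition for one cycle; the only place state lives."""
    task: str
    cycle: date
    script: str
    upstream_instances: list[tuple[str, date]]
    state: str = "WAITING"


def instantiate(defn: TaskDefinition, cycle: date) -> TaskInstance:
    # Copy the configuration into the instance so later edits to the
    # definition do not change what this cycle will run.
    return TaskInstance(
        task=defn.name,
        cycle=cycle,
        script=defn.script,
        # Assumption: depend on the upstream instance of the same cycle.
        upstream_instances=[(u, cycle) for u in defn.upstream],
    )


defn = TaskDefinition("daily_report", "report_v1.sql", upstream=("daily_orders",))
start = date(2024, 7, 22)
instances = [instantiate(defn, start + timedelta(days=i)) for i in range(3)]

# Editing the definition afterwards leaves existing snapshots untouched.
defn_v2 = replace(defn, script="report_v2.sql")
print([(i.cycle.isoformat(), i.script, i.state) for i in instances])
```

Because each instance names its upstream instances explicitly, a failed cycle can be traced to the exact upstream run it was waiting on.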

c) Classification‑Based Scheduling

Tasks are classified by importance; higher‑priority tasks receive resource guarantees during contention, and classification information propagates to the underlying compute clusters for tailored protection strategies.
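Resource guarantees under contention can be approximated with a priority queue. This is only a sketch under stated assumptions: the numeric levels (lower means more important), the slot-based resource model, and the `allocate` function are invented for illustration and say nothing about how Buffalo passes classification down to compute clusters.

```python
import heapq
from itertools import count


def allocate(pending, free_slots):
    """Grant slots in order of importance.

    pending: list of (task_name, level, slots_needed); a lower level means a
    more important task (hypothetical convention).
    Returns (granted, deferred) lists of task names.
    """
    order = count()  # tie-breaker keeps submission order within a level
    heap = [(level, next(order), name, need) for name, level, need in pending]
    heapq.heapify(heap)

    granted, deferred = [], []
    while heap:
        _level, _seq, name, need = heapq.heappop(heap)
        if need <= free_slots:
            free_slots -= need
            granted.append(name)
        else:
            deferred.append(name)  # waits for the next allocation round
    return granted, deferred


pending = [
    ("adhoc_export", 3, 2),
    ("core_settlement", 1, 3),
    ("daily_report", 2, 2),
]
granted, deferred = allocate(pending, free_slots=4)
print(granted, deferred)
```

With four free slots, the level 1 task is served first and the two lower-level tasks are deferred, even though they were submitted in a different order.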

2. High‑Availability Architecture

a) Manager Layer: Stateless management services for task creation, management, and operations, horizontally scalable.

b) High‑Availability Scheduler (NameNode): The core engine, handling instance generation, dual‑layer DAG scheduling, resource allocation, and state processing. It runs in an active‑active plus standby architecture with sharding and idempotent handling; the resource scheduler runs in standby mode for resilience.

c) Fault‑Tolerant Execution Layer: Executes tasks on physical machines or containerized k8s pods. Workers run as long‑lived processes with message retransmission and cgroup isolation; k8s pods provide short‑lived, inherently highly available execution.
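Message retransmission on the worker side only works if the scheduler handles state updates idempotently, since a retried message may arrive more than once. One common way to do this is to deduplicate on a message ID, sketched below. The `InstanceStateStore` class and the idea that each worker message carries a unique ID are assumptions for illustration; the article does not describe Buffalo's actual mechanism.

```python
class InstanceStateStore:
    """Applies each state-change message at most once (hypothetical design)."""

    def __init__(self):
        self.state = {}    # instance id -> latest state
        self.seen = set()  # message ids already applied
        self.applied = 0   # counter, only to make the demo observable

    def apply(self, message_id: str, instance: str, new_state: str) -> bool:
        """Return True if the message changed state, False if it was a duplicate."""
        if message_id in self.seen:
            return False  # retransmitted message: safe to ignore
        self.seen.add(message_id)
        self.state[instance] = new_state
        self.applied += 1
        return True


store = InstanceStateStore()
first = store.apply("msg-001", "daily_report@2024-07-24", "RUNNING")
retry = store.apply("msg-001", "daily_report@2024-07-24", "RUNNING")  # resend
done = store.apply("msg-002", "daily_report@2024-07-24", "SUCCESS")
print(first, retry, done, store.state)
```

In a production system the set of seen IDs would need to be persisted and bounded; the in-memory set here keeps the sketch short.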

3. High Performance

1) Horizontal Scaling

The scheduler’s active‑active design distributes load via data hash sharding, enabling horizontal scaling across multiple services.
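A minimal sketch of hash sharding follows, assuming the shard key is a task identifier and that shards map one-to-one onto scheduler nodes. The key choice, the SHA-256 hash, and the modulo mapping are assumptions made for this example, not details from Buffalo.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a hash that is stable across processes."""
    # Python's built-in hash() is salted per process for strings, so a
    # cryptographic digest is used to get the same answer on every node.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def assign(keys, schedulers):
    """Group keys by the scheduler node that owns their shard."""
    buckets = {s: [] for s in schedulers}
    for key in keys:
        owner = schedulers[shard_for(key, len(schedulers))]
        buckets[owner].append(key)
    return buckets


schedulers = ["scheduler-0", "scheduler-1", "scheduler-2"]  # hypothetical names
keys = [f"task-{i}" for i in range(300)]
buckets = assign(keys, schedulers)
print({name: len(owned) for name, owned in buckets.items()})
```

Plain modulo sharding remaps most keys when the number of nodes changes; schemes such as consistent hashing reduce that movement, at the cost of more bookkeeping.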

2) Event‑Driven Execution

a) Timed Polling: Traditional approach scans all pending instances, leading to high traversal cost and many unnecessary checks.

b) Event‑Driven: Triggers condition checks only when dependent states change, avoiding full scans and allowing asynchronous parallel processing for different event types.
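The difference between the two approaches can be made concrete with a small sketch. Here a success event triggers checks only on the direct downstream instances, and a counter records how many checks were performed. The `EventDrivenScheduler` class, its state names, and the instance IDs are hypothetical.

```python
from collections import defaultdict


class EventDrivenScheduler:
    """Re-checks an instance only when one of its upstreams changes state."""

    def __init__(self):
        self.state = {}                     # instance id -> state
        self.upstream = {}                  # instance id -> set of upstream ids
        self.downstream = defaultdict(set)  # instance id -> dependent ids
        self.checks = 0                     # dependency checks performed

    def add(self, inst, upstream=()):
        self.upstream[inst] = set(upstream)
        # Instances without upstreams have nothing to wait for.
        self.state[inst] = "WAITING" if self.upstream[inst] else "READY"
        for u in upstream:
            self.downstream[u].add(inst)

    def on_success(self, inst):
        """Handle a success event; return instances that just became ready."""
        self.state[inst] = "SUCCESS"
        ready = []
        for d in sorted(self.downstream[inst]):
            self.checks += 1
            if self.state[d] == "WAITING" and all(
                self.state[u] == "SUCCESS" for u in self.upstream[d]
            ):
                self.state[d] = "READY"
                ready.append(d)
        return ready


sched = EventDrivenScheduler()
sched.add("orders@07-24")
sched.add("users@07-24")
sched.add("report@07-24", upstream=["orders@07-24", "users@07-24"])
for i in range(1000):  # unrelated instances a polling loop would also scan
    sched.add(f"unrelated-{i}@07-24")

first = sched.on_success("orders@07-24")
second = sched.on_success("users@07-24")
print(first, second, sched.checks)
```

Two events cause two checks, regardless of the 1,000 unrelated instances; a polling loop would visit every pending instance on each pass.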

3) In‑Memory Scheduling

The resource scheduler runs in primary‑standby mode alongside the active scheduler, so all resource information resides in memory. This removes the need for distributed locks and avoids external‑storage bottlenecks, greatly improving scheduling performance.

4) Cold‑Hot Data Separation

Task instances generate massive data (≈1 million new records daily). Because tasks are periodic, completed instances become static (cold data), while active instances remain hot. Cold data is stored separately, with index tables for fast lookup; occasional operations on it (e.g., a re‑run) move the data back to the hot tables when needed.
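The archive and restore flow can be sketched with two tables in an in-memory SQLite database. The table names, columns, and state values are assumptions for illustration, and a plain index on the cold table stands in for the separate index tables the article mentions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hot_instances  (id INTEGER PRIMARY KEY, task TEXT, cycle TEXT, state TEXT);
CREATE TABLE cold_instances (id INTEGER PRIMARY KEY, task TEXT, cycle TEXT, state TEXT);
CREATE INDEX cold_by_task_cycle ON cold_instances (task, cycle);
""")

FINISHED = "('SUCCESS', 'FAILED')"  # hypothetical terminal states


def archive_finished(conn):
    """Move finished instances from the hot table to the cold table."""
    with conn:  # one transaction: copy, then delete
        conn.execute(
            f"INSERT INTO cold_instances "
            f"SELECT * FROM hot_instances WHERE state IN {FINISHED}"
        )
        cur = conn.execute(f"DELETE FROM hot_instances WHERE state IN {FINISHED}")
    return cur.rowcount


def restore_for_rerun(conn, task, cycle):
    """Move one archived instance back to the hot table, reset to WAITING."""
    with conn:
        conn.execute(
            "INSERT INTO hot_instances "
            "SELECT id, task, cycle, 'WAITING' FROM cold_instances "
            "WHERE task = ? AND cycle = ?",
            (task, cycle),
        )
        cur = conn.execute(
            "DELETE FROM cold_instances WHERE task = ? AND cycle = ?",
            (task, cycle),
        )
    return cur.rowcount


with conn:
    conn.executemany(
        "INSERT INTO hot_instances VALUES (?, ?, ?, ?)",
        [
            (1, "daily_report", "2024-07-22", "SUCCESS"),
            (2, "daily_report", "2024-07-23", "RUNNING"),
            (3, "daily_orders", "2024-07-22", "FAILED"),
        ],
    )

archived = archive_finished(conn)
restored = restore_for_rerun(conn, "daily_orders", "2024-07-22")
hot_rows = conn.execute(
    "SELECT id, state FROM hot_instances ORDER BY id"
).fetchall()
print(archived, restored, hot_rows)
```

Keeping the hot table small means the scheduler's frequent queries touch only active instances, while the rare re-run pays the cost of one indexed lookup and a row move.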

3. Open Capabilities

Open API: HTTP‑based interfaces for task configuration, instance operations, status and log queries, exposed via JD’s internal service gateway.

Open Events: Asynchronous JDQ messages broadcast task and instance state changes for downstream business integration.

4. Future Roadmap

Buffalo will continue to improve user experience, performance, containerization, plugin extensibility, open interfaces, and fine‑grained resource management. The team invites community feedback to further improve stability and efficiency.
