Big Data 14 min read

How Kuaishou Scales YARN to Tens of Thousands of Nodes with the Kwai Scheduler

This article explains how Kuaishou’s massive offline compute clusters—tens of thousands of machines processing hundreds of petabytes daily—are managed by a heavily customized YARN stack and the home‑grown Kwai Scheduler, detailing architecture, scheduler evolution, multi‑scenario optimizations, and future scaling plans.

dbaplus Community
dbaplus Community
dbaplus Community
How Kuaishou Scales YARN to Tens of Thousands of Nodes with the Kwai Scheduler

1. Kuaishou Data Scale

Kuaishou operates offline compute clusters with tens of thousands of machines, handling hundreds of petabytes of data each day and running millions of jobs, posing significant challenges for storage, computation, and scheduling.

2. Kuaishou Big Data Architecture

The stack consists of an HDFS/HBase storage layer, a YARN resource‑scheduling layer, and an execution layer built from Flink, MapReduce, Spark, Presto, TensorFlow and other engines, topped by application platforms such as Flink job hosting, machine‑learning, and SQL submission services.

3. YARN Resource Scheduling System

YARN, introduced with Hadoop 2.0, separates cluster‑wide resource management (ResourceManager) from per‑application management (ApplicationMaster), moving from a single‑level to a two‑level scheduler architecture.

Stage 1: Scheduling triggered by heartbeats, with scheduling and heartbeat logic sharing a thread, causing interference.

Stage 2: Scheduling logic moved to a separate thread, but still contended for a global lock and suffered from costly queue sorting.

Stage 3: Introduction of a global scheduler allowing concurrent queue processing; performance improved but fairness and sorting overhead remained issues.

Kuaishou initially used the Fair Scheduler in a single‑threaded mode, then deployed a custom Kwai Scheduler that supports plug‑in scheduling policies and scales to clusters of tens of thousands of nodes.

4. Kwai Scheduler Overview

The Kwai Scheduler replaces the Fair Scheduler’s logic while keeping the ResourceManager’s RPC and event layers unchanged. Each scheduling round snapshots the cluster state, performs batch concurrent scheduling, and pushes results back to the cluster; applications retrieve containers via the existing heartbeat interface.

The scheduler’s workflow (illustrated below) uses the cluster snapshot to pre‑allocate resources for each application, then lets applications compete concurrently for free nodes. High‑performance is achieved by avoiding lock contention and allowing multi‑threaded, per‑application scheduling.

Scheduling strategies are implemented via filter and score interfaces. The filter removes unsuitable nodes, while the score assigns a numeric value to each node; lower scores are preferred. For example, a task‑scattering strategy gives nodes with more allocated application resources a lower score, spreading tasks across the cluster.

5. Multi‑Scenario Optimizations

Offline ETL

Resource contention from non‑priority queues is mitigated by controlled pre‑emptive stealing, limited to high‑priority core jobs.

Virtual queues provide logical isolation within physical queues, guaranteeing resources for high‑priority jobs.

AppSlot pre‑emptive stealing releases slots occupied by low‑priority jobs when high‑priority jobs are pending too long.

Back‑track jobs are throttled by limiting total resources and pending app counts, and by back‑pressuring upstream workflow systems.

Reserve‑slot pre‑emption reclaims resources from low‑priority containers held by reserved nodes.

Abnormal nodes are identified via metrics (failure rate, shuffle failures, etc.) and excluded from scheduling to avoid long‑tail failures.

Ad‑hoc Query

Virtual queues are created per user, enabling fair user‑level resource distribution and user‑level pre‑emption to prevent a single user from monopolizing resources.

Machine‑Learning Training

Training jobs require “all‑or‑nothing” resources; a round‑robin (FIFO‑like) policy guarantees the head of the queue receives enough resources to avoid deadlock.

Flink Real‑Time Jobs

CPU‑balancing scheduling avoids hotspot nodes.

ApplicationMaster failure avoidance prevents scheduling onto nodes with known AM failures.

NodeManager suspension (NM‑pause) stops new tasks on problematic nodes, reducing task loss.

Hawk provides second‑level node failure detection for rapid job recovery.

Resource‑redundancy allocation removes the need for separate resource acquisition and jar download steps, achieving near‑second recovery times.

6. Future Work

Support for ultra‑large clusters (hundreds of thousands of nodes) using community‑driven federation.

Cross‑IDC Hadoop cluster construction to handle limited inter‑IDC bandwidth.

Mixed deployment of offline workloads on idle online machines, requiring isolation and scheduling improvements.

Unified management of offline resources across YARN (offline) and Kubernetes (online) for more elastic provisioning.

Stream‑shuffle service to convert random I/O in MR/Spark shuffle to sequential I/O, boosting cluster compute capacity.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataResource ManagementYARNCluster OptimizationKwai SchedulerLarge-Scale Scheduling
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.