How Kuaishou Scales YARN to Tens of Thousands of Nodes with the Kwai Scheduler
This article explains how Kuaishou’s massive offline compute clusters—tens of thousands of machines processing hundreds of petabytes daily—are managed by a heavily customized YARN stack and the home‑grown Kwai Scheduler, detailing architecture, scheduler evolution, multi‑scenario optimizations, and future scaling plans.
1. Kuaishou Data Scale
Kuaishou operates offline compute clusters with tens of thousands of machines, handling hundreds of petabytes of data each day and running millions of jobs, posing significant challenges for storage, computation, and scheduling.
2. Kuaishou Big Data Architecture
The stack consists of an HDFS/HBase storage layer, a YARN resource‑scheduling layer, and an execution layer built from Flink, MapReduce, Spark, Presto, TensorFlow and other engines, topped by application platforms such as Flink job hosting, machine‑learning, and SQL submission services.
3. YARN Resource Scheduling System
YARN, introduced with Hadoop 2.0, separates cluster‑wide resource management (ResourceManager) from per‑application management (ApplicationMaster), moving from a single‑level to a two‑level scheduler architecture.
Stage 1: Scheduling triggered by heartbeats, with scheduling and heartbeat logic sharing a thread, causing interference.
Stage 2: Scheduling logic moved to a separate thread, but still contended for a global lock and suffered from costly queue sorting.
Stage 3: Introduction of a global scheduler allowing concurrent queue processing; performance improved but fairness and sorting overhead remained issues.
Kuaishou initially used the Fair Scheduler in a single‑threaded mode, then deployed a custom Kwai Scheduler that supports plug‑in scheduling policies and scales to clusters of tens of thousands of nodes.
4. Kwai Scheduler Overview
The Kwai Scheduler replaces the Fair Scheduler’s logic while keeping the ResourceManager’s RPC and event layers unchanged. Each scheduling round snapshots the cluster state, performs batch concurrent scheduling, and pushes results back to the cluster; applications retrieve containers via the existing heartbeat interface.
The scheduler’s workflow (illustrated below) uses the cluster snapshot to pre‑allocate resources for each application, then lets applications compete concurrently for free nodes. High‑performance is achieved by avoiding lock contention and allowing multi‑threaded, per‑application scheduling.
Scheduling strategies are implemented via filter and score interfaces. The filter removes unsuitable nodes, while the score assigns a numeric value to each node; lower scores are preferred. For example, a task‑scattering strategy gives nodes with more allocated application resources a lower score, spreading tasks across the cluster.
5. Multi‑Scenario Optimizations
Offline ETL
Resource contention from non‑priority queues is mitigated by controlled pre‑emptive stealing, limited to high‑priority core jobs.
Virtual queues provide logical isolation within physical queues, guaranteeing resources for high‑priority jobs.
AppSlot pre‑emptive stealing releases slots occupied by low‑priority jobs when high‑priority jobs are pending too long.
Back‑track jobs are throttled by limiting total resources and pending app counts, and by back‑pressuring upstream workflow systems.
Reserve‑slot pre‑emption reclaims resources from low‑priority containers held by reserved nodes.
Abnormal nodes are identified via metrics (failure rate, shuffle failures, etc.) and excluded from scheduling to avoid long‑tail failures.
Ad‑hoc Query
Virtual queues are created per user, enabling fair user‑level resource distribution and user‑level pre‑emption to prevent a single user from monopolizing resources.
Machine‑Learning Training
Training jobs require “all‑or‑nothing” resources; a round‑robin (FIFO‑like) policy guarantees the head of the queue receives enough resources to avoid deadlock.
Flink Real‑Time Jobs
CPU‑balancing scheduling avoids hotspot nodes.
ApplicationMaster failure avoidance prevents scheduling onto nodes with known AM failures.
NodeManager suspension (NM‑pause) stops new tasks on problematic nodes, reducing task loss.
Hawk provides second‑level node failure detection for rapid job recovery.
Resource‑redundancy allocation removes the need for separate resource acquisition and jar download steps, achieving near‑second recovery times.
6. Future Work
Support for ultra‑large clusters (hundreds of thousands of nodes) using community‑driven federation.
Cross‑IDC Hadoop cluster construction to handle limited inter‑IDC bandwidth.
Mixed deployment of offline workloads on idle online machines, requiring isolation and scheduling improvements.
Unified management of offline resources across YARN (offline) and Kubernetes (online) for more elastic provisioning.
Stream‑shuffle service to convert random I/O in MR/Spark shuffle to sequential I/O, boosting cluster compute capacity.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
