Kwai Scheduler: Scaling YARN for Ultra‑Large Clusters at Kuaishou
This article presents Kuaishou's large‑scale offline computing challenges and describes how the team customized YARN and built the Kwai scheduler to achieve multi‑threaded, pluggable resource scheduling for clusters of tens of thousands of nodes, supporting diverse workloads such as ETL, ad‑hoc queries, machine‑learning training, and real‑time Flink jobs.
Guest: Fang Xiaojing, Big Data Architect at Kuaishou
Editor: Qi Laijun
Platform: DataFunTalk
Overview: Rapid business growth has led to a massive increase in offline compute cluster size and job submissions at Kuaishou. To support ultra‑large clusters and diverse scheduling scenarios, the Kuaishou big‑data team heavily customized YARN, enabling flexible resource scheduling across multiple workloads.
Presentation Outline:
Scheduling background and Kuaishou data scale
Introduction to the Kwai scheduler
Optimizations for multiple scheduling scenarios
Other work and future plans
01 – Kuaishou Data Scale & Scenarios
Kuaishou operates offline compute clusters with tens of thousands of machines, processing hundreds of petabytes of data daily and handling millions of jobs, posing significant storage, computation, and scheduling challenges.
The architecture consists of:
Storage layer built on HDFS/HBase for massive data storage.
YARN resource‑scheduling layer for million‑scale job and task dispatch.
Execution layer with various engines such as Flink, MapReduce, Spark, Presto, and TensorFlow.
Application layer providing services like Flink job hosting, machine‑learning platforms, and SQL submission portals.
This talk focuses on the YARN layer, which rapidly schedules tasks from the compute engines onto suitable machines.
02 – Kwai Scheduler Introduction
The Kwai scheduler was created to overcome performance bottlenecks in the native YARN fair scheduler, whose single-threaded loop dispatches only one task per scheduling round and must repeatedly perform costly sorting of queues and applications.
Key features:
Multi‑threaded batch scheduling that can dispatch tens of thousands of tasks per round.
Pluggable scheduling architecture: first select an app, then select nodes for it, allowing custom strategies per scenario.
Global cluster‑state snapshot is taken at each scheduling cycle; resources are pre‑allocated to apps based on this snapshot.
Apps concurrently compete for free node resources; successful competition completes the app’s resource allocation.
Scheduling strategies implement filter (node filtering) and score (node ranking) interfaces.
Typical strategy examples include task scattering (lower score for nodes with many allocated app resources) and priority‑based node selection.
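The filter/score contract and snapshot-based selection described above can be sketched as follows. This is a minimal illustration with hypothetical class and field names (the actual Kwai scheduler interfaces are internal and not public), showing a scatter strategy that penalizes nodes already hosting the app:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_vcores: int
    # vcores already allocated on this node, keyed by app id
    app_allocated: dict = field(default_factory=dict)

class ScatterStrategy:
    """Prefer nodes where the app holds few resources, spreading its tasks."""
    def filter(self, app_id, nodes, vcores_needed):
        # node filtering: keep only nodes with enough free capacity
        return [n for n in nodes if n.free_vcores >= vcores_needed]

    def score(self, app_id, node):
        # node ranking (lower is better): penalize nodes already
        # holding resources for this app, so tasks scatter
        return node.app_allocated.get(app_id, 0)

def schedule(app_id, vcores_needed, snapshot, strategy):
    """Pick the best node for one request against a cluster snapshot."""
    candidates = strategy.filter(app_id, snapshot, vcores_needed)
    if not candidates:
        return None
    best = min(candidates, key=lambda n: strategy.score(app_id, n))
    best.free_vcores -= vcores_needed
    best.app_allocated[app_id] = best.app_allocated.get(app_id, 0) + vcores_needed
    return best.name
```

In the real scheduler this selection runs multi-threaded per app against a shared snapshot, with apps then competing to commit their allocations onto the live nodes.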
Online performance after deployment: single clusters now reach tens of thousands of nodes, >10,000 concurrent jobs, peak scheduling throughput >50k/s, and resource allocation rates above 93%.
03 – Multi‑Scenario Optimizations
Offline ETL: Core jobs share queues with regular jobs; job‑level prioritization and resource pre‑emption ensure SLA for critical workloads.
Virtual Queues: Logical queues derived from physical queues allocate resource quotas to high‑priority jobs, triggering internal pre‑emption when needed.
App‑Slot Pre‑emption: When high‑priority jobs are pending, low‑priority running apps are put into sleep mode or have their containers reclaimed to free slots.
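The slot-reclaim logic can be sketched roughly as below; an illustrative simplification with invented field names, not the production implementation:

```python
def free_slots_for(pending_priority, running_apps, max_slots):
    """If a higher-priority app is pending and all app slots are taken,
    put the lowest-priority running apps to sleep until a slot frees up.
    Returns the ids of the apps put to sleep."""
    slept = []
    while len(running_apps) >= max_slots:
        victim = min(running_apps, key=lambda a: a["priority"])
        if victim["priority"] >= pending_priority:
            break  # no running app has lower priority; nothing to reclaim
        running_apps.remove(victim)  # sleep it, releasing its slot
        slept.append(victim["id"])
    return slept
```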
Backfill Jobs: Limit maximum resources and running app counts for backfill jobs; integrate with upstream workflow schedulers to apply back‑pressure and cap pending apps.
Reserve Pre‑emption: Reclaim resources from nodes reserved by low‑priority containers to satisfy high‑priority requests.
Abnormal Node Avoidance: Nodes with high failure rates, slow task execution, or shuffle errors are marked abnormal and excluded from new task scheduling.
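A sliding-window failure-rate check is one simple way to implement this kind of exclusion. The sketch below uses made-up threshold values purely for illustration; the article does not describe Kuaishou's actual detection criteria:

```python
from collections import deque

class NodeHealthTracker:
    """Mark a node abnormal when its recent task-failure rate
    exceeds a threshold over a sliding window."""
    def __init__(self, window=100, max_failure_rate=0.3):
        self.max_failure_rate = max_failure_rate
        self.window = window
        self.results = {}  # node -> deque of recent outcomes (True = failed)

    def record(self, node, failed):
        self.results.setdefault(node, deque(maxlen=self.window)).append(failed)

    def is_abnormal(self, node):
        outcomes = self.results.get(node)
        if not outcomes or len(outcomes) < 10:
            return False  # too few samples to judge
        return sum(outcomes) / len(outcomes) > self.max_failure_rate

    def schedulable_nodes(self, nodes):
        # exclude abnormal nodes from new task scheduling
        return [n for n in nodes if not self.is_abnormal(n)]
```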
Ad‑hoc Queries: User‑level virtual queues provide fair resource distribution and enable user‑based pre‑emption to prevent resource monopolization.
Machine‑Learning Training: FIFO‑style app rotation guarantees that head‑of‑queue apps receive full resource allocations, avoiding deadlocks.
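The deadlock this avoids arises when several gang-scheduled training jobs are each partially satisfied and all stall waiting for the rest of their containers. A minimal sketch of head-of-queue allocation, with hypothetical field names:

```python
def fifo_allocate(queue, free_vcores):
    """Serve apps strictly in FIFO order, granting each app its *full*
    demand or nothing. Stopping at an unsatisfiable head prevents the
    deadlock where many training jobs each hold a partial allocation."""
    granted = []
    for app in list(queue):
        if app["demand"] <= free_vcores:
            free_vcores -= app["demand"]
            granted.append(app["id"])
            queue.remove(app)
        else:
            break  # head cannot be fully served; do not skip ahead
    return granted, free_vcores
```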
Flink Real‑Time Jobs: Techniques include CPU‑balanced scheduling, AM‑failure node avoidance, NM‑suspend mechanisms, and rapid node‑failure detection via Hawk for sub‑second job recovery.
04 – Other Work & Future Plans
Support for ultra‑large clusters (targeting 100k nodes) using community‑driven federation.
Cross‑IDC Hadoop clusters: Optimize data and compute placement under limited inter‑IDC bandwidth.
Mixed offline/online resource deployment: Leverage idle online machines for offline tasks while ensuring isolation.
Unified resource management: Explore tighter integration between YARN (offline) and Kubernetes (online) for elastic resource sharing.
Shuffle service development: Replace random I/O in MR/Spark shuffle with sequential I/O via a streaming shuffle service, improving compute efficiency and reducing I/O impact.
For questions or collaboration opportunities, feel free to contact the speaker. Thank you for listening.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.