AI Training Revives Gang Scheduling in Kubernetes for Elastic Resource Orchestration
The article examines how the rise of large‑model AI training reintroduces the need for gang scheduling in Kubernetes, contrasting the rigid resource requirements of HPC‑style workloads with cloud‑native elasticity, and outlines the historical evolution, current implementations, and future directions for achieving more flexible, high‑throughput compute orchestration.
In the narrative of cloud‑native technologies, “elasticity” has been a core keyword, but the explosion of large‑model AI training has re‑exposed a seemingly contradictory “rigidity” in resource demands. To achieve extreme training efficiency, distributed AI jobs require all‑or‑nothing resource allocation, clashing with Kubernetes’ native incremental scheduling.
AI Training Task Characteristics
Tight communication topology and synchronization barriers: Distributed training relies on frequent parameter synchronization (e.g., All-Reduce), which requires every worker to participate in gradient aggregation on each iteration; any missing worker blocks the entire job (see the sketch after this list).
Static and strict resource binding: Frameworks built on NCCL assign fixed Rank IDs and static IP-based topologies, making the job position-sensitive and requiring all nodes to be known up front.
Position sensitivity: Rank 0 and Rank 1 handle distinct data shards and cannot be swapped.
Full-launch constraint: The job must acquire the complete set of node IPs and ports before it can start.
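To make the all-or-nothing nature of the synchronization barrier concrete, here is a toy sketch in Go: the aggregation step cannot complete until every rank has reported, so a single missing worker would stall the whole iteration. The worker count and the "gradient" values are purely illustrative.

```go
// Toy illustration of a per-iteration synchronization barrier: every worker
// must contribute its gradient before the aggregated update can be applied.
// Worker count and the "gradient" payload are illustrative only.
package main

import "fmt"

func main() {
	const workers = 4
	grads := make(chan float64, workers)

	// Each worker computes a local gradient and reports it to the aggregator.
	for rank := 0; rank < workers; rank++ {
		go func(rank int) {
			grads <- float64(rank) * 0.1 // stand-in for a real gradient
		}(rank)
	}

	// The aggregation (all-reduce) step cannot finish until every rank has
	// reported; a single missing worker would block here forever.
	sum := 0.0
	for i := 0; i < workers; i++ {
		sum += <-grads
	}
	fmt.Printf("averaged gradient: %.3f\n", sum/workers)
}
```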
Why Default Kubernetes Scheduling Fails
The default FIFO, pod-centric scheduler can cause resource deadlocks. For example, if a cluster has only 4 free GPUs and two jobs each request 4 GPUs, each job may be granted 2 GPUs' worth of pods, leaving neither job able to form a complete communication ring while the hardware sits idle.
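A toy simulation makes the arithmetic concrete. The job names, queue order, and allocation loop below are illustrative only and are not how the real kube-scheduler is implemented.

```go
// Toy simulation of pod-by-pod allocation on a 4-GPU cluster with two
// 4-GPU jobs: interleaved scheduling gives each job 2 GPUs, and neither
// can ever start. All names and numbers are illustrative.
package main

import "fmt"

func main() {
	freeGPUs := 4
	need := 4 // each job needs 4 GPUs before it can start
	allocated := map[string]int{"job-a": 0, "job-b": 0}

	// Pod-centric scheduling interleaves the two jobs' pods.
	queue := []string{"job-a", "job-b", "job-a", "job-b", "job-a", "job-b", "job-a", "job-b"}
	for _, job := range queue {
		if freeGPUs > 0 && allocated[job] < need {
			allocated[job]++
			freeGPUs--
		}
	}

	for job, got := range allocated {
		fmt.Printf("%s: holds %d/%d GPUs, runnable=%v\n", job, got, need, got == need)
	}
	fmt.Printf("free GPUs left: %d (both jobs wait, neither runs)\n", freeGPUs)
}
```

Running it leaves each job holding 2 of 4 GPUs with nothing free: a deadlock created entirely by partial allocation.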
Gang Scheduling Design Principles
Atomicity check: Schedule only when the cluster has enough free resources to satisfy the entire gang.
Whole allocation: Allocate resources to all workers atomically once the condition is met.
Unified queuing: If resources are insufficient, keep the whole gang in the queue to avoid partial launches.
This mechanism eliminates deadlocks caused by partial allocation and ensures that compute resources are delivered effectively.
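The three principles can be sketched as a single gang-level admission check. The sketch below is an illustration of the idea under assumed job sizes, not Volcano's or any production scheduler's actual code.

```go
// Minimal sketch of gang-level admission: a job is admitted only when the
// cluster can satisfy the whole gang; otherwise the entire gang stays queued.
package main

import "fmt"

type GangJob struct {
	Name      string
	MinMember int // pods that must be co-scheduled
	GPUsEach  int
}

// tryAdmit allocates atomically: all of the gang's GPUs or none of them.
func tryAdmit(job GangJob, freeGPUs *int) bool {
	needed := job.MinMember * job.GPUsEach
	if *freeGPUs < needed {
		return false // unified queuing: leave the whole gang pending
	}
	*freeGPUs -= needed // whole allocation in one step
	return true
}

func main() {
	free := 4
	queue := []GangJob{
		{Name: "train-a", MinMember: 4, GPUsEach: 1},
		{Name: "train-b", MinMember: 4, GPUsEach: 1},
	}
	for _, job := range queue {
		if tryAdmit(job, &free) {
			fmt.Printf("%s admitted, free GPUs now %d\n", job.Name, free)
		} else {
			fmt.Printf("%s kept in queue (needs %d GPUs, only %d free)\n",
				job.Name, job.MinMember*job.GPUsEach, free)
		}
	}
}
```

With the same 4-GPU cluster as before, the first job is admitted in full and the second waits intact, so no resources are stranded.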
Historical Evolution of Gang Scheduling
HPC Era
In the 1990s‑2010s, high‑performance computing established the MPI standard and BSP (bulk‑synchronous parallel) model, which required strict, static resource allocation. Schedulers such as Slurm, LSF, and PBS implemented full‑reservation and back‑fill algorithms to guarantee atomic job execution while improving cluster utilization.
Big Data Era
With the advent of Google’s GFS and MapReduce, and later Spark, workloads became loosely coupled, focusing on throughput rather than strict synchronization. YARN’s design prioritized task‑level scheduling, and gang scheduling was largely unnecessary, appearing only as a special extension for MPI jobs.
AI Era
Modern AI training re‑adopts the tight coupling of HPC (All‑Reduce), but runs on Kubernetes, which was originally built for stateless micro‑services. Kubernetes schedules pods individually, lacking a global job view, leading to resource fragmentation and deadlocks for AI workloads.
Why Kubernetes Long Ignored Gang Scheduling
Kubernetes’ initial goal was high‑throughput handling of independent, stateless pods. Introducing gang scheduling would require managing pod groups and adding wait logic, increasing scheduler complexity and risking head‑of‑line blocking. Consequently, the community remained cautious about integrating gang logic into the core scheduler.
Community Solutions
Independent scheduler mode: Projects like Volcano run a separate batch scheduler that defines custom CRDs for jobs or pod groups, providing gang capabilities.
Scheduling framework plugin mode: With the scheduling framework (Kubernetes 1.15+), plugins such as the SIG-Scheduling Coscheduling plugin or Koordinator use the native scheduler's extension points. They declare a PodGroup CRD and use a Permit stage to hold pods until the minimum group size is met.
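The Permit-stage idea can be sketched in a few lines: each pod of a group is held in a waiting state until a minimum number of members have reached Permit, and then all are released together. The sketch below is a self-contained illustration of that mechanism, not the scheduler-plugins source.

```go
// Illustration of the Permit-stage pattern used by coscheduling plugins:
// pods of a group wait until minMember pods have arrived, then bind together.
package main

import (
	"fmt"
	"sync"
)

type podGroupPermit struct {
	mu        sync.Mutex
	minMember int
	arrived   int
	release   chan struct{} // closed once the group is complete
}

func newPodGroupPermit(minMember int) *podGroupPermit {
	return &podGroupPermit{minMember: minMember, release: make(chan struct{})}
}

// Permit blocks the pod until enough group members have arrived.
func (pg *podGroupPermit) Permit(pod string) {
	pg.mu.Lock()
	pg.arrived++
	if pg.arrived == pg.minMember {
		close(pg.release) // release every waiting pod at once
	}
	pg.mu.Unlock()

	<-pg.release
	fmt.Printf("%s admitted to bind\n", pod)
}

func main() {
	pg := newPodGroupPermit(3)
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			pg.Permit(fmt.Sprintf("worker-%d", i))
		}(i)
	}
	wg.Wait()
}
```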
Standardization Efforts
Kubernetes v1.35 plans to introduce a native Workload API, standardizing gang scheduling via built‑in CRDs (e.g., PodGroup) and integrating the permit mechanism directly into the core scheduler. This will unify implementations across Volcano, Koordinator, and other plugins.
Future Outlook: From Rigid to Elastic Resources
Even though the algorithmic need for global synchronization (BSP) remains, infrastructure evolution aims to mitigate rigidity:
In‑Network Computing (e.g., NVIDIA SHARP) offloads All‑Reduce to switch ASICs, reducing latency.
3D Parallelism combines data, tensor, and pipeline parallelism to hide compute‑communication gaps.
Zero‑Redundancy Optimizer shards model state across GPUs, trading bandwidth for memory savings.
These techniques still rely on synchronous training because asynchronous updates introduce gradient staleness, which harms convergence for large‑scale LLMs.
Synchronization Domain Convergence
By increasing per‑node compute density (e.g., NVLink/NVSwitch, “Super Node” designs), the physical radius of synchronization shrinks, turning massive cross‑rack coordination into intra‑rack or intra‑node communication, thereby providing more “elastic” resource availability.
Fast Recovery and Dynamic Topology
Asynchronous snapshots: Background persistence avoids blocking the compute path.
Memory-level state replication: Enables millisecond-scale rollback on failures.
Dynamic topology re-sharding: Detects node loss or newly idle resources and automatically recomputes sharding and data distribution, achieving elastic scaling and self-healing.
These advances transform AI training jobs from rigid “giant stones” into adaptable workloads that can sustain high effective throughput (goodput) despite resource fluctuations.
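As one concrete flavor of the asynchronous-snapshot idea above, the sketch below hands model state to a background persister so the compute path never blocks on checkpoint I/O. The file name, state layout, and snapshot cadence are illustrative assumptions, not a reference to any particular training framework.

```go
// Minimal sketch of an asynchronous snapshot: the training loop hands model
// state to a background goroutine that persists it, so training never stalls
// on checkpoint I/O. All names and sizes are illustrative.
package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

func main() {
	snapshots := make(chan []byte, 1) // at most one pending snapshot
	var wg sync.WaitGroup

	// Background persister: drains snapshots without stalling training.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for state := range snapshots {
			_ = os.WriteFile("checkpoint.bin", state, 0o644)
		}
	}()

	model := make([]byte, 8) // stand-in for real model state
	for step := 1; step <= 5; step++ {
		time.Sleep(10 * time.Millisecond) // stand-in for a training step
		model[0] = byte(step)

		// Non-blocking hand-off: if the persister is busy, skip this
		// snapshot rather than stalling the compute path.
		select {
		case snapshots <- append([]byte(nil), model...):
			fmt.Printf("step %d: snapshot queued\n", step)
		default:
			fmt.Printf("step %d: persister busy, skipped\n", step)
		}
	}
	close(snapshots)
	wg.Wait()
}
```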
Conclusion
Elasticity in compute orchestration is essential for achieving high cluster throughput in the era of expensive AI hardware. By revisiting gang scheduling, standardizing workload APIs, and leveraging hardware‑level optimizations, the community is moving toward a future where rigid resource requirements coexist with cloud‑native flexibility.