
How Koordinator + KubeDL Revolutionize AI Model Training on Kubernetes

This article explains how the open‑source Koordinator scheduler, combined with KubeDL, tackles the resource‑intensive demands of large‑scale AI and LLM training on Kubernetes by introducing heterogeneous resource management, elastic quota, coscheduling, and fine‑grained GPU & RDMA allocation.

Alibaba Cloud Native

Background and Motivation

Since the release of ChatGPT in November 2022, large language models (LLMs) have drawn unprecedented interest, prompting enterprises and research labs to release new models almost weekly. Training models with billions of parameters requires massive, heterogeneous compute resources, high‑bandwidth and low‑latency networking, and robust scheduling that avoids resource contention and keeps jobs reliable.

Koordinator Overview

Koordinator is Alibaba Cloud’s open‑source, next‑generation scheduler built on lessons from its internal unified scheduling system. It runs on Kubernetes and supports mixed‑workload scheduling, aiming to improve runtime efficiency and reliability for latency‑sensitive jobs, batch tasks, big‑data workloads, and AI training.

Integrating Koordinator with KubeDL

In Alibaba Cloud’s AI training environment, Koordinator handles heterogeneous resource allocation while KubeDL manages the lifecycle and queuing of training jobs. KubeDL’s unified controller supports TensorFlow, PyTorch, Mars, and other frameworks, and it can use the gang scheduling that Koordinator provides, which lets existing KubeDL users migrate to Koordinator smoothly.

Job Scheduling and Queuing

Jobs are high‑level abstractions that can be split into parallel sub‑tasks. To prevent resource starvation, Koordinator uses a queueing mechanism where each job is evaluated based on priority, resource demand, and fairness. KubeDL’s extensible plugins (Filter, Score, etc.) further refine scheduling decisions.
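The ordering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not KubeDL's or Koordinator's actual queue: the `Job` fields, the `order_queue` helper, and the choice of sort keys (priority first, then the tenant's current usage as a fairness signal, then the size of the request) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int        # higher value is dequeued first
    gpus_requested: int  # total GPUs the job asks for
    tenant: str          # owner used for the fairness tie-break


def order_queue(jobs, gpus_in_use_by_tenant):
    """Return jobs in dequeue order.

    Sort keys (all hypothetical): priority descending, then the tenant's
    current GPU usage ascending (tenants using less go first), then the
    job's own demand ascending (smaller jobs go first).
    """
    return sorted(
        jobs,
        key=lambda j: (
            -j.priority,
            gpus_in_use_by_tenant.get(j.tenant, 0),
            j.gpus_requested,
        ),
    )


jobs = [
    Job("llm-pretrain", priority=5, gpus_requested=64, tenant="team-a"),
    Job("finetune", priority=5, gpus_requested=8, tenant="team-b"),
    Job("eval", priority=1, gpus_requested=1, tenant="team-b"),
]
usage = {"team-a": 128, "team-b": 16}
ordered = [j.name for j in order_queue(jobs, usage)]
print(ordered)
```

Here the two priority‑5 jobs are separated by fairness: team-b currently uses fewer GPUs than team-a, so its job is dequeued first even though both have the same priority.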

Elastic Quota for Fair Resource Sharing

Elastic quota provides three capabilities:

It ensures that no single job monopolizes resources.

It allows idle quota to be temporarily borrowed by high‑demand jobs and reclaimed later when the owner needs it.

It supports hierarchical (tree‑structured) quota management across namespaces, enabling complex organizational budgeting.

Koordinator implements ElasticQuota by adopting the ElasticQuota CRD from the community scheduler‑plugins project, which preserves compatibility with clusters that already use it.
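A simplified model of borrowing and reclaiming might look like the following. It is a sketch under assumptions, not Koordinator's algorithm: the `runtime_quota` function and the `(min, max, request)` tuple layout are hypothetical, idle capacity is lent out one unit at a time in round‑robin order rather than by any weighted policy, and the guaranteed minimums are assumed to sum to no more than the cluster total.

```python
def runtime_quota(total, quotas):
    """Compute how much each quota group may use right now.

    total  -- cluster capacity (for example, number of GPUs)
    quotas -- dict: name -> (min_guaranteed, max_cap, current_request)
    Assumes sum of min_guaranteed <= total (hypothetical simplification).
    """
    # Step 1: every group gets its request, capped at its guaranteed minimum.
    granted = {name: min(req, mn) for name, (mn, mx, req) in quotas.items()}
    idle = total - sum(granted.values())

    # Step 2: lend idle capacity round-robin to groups that still want
    # more, never exceeding their request or their max cap.
    progress = True
    while idle > 0 and progress:
        progress = False
        for name, (mn, mx, req) in quotas.items():
            if idle > 0 and granted[name] < min(req, mx):
                granted[name] += 1
                idle -= 1
                progress = True
    return granted


# team-b is mostly idle, so team-a borrows beyond its guaranteed 8 GPUs.
borrowed = runtime_quota(16, {"team-a": (8, 16, 14), "team-b": (8, 16, 2)})

# team-b's demand returns, so the borrowed share is reclaimed.
reclaimed = runtime_quota(16, {"team-a": (8, 16, 14), "team-b": (8, 16, 8)})
print(borrowed, reclaimed)
```

Recomputing the grant whenever requests change is what makes the quota elastic in this model: a group's guaranteed minimum is always satisfied first, and anything lent out is only what remained idle after that step.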

Coscheduling (All‑or‑Nothing) Support

When a job’s pod group is scheduled, all of its pods must acquire resources at the same time; otherwise the scheduling attempt fails. This prevents the deadlock that partial allocation can cause, for example when two jobs each hold half of the GPUs they need while waiting for the other to release its share. Koordinator integrates the community coscheduling plugin and provides a KubeDL Gang Scheduler for AI workloads.
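The all‑or‑nothing rule can be illustrated with a small placement routine. This is a minimal sketch, not the coscheduling plugin's implementation: `try_schedule_gang` is a hypothetical name, nodes are modeled as a plain dict of free GPUs, and a first‑fit, largest‑pod‑first heuristic stands in for real scheduling logic.

```python
def try_schedule_gang(pod_requests, node_free):
    """Place every pod of a gang, or place none of them.

    pod_requests -- list of GPU counts, one entry per pod
    node_free    -- dict: node name -> free GPUs (updated only on success)
    Returns dict pod index -> node name, or None if the gang does not fit.
    """
    tentative = dict(node_free)  # work on a copy so failure leaves no trace
    placement = {}

    # Largest pods first (first-fit decreasing heuristic).
    for idx in sorted(range(len(pod_requests)), key=lambda i: -pod_requests[i]):
        need = pod_requests[idx]
        node = next((n for n, free in tentative.items() if free >= need), None)
        if node is None:
            return None  # one pod cannot fit, so the whole gang is rejected
        tentative[node] -= need
        placement[idx] = node

    node_free.update(tentative)  # commit only when every pod has a node
    return placement


nodes = {"n1": 8, "n2": 4}
too_big = try_schedule_gang([8, 4, 4], nodes)   # third pod has nowhere to go
nodes_after_failure = dict(nodes)               # capacity is untouched
fits = try_schedule_gang([8, 4], nodes)
print(too_big, nodes_after_failure, fits, nodes)
```

The key design point is that resources are reserved on a copy and committed in a single step, so a failed attempt never leaves a job holding part of what it needs.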

Fine‑Grained Device Management

Kubernetes’ default device management via kubelet and device plugins lacks global optimization, joint GPU & RDMA allocation, and device sharing. Koordinator introduces a Device CRD that reports topology information to the scheduler. The koord‑runtime‑proxy intercepts CRI requests, injects device environment variables, and forwards them to the container runtime, enabling precise placement based on hardware topology.
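The intercept‑and‑inject step can be pictured as a pure function over a container‑create request. The sketch below is an assumption‑heavy illustration, not koord‑runtime‑proxy code: the request is modeled as a plain dict with a hypothetical `envs` field rather than a real CRI message, `inject_gpu_env` is a made‑up helper, and `NVIDIA_VISIBLE_DEVICES` (the variable read by the NVIDIA container runtime) is used as one plausible example of a device environment variable.

```python
import copy


def inject_gpu_env(create_request, gpu_minors):
    """Return a copy of the request with the allocated GPUs exposed via env.

    create_request -- dict standing in for a container-create request
                      (hypothetical shape, not the CRI protobuf)
    gpu_minors     -- GPU minor numbers chosen by the scheduler
    """
    patched = copy.deepcopy(create_request)  # never mutate the caller's request
    envs = patched.setdefault("envs", {})
    envs["NVIDIA_VISIBLE_DEVICES"] = ",".join(str(m) for m in gpu_minors)
    return patched


request = {"name": "trainer", "envs": {"WORLD_SIZE": "8"}}
patched = inject_gpu_env(request, [0, 1])
print(patched["envs"])
```

In this model the patched request is what would be forwarded to the container runtime, so the container only sees the devices the scheduler selected for it.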

GPU & RDMA Joint Allocation Based on Topology

Training large models benefits from placing GPUs close to RDMA NICs to minimize communication latency. Koordinator’s scheduler tries placements in order of preference: same PCIe switch first, then same NUMA node, then same NUMA socket, and finally cross‑NUMA. It accounts for the topology differences between GPU models such as NVIDIA A100 and H100, and it also enforces vendor‑defined GPU grouping rules for multi‑GPU allocation.
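The preference order can be expressed as a tier function over device topology. This is a sketch under assumptions rather than Koordinator's scorer: the `Device` fields, `affinity_tier`, and `best_pair` are hypothetical names, PCIe switch identifiers are assumed to be unique within a node, and vendor grouping rules are left out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Device:
    name: str
    pcie: str       # PCIe switch identifier (assumed unique per node)
    numa_node: int
    socket: int


def affinity_tier(gpu, nic):
    """Lower is better: 0 same PCIe switch, 1 same NUMA node,
    2 same socket, 3 cross-NUMA."""
    if gpu.pcie == nic.pcie:
        return 0
    if gpu.numa_node == nic.numa_node:
        return 1
    if gpu.socket == nic.socket:
        return 2
    return 3


def best_pair(gpus, nics):
    """Pick the GPU/NIC pair with the closest topological affinity."""
    return min(product(gpus, nics), key=lambda pair: affinity_tier(*pair))


gpus = [
    Device("gpu0", pcie="sw0", numa_node=0, socket=0),
    Device("gpu1", pcie="sw2", numa_node=2, socket=1),
]
nics = [
    Device("nic0", pcie="sw1", numa_node=0, socket=0),
    Device("nic1", pcie="sw2", numa_node=2, socket=1),
]
gpu, nic = best_pair(gpus, nics)
print(gpu.name, nic.name, affinity_tier(gpu, nic))
```

In this example gpu0 and nic0 share a NUMA node (tier 1), but gpu1 and nic1 sit behind the same PCIe switch (tier 0), so the latter pair is preferred.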

Future Directions: NRI/CDI Integration

Because the runtime‑proxy approach requires modifying kubelet startup parameters, which is cumbersome to roll out, the Koordinator community plans to adopt the NRI (Node Resource Interface) and CDI (Container Device Interface) mechanisms. The work is being done in collaboration with Intel and aims to provide a more portable device‑injection solution.

Community and Adoption

Koordinator has been open source for over a year. It has received contributions from many Chinese enterprises and is used extensively within Alibaba Cloud. The project welcomes community participation, holds bi‑weekly meetings, and invites feedback from users.
