How ACK One Fleet Enables Scalable AI Workloads with Multi‑Cluster GPU Scheduling
ACK One Fleet, Alibaba Cloud's enterprise multi‑cluster solution, provides inventory‑aware elastic GPU scheduling, cross‑region resource sharing, multi‑cluster HPA, and model distribution, allowing AI inference and training workloads to scale efficiently, reduce costs, and maximize GPU utilization.
AI Workload on ACK One Fleet
As AI development accelerates, GPUs have become the core engine of AI innovation. AI inference services demand real‑time scaling to absorb traffic bursts, while AI training requires massive compute, creating challenges such as unevenly distributed GPUs, resource shortages, and high costs.
ACK One Fleet Overview
ACK One Fleet is Alibaba Cloud’s enterprise‑grade multi‑cluster management solution. It offers an intelligent compute scheduling engine that breaks resource boundaries, supports cross‑region and hybrid‑cloud multi‑cluster scheduling, quickly supplies compute, and improves GPU utilization. It also delivers end‑to‑end multi‑cluster collaboration for AI inference and AI jobs, helping enterprises achieve scalable, stable, and low‑cost AI deployment.
Capabilities for AI Inference Services
Inventory‑Aware Multi‑Cluster Elastic Scheduling: The Global Scheduler, combined with instant node elasticity, is aware of resource inventory across regions, providing rapid elastic GPU supply and hybrid‑cloud supplementation for scarce compute.
Dynamic Resource Scheduling: Calculates the maximum number of Pods each cluster can host based on its available resources and uses that figure as the weight for replica allocation.
Priority Preemption Scheduling: Supports priority‑based preemption using PriorityClass.
Static Weight Scheduling: Administrators assign weight coefficients to target clusters for proportional replica distribution.
Rescheduling: Re‑schedules Pods left pending by insufficient resources to clusters with available capacity.
Multi‑Cluster HPA: Enables horizontal pod autoscaling across clusters based on metrics aggregated from multiple sub‑clusters, including custom and external metrics.
Cross‑Region Model Distribution: Accelerates model distribution across regions to ensure fast service startup.
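To make the static‑weight mode above concrete, the sketch below uses the Karmada PropagationPolicy API purely as an illustration of the idea; ACK One Fleet's actual scheduling CRDs and field names may differ, and the workload and cluster names are hypothetical.

```yaml
# Illustrative only: Karmada-style PropagationPolicy showing static-weight
# replica splitting across two member clusters. ACK One Fleet's actual
# scheduling CRDs and field names may differ.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: inference-split
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: llm-inference                  # hypothetical workload name
  placement:
    clusterAffinity:
      clusterNames: [cluster-beijing, cluster-hangzhou]  # hypothetical clusters
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames: [cluster-beijing]
            weight: 2                      # receives ~2/3 of replicas
          - targetCluster:
              clusterNames: [cluster-hangzhou]
            weight: 1                      # receives ~1/3 of replicas
```

With dynamic resource scheduling, the weights would instead be derived from each cluster's computed Pod capacity rather than set statically.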
Capabilities for AI Jobs (Training, Data Processing, Offline Inference)
Support for Multiple Job Types: PyTorchJob, TFJob, SparkApplication, and Argo Workflows.
Multi‑Cluster Gang Scheduling: Guarantees all‑or‑nothing job placement across clusters via pre‑allocation or dynamic resource detection.
Multi‑Tenant Quota Management: Uses ElasticQuotaTree for namespace‑based resource limits and dynamic reallocation of idle quota.
Task Priority Scheduling: Prioritizes tasks based on the PriorityClass defined in the PodTemplate.
Job Failure Rescheduling: Recovers failed jobs and re‑schedules them to other suitable clusters.
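The quota management described above can be sketched with an ElasticQuotaTree; the field layout follows the ElasticQuotaTree CRD used in ACK's scheduler, but the team names, namespaces, and quota values here are hypothetical.

```yaml
# Illustrative ElasticQuotaTree: two teams share a GPU quota pool, with
# per-team min (guaranteed) and max (ceiling) limits. Names and values
# are hypothetical.
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system
spec:
  root:
    name: root
    max:
      nvidia.com/gpu: 8          # total GPUs available to the tree
    min:
      nvidia.com/gpu: 8
    children:
      - name: team-training
        namespaces:
          - training              # hypothetical namespace
        max:
          nvidia.com/gpu: 6       # can borrow idle quota up to 6
        min:
          nvidia.com/gpu: 4       # guaranteed share
      - name: team-inference
        namespaces:
          - inference             # hypothetical namespace
        max:
          nvidia.com/gpu: 6
        min:
          nvidia.com/gpu: 2
```

Because each team's max exceeds its min, idle quota can be dynamically reallocated between tenants, matching the dynamic‑reallocation behavior described above.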
Key Scenario: Inventory‑Aware Multi‑Cluster Elastic Scheduling
Two typical scenarios illustrate this capability:
Cross‑Region Multi‑Cluster GPU Elastic Supply: GPU types unevenly distributed across regions are unified under ACK One Fleet, allowing elastic compute provisioning to meet diverse inference demands.
Hybrid‑Cloud Multi‑Cluster Compute Supplement: When on‑premises IDC GPU resources run short, the fleet quickly supplements them with cloud resources, ensuring services can keep scaling.
The scheduler also supports cluster‑level priority: it prefers IDC Kubernetes clusters, falls back to cloud ACK clusters only when necessary, and scales in cloud replicas first to minimize cost.
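The IDC‑first preference can be pictured as ordered cluster‑affinity groups; the sketch below again borrows Karmada's PropagationPolicy syntax for illustration only, and ACK One Fleet's actual API and the cluster names are assumptions.

```yaml
# Illustrative only: ordered cluster-affinity groups. The scheduler tries
# the first group (IDC) and falls back to the second (cloud ACK) only when
# the first cannot satisfy the request. Cluster names are hypothetical.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: idc-first
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: llm-inference             # hypothetical workload
  placement:
    clusterAffinities:
      - affinityName: idc-preferred
        clusterNames: [idc-cluster]   # hypothetical on-premises cluster
      - affinityName: cloud-fallback
        clusterNames: [ack-cluster]   # hypothetical cloud ACK cluster
```

Scaling in cloud replicas first, as the article describes, then returns capacity to the cheaper on‑premises pool before touching IDC replicas.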
Multi‑Cluster HPA Details
Metrics are aggregated from sub‑clusters via metrics.k8s.io and custom.metrics.k8s.io. External metrics can be sourced from Prometheus or Alibaba Cloud SLS. The FederatedHPA Controller decides replica counts, scales Deployments, and works with the fleet’s multi‑cluster scheduling to expand or shrink workloads across clusters.
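A minimal shape for the multi‑cluster HPA described above is shown below, using the Karmada FederatedHPA API as an illustration of the pattern (an HPA‑v2‑style spec whose replicas are spread across clusters by the federated controller); ACK One's own object and the workload name are assumptions.

```yaml
# Illustrative FederatedHPA: scales a Deployment between 2 and 20 replicas
# on aggregated CPU utilization; the fleet's multi-cluster scheduling then
# distributes the replicas across sub-clusters. Names are hypothetical.
apiVersion: autoscaling.karmada.io/v1alpha1
kind: FederatedHPA
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference          # hypothetical workload
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale out above 70% aggregated CPU
```

Custom or external metrics (e.g., from Prometheus or SLS, as the article notes) would replace the `Resource` metric block with `Pods` or `External` metric types in the same HPA‑v2 style.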
Multi‑Cluster Spark Job Scheduling Based on Actual Remaining Resources
The fleet enables Spark jobs to be scheduled and dispatched across clusters according to real‑time remaining resources, maximizing idle resource utilization without affecting online services.
Three core components support this:
Multi‑cluster Spark scheduling and dispatch, aware of actual remaining resources.
ACK Koordinator’s offline colocation capability.
ACK Spark Operator, which translates SparkApplication resource requests into Koordinator Batch resources for scheduling.
In a single ACK cluster, Koordinator implements dynamic resource over‑commitment using Batch resources recorded as extended resources on nodes, allowing idle resources to be fully utilized. Administrators can configure reservation ratios.
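Koordinator's over‑committed capacity is consumed through its Batch extended resources (`kubernetes.io/batch-cpu`, `kubernetes.io/batch-memory`) together with a best‑effort QoS label; the Pod below is a minimal sketch of that pattern, with hypothetical names and image.

```yaml
# A best-effort Pod requesting Koordinator Batch (over-committed) resources.
# batch-cpu is expressed in milli-cores. Workload name and image are
# hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: offline-task
  labels:
    koordinator.sh/qosClass: BE         # best-effort QoS for safe colocation
spec:
  priorityClassName: koord-batch        # Koordinator's batch priority class
  containers:
    - name: worker
      image: registry.example.com/offline-task:latest  # hypothetical image
      resources:
        requests:
          kubernetes.io/batch-cpu: "1000"    # 1 core of reclaimed capacity
          kubernetes.io/batch-memory: "2Gi"
        limits:
          kubernetes.io/batch-cpu: "1000"
          kubernetes.io/batch-memory: "2Gi"
```

Because these requests draw only on reclaimed idle capacity, online services keep their guaranteed resources while offline tasks soak up the remainder.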
In multi‑cluster mode, the Global Scheduler performs gang scheduling, monitors Spark job status, and re‑schedules failed jobs to other clusters. PriorityClass and QoS ensure Spark jobs do not impact online services.
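Putting the pieces together, a Spark job submitted through the Spark Operator might look like the sketch below; per the article, ACK's Spark Operator translates these driver/executor requests into Koordinator Batch resources, and the BE QoS label keeps the job from impacting online services. The job name, image, and sizes are hypothetical.

```yaml
# Illustrative SparkApplication (Spark Operator v1beta2 API) colocated as
# a best-effort workload via Koordinator QoS labels. Names and image are
# hypothetical.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: nightly-etl
spec:
  type: Scala
  mode: cluster
  image: registry.example.com/spark:3.5.0   # hypothetical image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 2g
    labels:
      koordinator.sh/qosClass: BE   # run driver as best-effort
  executor:
    instances: 4
    cores: 2
    memory: 4g
    labels:
      koordinator.sh/qosClass: BE   # run executors as best-effort
```

In multi‑cluster mode, the Global Scheduler gang‑schedules the driver and executors into whichever cluster has enough remaining Batch capacity, and re‑dispatches the whole job elsewhere if it fails.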
Summary
ACK One Fleet provides a powerful multi‑cluster management platform for AI workloads, delivering rapid compute supply, high GPU utilization, multi‑cluster HPA, unified traffic gateway, and cross‑region model distribution, enabling end‑to‑end management and maximizing business value.