How Multi-Cluster Smart Scheduling Cuts AI Inference Costs with ACK One
This article explains how Alibaba Cloud's ACK One fleet uses inventory‑aware multi‑cluster elastic scheduling to dynamically allocate GPU resources across regions, reducing AI inference costs while ensuring high availability and seamless scaling for large‑model services.
ACK One Multi-Cluster Scheduling and Application Distribution
Alibaba Cloud's Distributed Cloud Container Platform (ACK One) provides enterprise‑grade multi‑cluster management: Kubernetes clusters running on other public clouds or in on‑premises IDCs can be registered and managed uniformly from the ACK console. The fleet built on top of them enables cross‑cluster application distribution, traffic management, observability, and security.
Multi-Cluster Scheduling Capabilities
Elastic scheduling across clusters: when a sub‑cluster runs short of resources, the fleet senses GPU inventory across the other clusters and automatically redirects workloads to one with available stock, expanding its node pool as needed.
Static weights and dynamic scheduling: users can either pin replica distribution ratios per cluster (static weights) or let the fleet dynamically place more replicas in clusters with more available resources (see the first sketch after this list).
Gang scheduling: deep integration with PyTorchJob and Spark applications allows all‑or‑nothing coordinated scheduling of distributed jobs across multiple clusters, maximizing resource utilization and supporting multi‑tenant quota management (see the second sketch after this list).
Rescheduling: the fleet continuously monitors pods, automatically redeploying failed replicas to maintain service health.
Application‑level fault migration: if a deployment's ready replica count drops below the expected value, the fleet triggers cross‑cluster migration to restore the application.
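To make the static/dynamic distinction concrete, the first sketch below shows what a static‑weight policy conceptually looks like. The apiVersion, kind, and all field names here are illustrative placeholders rather than the literal ACK One schema, so consult the ACK One documentation for the real CRD; both the policy and the workload it targets are applied against the fleet instance's kubeconfig.

```yaml
# Hypothetical sketch of a static-weight distribution policy.
# apiVersion, kind, field names, and cluster names are placeholders,
# not the literal ACK One API.
apiVersion: scheduling.example.com/v1alpha1
kind: DistributionPolicy
metadata:
  name: qwen3-8b-weights
spec:
  targetRef:                      # the workload this policy governs
    apiVersion: apps/v1
    kind: Deployment
    name: qwen3-8b
  placement:
    staticWeights:                # fixed replica split across sub-clusters
      - cluster: cluster-beijing
        weight: 2                 # ~2/3 of replicas
      - cluster: cluster-hangzhou
        weight: 1                 # ~1/3 of replicas
    # Omitting staticWeights would leave placement to the dynamic
    # scheduler, which favors clusters with more free GPU inventory.
```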
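Gang scheduling, by contrast, is expressed on the workload itself. The second sketch is a minimal Kubeflow PyTorchJob whose four pods are only scheduled when all of them can start together; the pod‑group label keys follow the convention used by ACK's gang scheduling, but verify them against your scheduler version, and the training image is a placeholder.

```yaml
# All-or-nothing scheduling for a 1-master + 3-worker PyTorchJob.
# Every pod carries the same pod-group name and a min-available of 4,
# so no pod starts unless all four can be placed.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: qwen-finetune
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        metadata:
          labels:
            pod-group.scheduling.sigs.k8s.io/name: qwen-finetune
            pod-group.scheduling.sigs.k8s.io/min-available: "4"
        spec:
          containers:
            - name: pytorch                             # required container name
              image: registry.example.com/train:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      template:
        metadata:
          labels:
            pod-group.scheduling.sigs.k8s.io/name: qwen-finetune
            pod-group.scheduling.sigs.k8s.io/min-available: "4"
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/train:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
```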
Basic Principles of Multi-Cluster Elastic Scheduling
The user first creates an application and a distribution policy on the fleet (the two kinds of objects sketched in the previous section). When the fleet scheduler detects insufficient resources in a sub‑cluster, it invokes ACK GOATScaler to check GPU inventory across clusters. Based on the inventory result, the scheduler redirects the workload to a cluster with available stock, after which GOATScaler expands that cluster's node pool so the application can run.
Operation Process (Example: a Qwen3‑8B Inference Service)
Model preparation: upload a custom or open‑source model (e.g., Qwen3‑8B) to OSS, optionally using multi‑region OSS management (see the first sketch after these steps for mounting it into pods).
Environment preparation: create a fleet instance, associate two (or more) regional ACK clusters with it, enable instant elasticity, and create GPU node pools.
Create the inference application: submit the workload to the fleet, whose scheduler allocates replicas to clusters with available GPU inventory (second sketch below).
Elastic verification: after scheduling, GPU nodes are automatically added to the target cluster's node pool; scaling the application down triggers node pool shrinkage after a configurable cooldown interval.
Service exposure: configure an ALB multi‑cluster gateway to expose the cross‑region inference service behind a single entry point (third sketch below).
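The sketches below flesh out steps 1, 3, and 5. First, model preparation: one common way to make the OSS‑hosted model visible to pods is a statically provisioned volume through the ACK OSS CSI driver. This assumes a Secret named oss-secret holding the AccessKey pair; the bucket name, endpoint, and path are placeholders.

```yaml
# Static provisioning of the OSS bucket through the ACK OSS CSI driver.
# Assumes a Secret "oss-secret" (keys akId/akSecret) in the default
# namespace; bucket, endpoint, and path are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-oss-pv
  labels:
    alicloud-pvname: model-oss-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss-pv                     # must match the PV name
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: my-model-bucket                      # placeholder bucket
      url: oss-cn-beijing-internal.aliyuncs.com    # placeholder endpoint
      path: /models                                # directory holding Qwen3-8B
      otherOpts: "-o umask=022 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-oss-pvc
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""              # bind to the static PV above
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: model-oss-pv
```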
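Next, the inference application itself, as a sketch assuming a vLLM‑based OpenAI‑compatible server and the model-oss-pvc claim from the previous sketch; the image tag and model path are placeholders. Note that nothing in the spec is fleet‑specific: the plain nvidia.com/gpu request is exactly what the fleet scheduler matches against per‑cluster GPU inventory.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3-8b
spec:
  replicas: 4
  selector:
    matchLabels:
      app: qwen3-8b
  template:
    metadata:
      labels:
        app: qwen3-8b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.5     # placeholder tag; any recent vLLM image
          args:
            - --model=/models/Qwen3-8B       # served from the OSS mount
            - --served-model-name=qwen3-8b
          ports:
            - containerPort: 8000            # vLLM's default HTTP port
          resources:
            limits:
              nvidia.com/gpu: 1   # matched against per-cluster GPU inventory
          volumeMounts:
            - name: model
              mountPath: /models
              readOnly: true
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: model-oss-pvc         # from the previous sketch
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3-8b
spec:
  selector:
    app: qwen3-8b
  ports:
    - port: 8000
      targetPort: 8000
```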
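Finally, service exposure. The Ingress below is standard ALB usage on ACK and routes to the Service from the previous sketch; the per‑cluster traffic weight annotations are illustrative placeholders, so confirm the exact key in the ACK One multi‑cluster gateway documentation.

```yaml
# Standard ALB Ingress on ACK; the cluster-weight annotations are an
# illustrative placeholder for ACK One's multi-cluster traffic split
# (cluster IDs here are fake).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: qwen3-8b
  annotations:
    alb.ingress.kubernetes.io/cluster-weight.c1a2b3c4: "50"
    alb.ingress.kubernetes.io/cluster-weight.c5d6e7f8: "50"
spec:
  ingressClassName: alb
  rules:
    - host: qwen.example.com        # placeholder domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: qwen3-8b      # the Service from the previous sketch
                port:
                  number: 8000
```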
Summary
By leveraging the fleet's elastic scheduling, enterprises can build a next‑generation distributed inference architecture that intelligently scales across regions, reduces compute costs, and provides built‑in high availability through multi‑cluster disaster recovery.