Optimizing AI GPU Utilization with Multi‑Cluster Priority Scheduling on ACK One
In the era of large AI models, ACK One’s multi‑cluster fleet provides inventory‑aware elastic scheduling, cluster‑level priority dispatch, and hybrid‑cloud strategies to maximize GPU utilization, ensure business continuity, and reduce costs across regions and on‑premises data centers.
Why Multi‑Cluster Scheduling Is Needed
AI workloads now demand massive GPU resources, but GPUs are often unevenly distributed, scarce, expensive, and subject to compliance constraints. Without global scheduling, AI tasks may wait in one cluster while GPUs sit idle elsewhere.
ACK One Fleet Overview
ACK One is Alibaba Cloud’s enterprise‑grade multi‑cluster management solution that intelligently schedules AI inference services across clusters, improving GPU utilization and providing an end‑to‑end management plane for AI workloads.
Key Multi‑Cluster Scenarios
Cross‑Region Multi‑ACK Clusters : Use the primary region’s compute first, fall back to secondary regions when capacity runs out, and release the standby regions’ resources first when scaling down.
Hybrid‑Cloud Multi‑Cluster (IDC K8s + ACK) : Fill on‑premise IDC GPU capacity first, then supplement with cloud GPU to ensure cost‑effective, compliant, and scalable AI services.
Core Scheduling Capabilities
Inventory‑Aware Elastic Scheduling : Global Scheduler combines with ACK node instant elasticity to allocate GPU based on real‑time inventory across regions and clouds.
Cluster‑Level Priority Scheduling : Deploy AI services to higher‑priority clusters first; if capacity is insufficient, lower‑priority clusters are used. Replicas can be split across clusters according to priority.
Workload Preemption : High‑priority workloads can preempt lower‑priority ones using PriorityClass.
Partial Replica Success : For Deployments, when GPU inventory cannot satisfy every replica, the replicas that can be scheduled still run, so available resources are not left idle.
Dynamic Resource Scheduling : Global Scheduler detects free resources and distributes replicas based on weighted policies.
Static Weight Scheduling : Administrators assign weight coefficients to clusters for proportional replica allocation.
Rescheduling : Pending pods are automatically rescheduled to clusters with sufficient resources.
Multi‑Cluster HPA : Horizontal Pod Autoscaler operates across clusters using metrics from each sub‑cluster.
Multi‑Cluster Gray Release : Implements staged rollouts via Kruise Rollouts.
Cross‑Region Model Distribution : Uses OCI Image acceleration for fast, reliable model propagation.
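The workload preemption and priority dispatch capabilities above build on standard Kubernetes PriorityClass objects. The sketch below (class name, image, and Deployment name are illustrative, not from the source) shows a high‑priority class referenced from an inference Deployment’s pod template:

```yaml
# Cluster-scoped priority class; a larger value means higher priority.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-high              # illustrative name
value: 1000000
globalDefault: false
description: "High priority for latency-sensitive AI inference workloads."
---
# Referencing the class from a Deployment's pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference              # illustrative name
  namespace: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      priorityClassName: inference-high  # allows preemption of lower-priority pods
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 1
```

With this in place, a scheduler that supports preemption can evict lower‑priority pods to make room for these replicas when GPUs are scarce.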
AI Job Support
Supports PyTorchJob, TFJob, SparkApplication, Argo Workflow, etc.
Multi‑Cluster Gang Scheduling : Guarantees that all pods of a job are scheduled together across clusters.
Quota Management : ElasticQuotaTree provides namespace‑level resource limits and dynamic sharing.
Priority‑Based Scheduling : Pods inherit priority from PriorityClass definitions.
Job Rescheduling After Failure : Failed jobs are reclaimed and re‑dispatched to other suitable clusters.
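The quota management described above is expressed as an ElasticQuotaTree. The fragment below is a hedged sketch assuming the ElasticQuotaTree format used by ACK’s scheduler extension (leaf names, namespaces, and quota values are illustrative; verify field names against the current ACK One documentation):

```yaml
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system
spec:
  root:
    name: root
    max:                       # total GPUs the tree may consume
      nvidia.com/gpu: 8
    min:                       # guaranteed GPUs for the tree
      nvidia.com/gpu: 8
    children:
      - name: training         # illustrative leaf
        namespaces:
          - team-training
        max:
          nvidia.com/gpu: 6
        min:
          nvidia.com/gpu: 2
      - name: inference        # illustrative leaf
        namespaces:
          - team-inference
        max:
          nvidia.com/gpu: 6
        min:
          nvidia.com/gpu: 2
```

Each leaf’s `min` is its guaranteed share, while capacity between `min` and `max` can be borrowed dynamically when the sibling namespace is idle.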
Configuration Example
apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: vllm-deploy-pp
  namespace: test
spec:
  autoScaling:
    # Enable inventory awareness
    ecsProvision: true
  placement:
    # Define cluster priority
    clusterAffinities:
      - affinityName: ack-region1
        clusterNames:
          - ${Cluster 1 ID}
          - ${Cluster 2 ID}
      - affinityName: ack-region2
        clusterNames:
          - ${Cluster 3 ID}
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        dynamicWeight: AvailableReplicas
  preserveResourcesOnDeletion: false
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      namespace: test
  schedulerName: default-scheduler
Defining Cluster Groups and Priorities
Schedule to high‑priority clusters first; if they lack capacity, fall back to lower‑priority clusters. Within a group, clusters share the same priority.
For Deployments with many replicas, allocate as many as possible to high‑priority clusters before using lower‑priority ones.
During scale‑down, reduce replicas in lower‑priority clusters first to release their resources.
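For the static weight scheduling capability mentioned earlier, the `dynamicWeight` field in the configuration example would be replaced by a fixed weight list. This is a hedged sketch assuming the PropagationPolicy’s `weightPreference` supports a Karmada‑style `staticWeightList` (the exact field shape should be verified against ACK One’s documentation):

```yaml
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        # Fixed 2:1 split: two replicas land in Cluster 1
        # for every one replica placed in Cluster 3.
        staticWeightList:
          - targetCluster:
              clusterNames:
                - ${Cluster 1 ID}
            weight: 2
          - targetCluster:
              clusterNames:
                - ${Cluster 3 ID}
            weight: 1
```

Static weights give administrators deterministic, proportional replica allocation, whereas `dynamicWeight: AvailableReplicas` lets the Global Scheduler split replicas according to real‑time free capacity.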
Conclusion
ACK One’s fleet delivers a comprehensive multi‑cluster scheduling framework for AI workloads, combining inventory‑aware elasticity, priority‑based dispatch, hybrid‑cloud cost optimization, and advanced features such as multi‑cluster HPA and gray releases, thereby maximizing GPU utilization and ensuring continuous, cost‑effective AI services.