Optimizing AI GPU Utilization with Multi‑Cluster Priority Scheduling on ACK One
In the era of large AI models, ACK One’s multi‑cluster fleet provides inventory‑aware elastic scheduling, cluster‑level priority dispatch, and hybrid‑cloud strategies to maximize GPU utilization, ensure business continuity, and reduce costs across regions and on‑premises data centers.
Why Multi‑Cluster Scheduling Is Needed
AI workloads now demand massive GPU resources, but GPUs are often unevenly distributed, scarce, expensive, and subject to compliance constraints. Without global scheduling, AI tasks may wait in one cluster while GPUs sit idle elsewhere.
ACK One Fleet Overview
ACK One is Alibaba Cloud’s enterprise‑grade multi‑cluster management solution that intelligently schedules AI inference services across clusters, improving GPU utilization and providing an end‑to‑end management plane for AI workloads.
Key Multi‑Cluster Scenarios
Cross‑Region Multi‑ACK Clusters : Use the primary region’s compute first, fall back to secondary regions when capacity runs out, and release the standby regions’ resources first when scaling down.
Hybrid‑Cloud Multi‑Cluster (IDC K8s + ACK) : Fill on‑premise IDC GPU capacity first, then supplement with cloud GPU to ensure cost‑effective, compliant, and scalable AI services.
Core Scheduling Capabilities
Inventory‑Aware Elastic Scheduling : Global Scheduler combines with ACK node instant elasticity to allocate GPU based on real‑time inventory across regions and clouds.
Cluster‑Level Priority Scheduling : Deploy AI services to higher‑priority clusters first; if capacity is insufficient, lower‑priority clusters are used. Replicas can be split across clusters according to priority.
Workload Preemption : High‑priority workloads can preempt lower‑priority ones using PriorityClass.
Partial Replica Success : For Deployments, when GPU inventory cannot satisfy every replica, the replicas that can be scheduled still run, so available resources are not left idle.
Dynamic Resource Scheduling : Global Scheduler detects free resources and distributes replicas based on weighted policies.
Static Weight Scheduling : Administrators assign weight coefficients to clusters for proportional replica allocation.
Rescheduling : Pending pods are automatically rescheduled to clusters with sufficient resources.
Multi‑Cluster HPA : Horizontal Pod Autoscaler operates across clusters using metrics from each sub‑cluster.
Multi‑Cluster Gray Release : Implements staged rollouts via Kruise Rollouts.
Cross‑Region Model Distribution : Uses OCI Image acceleration for fast, reliable model propagation.
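The workload preemption and priority dispatch capabilities above build on standard Kubernetes PriorityClass objects. The sketch below (class name, image, and Deployment name are illustrative, not from the source) shows a high‑priority class referenced from an inference Deployment’s pod template:

```yaml
# Cluster-scoped priority class; a larger value means higher priority.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-high              # illustrative name
value: 1000000
globalDefault: false
description: "High priority for latency-sensitive AI inference workloads."
---
# Referencing the class from a Deployment's pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference              # illustrative name
  namespace: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      priorityClassName: inference-high  # allows preemption of lower-priority pods
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 1
```

With this in place, a scheduler that supports preemption can evict lower‑priority pods to make room for these replicas when GPUs are scarce.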
AI Job Support
Supports PyTorchJob, TFJob, SparkApplication, Argo Workflow, etc.
Multi‑Cluster Gang Scheduling : Guarantees that all pods of a job are scheduled together across clusters.
Quota Management : ElasticQuotaTree provides namespace‑level resource limits and dynamic sharing.
Priority‑Based Scheduling : Pods inherit priority from PriorityClass definitions.
Job Rescheduling After Failure : Failed jobs are reclaimed and re‑dispatched to other suitable clusters.
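The quota management described above is expressed as an ElasticQuotaTree. The fragment below is a hedged sketch assuming the ElasticQuotaTree format used by ACK’s scheduler extension (leaf names, namespaces, and quota values are illustrative; verify field names against the current ACK One documentation):

```yaml
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system
spec:
  root:
    name: root
    max:                       # total GPUs the tree may consume
      nvidia.com/gpu: 8
    min:                       # guaranteed GPUs for the tree
      nvidia.com/gpu: 8
    children:
      - name: training         # illustrative leaf
        namespaces:
          - team-training
        max:
          nvidia.com/gpu: 6
        min:
          nvidia.com/gpu: 2
      - name: inference        # illustrative leaf
        namespaces:
          - team-inference
        max:
          nvidia.com/gpu: 6
        min:
          nvidia.com/gpu: 2
```

Each leaf’s `min` is its guaranteed share, while capacity between `min` and `max` can be borrowed dynamically when the sibling namespace is idle.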
Configuration Example
apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: vllm-deploy-pp
  namespace: test
spec:
  autoScaling:
    # Enable inventory awareness
    ecsProvision: true
  placement:
    # Define cluster priority
    clusterAffinities:
      - affinityName: ack-region1
        clusterNames:
          - ${Cluster 1 ID}
          - ${Cluster 2 ID}
      - affinityName: ack-region2
        clusterNames:
          - ${Cluster 3 ID}
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        dynamicWeight: AvailableReplicas
  preserveResourcesOnDeletion: false
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      namespace: test
  schedulerName: default-scheduler
Defining Cluster Groups and Priorities
Schedule to high‑priority clusters first; if they lack capacity, fall back to lower‑priority clusters. Within a group, clusters share the same priority.
For Deployments with many replicas, allocate as many as possible to high‑priority clusters before using lower‑priority ones.
During scale‑down, reduce replicas in lower‑priority clusters first to release their resources.
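For the static weight scheduling capability mentioned earlier, the `dynamicWeight` field in the configuration example would be replaced by a fixed weight list. This is a hedged sketch assuming the PropagationPolicy’s `weightPreference` supports a Karmada‑style `staticWeightList` (the exact field shape should be verified against ACK One’s documentation):

```yaml
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        # Fixed 2:1 split: two replicas land in Cluster 1
        # for every one replica placed in Cluster 3.
        staticWeightList:
          - targetCluster:
              clusterNames:
                - ${Cluster 1 ID}
            weight: 2
          - targetCluster:
              clusterNames:
                - ${Cluster 3 ID}
            weight: 1
```

Static weights give administrators deterministic, proportional replica allocation, whereas `dynamicWeight: AvailableReplicas` lets the Global Scheduler split replicas according to real‑time free capacity.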
Conclusion
ACK One’s fleet delivers a comprehensive multi‑cluster scheduling framework for AI workloads, combining inventory‑aware elasticity, priority‑based dispatch, hybrid‑cloud cost optimization, and advanced features such as multi‑cluster HPA and gray releases, thereby maximizing GPU utilization and ensuring continuous, cost‑effective AI services.