How Alibaba Cloud’s Container Service Accelerates Enterprise LLM Inference

The article outlines how Alibaba Cloud’s container service has evolved to support large‑scale GPU clusters, AI data pipelines, and the new AI Serving Stack, enabling enterprises to deploy, scale, and manage LLM inference services efficiently while addressing Day0‑Day2 challenges.


Enterprise LLM Service Trends and Challenges

Since 2023, 33% of global enterprises have started LLM proofs of concept, a figure projected to reach 70‑75% by 2025 and 80‑85% worldwide by 2026. Companies now treat LLMs and generative AI as core infrastructure, facing technical challenges across inference, training, deployment, data processing, and operations.

Three Steps to Deploy LLMs

Day0: Model selection and performance evaluation.

Day1: Production deployment and inference operations.

Day2: Integration of model services with existing business workflows.

Most users are in the late Day0 stage and are moving toward Day1, focusing on micro‑service‑based deployment, GPU cluster stability, and cost‑performance balance.

ACK AI Serving Stack

Alibaba Cloud Container Service (ACK) introduced the AI Serving Stack to simplify production‑grade LLM inference. It provides a RoleBasedGroup (RBG) abstraction and a standard API to manage the full lifecycle of inference workloads, including deployment, updates, scaling, scheduling, error handling, observability, and model‑aware routing via the K8s Gateway API.

RBG Core Capabilities

Micro‑service‑style LLM inference management: load balancing, rolling updates, gray releases, auto‑scaling, and observability.

AI data processing acceleration: reduces data‑handling latency during pre‑training, fine‑tuning, and inference.

Fine‑grained AI observability and heterogeneous cluster stability: online profiling, performance bottleneck detection, and automated remediation.

RBG abstracts roles such as Prefill, Decode, Scheduler, AutoScaler, and KVCache, modeling dependencies, affinity, and scaling logic to enable consistent management of distributed inference services.
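To make the role model concrete, a manifest for a prefill/decode disaggregated deployment might look roughly like the sketch below. This is illustrative only: the API group, dependency syntax, and image are assumptions based on common CRD conventions, not the published RBG schema.

```yaml
# Hypothetical RoleBasedGroup manifest — API group and field names are
# illustrative assumptions, not the actual RBG schema.
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: qwen-inference
spec:
  roles:
    - name: prefill            # prefill workers process the prompt
      replicas: 4
      template:
        spec:
          containers:
            - name: worker
              image: example.registry/llm-server:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
    - name: decode             # decode workers stream generated tokens
      replicas: 8
      dependencies: [prefill]  # assumed syntax for modeling role ordering
      template:
        spec:
          containers:
            - name: worker
              image: example.registry/llm-server:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
```

Modeling prefill and decode as separate roles lets each scale independently, since decode throughput typically requires more replicas than prompt processing.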

Intelligent Routing and Load Balancing

Using the Gateway API Inference Extension, RBG implements model‑aware routing based on KVCache usage, queue length, and LoRA adaptation, achieving >70% reduction in time‑to‑first‑token and increasing KVCache utilization from 70% to 90%.
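The Gateway API Inference Extension defines CRDs along these lines; the sketch below follows the upstream v1alpha2 draft, but field names, labels, and the model name are assumptions and may differ in the version ACK ships.

```yaml
# Sketch of Gateway API Inference Extension resources (v1alpha2 draft);
# labels, names, and the endpoint-picker service are illustrative.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen-decode           # assumed label on the serving pods
  extensionRef:
    name: endpoint-picker      # service that scores endpoints by KVCache
                               # usage and queue depth before routing
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen
spec:
  modelName: qwen-chat         # placeholder model name
  criticality: Critical
  poolRef:
    name: qwen-pool
```

The key difference from plain Service load balancing is the extension reference: instead of round‑robin, the gateway consults a picker that knows per‑replica KVCache state, which is what drives the reported time‑to‑first‑token and cache‑utilization gains.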

Data Processing Enhancements

ACK extends scheduling, elasticity, observability, and data acceleration to big‑data workloads, supporting large‑scale Spark, Ray, Flink, and Argo Workflow jobs. It also provides unified task scheduling and resource management for multimodal AI training and inference.

Ray on ACK

Enhanced KubeRay Operator with managed upgrades and dynamic scaling.

Support for gang and capacity scheduling, priority queues.

Stable operation of clusters with up to 8,000 nodes and tens of thousands of cores.

Open‑source History Server for multi‑tenant task tracking.
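A cluster managed by the KubeRay Operator is declared as a `RayCluster` resource; a minimal sketch follows. The image, replica counts, and the scheduler label (a KubeRay hook for gang scheduling via Volcano) are assumptions for illustration, not ACK‑specific values.

```yaml
# Minimal RayCluster for the KubeRay Operator; sizes and image are
# placeholders, and the scheduler label is an assumed gang-scheduling hook.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: data-pipeline
  labels:
    ray.io/scheduler-name: volcano   # assumed: delegates to a batch scheduler
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
  workerGroupSpecs:
    - groupName: workers
      replicas: 16
      minReplicas: 4                 # range the operator can scale within
      maxReplicas: 64
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
```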

Argo Workflow Managed Service

Supports ACK/ACS cloud and hybrid node pools.

Handles tens of thousands of concurrent workflows and hundreds of thousands of compute tasks.

Integrates with OSS, NAS, CPFS and provides multi‑tenant fair scheduling and self‑healing.
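For scale, the pattern is typically fan‑out: one workflow template spawns many parallel tasks. The sketch below uses standard Argo Workflow syntax; the image and shard list are placeholders.

```yaml
# A minimal fan-out Argo Workflow; image and shard IDs are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: preprocess-
spec:
  entrypoint: fan-out
  templates:
    - name: fan-out
      steps:
        - - name: shard                # inner list = steps run in parallel
            template: process
            arguments:
              parameters:
                - name: id
                  value: "{{item}}"
            withItems: [0, 1, 2, 3]    # one task per shard
    - name: process
      inputs:
        parameters:
          - name: id
      container:
        image: python:3.11-slim
        command: [python, -c]
        args: ["print('processing shard {{inputs.parameters.id}}')"]
```

Multiplying such fan‑outs across tenants is what pushes the controller into the hundreds‑of‑thousands‑of‑tasks range, which is where fair scheduling and self‑healing matter.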

Conclusion

Over the past year, Alibaba Cloud's container service has built a full‑stack AI Serving Stack that streamlines LLM inference deployment, performance optimization, and data processing. Open‑source projects such as RBG and Fluid further enable enterprises to adopt large models with zero‑downtime updates, dynamic scaling, and unified observability.

LLM inference · Alibaba Cloud · AI infrastructure · Container Orchestration · GPU scaling · RBG
Written by Alibaba Cloud Infrastructure