
How Alibaba Cloud’s Container Stack Evolves for the AI Era

Alibaba Cloud’s container experts unveiled a comprehensive, AI-focused upgrade across its cloud-native stack, introducing AMD compute, dynamic scaling, AI-native scheduling, secure execution environments, and advanced GPU profiling. The goal: make containers the native foundation for AI workloads and accelerate enterprise AI adoption.

Alibaba Cloud Infrastructure

AI Era Drives Container Infrastructure Upgrade

At the 2025 Cloud Expo, Alibaba Cloud container specialist An Shaofei announced that 60% of new CPU workloads and nearly 100% of GPU workloads now run on containers, confirming containers as the native foundation for AI.

From Cloud‑Native to AI‑Native Business Trends

Traditional cloud‑native tech served web‑centric, homogeneous workloads (large‑scale CPU, small‑scale GPU). AI introduces heterogeneous demands (large‑scale CPU + large‑scale GPU) and model‑driven agents.

AI‑Native Application Paradigm

Conventional apps are instruction‑centric and static; AI‑native apps are goal‑centric, dynamic, with agents that plan and act, requiring distributed compute, continuous tasks, security, and compliance.

Full‑Stack Container Upgrade for AI

Alibaba Cloud upgraded three core areas: cloud‑native infrastructure, AI application runtime, and AI application operation & scheduling.

Cloud‑Native Infrastructure: Simplicity and Integration

ACS now offers AMD general-purpose compute, dynamic vertical pod autoscaling (VPA), and flexible CPU-to-memory ratios (1:1 to 1:8). ACK adds an intelligent managed mode that simplifies capacity management, improves utilization, and supports massive elastic scaling.

ACS Container Compute Service: global AMD compute, 0.5 vCPU / 1 GiB granularity, and 1:1 to 1:8 CPU-to-memory ratios, supporting big data, gaming, industrial simulation, and microservices.

ACS GPU Compute: per-second billing, card-level pricing, rich observability, and self-healing.

ACK Intelligent Managed Mode: one-click activation of best practices, elastic node pools, automatic upgrades, API Priority and Fairness (APF) throttling, and 80% faster startup.

ACK Pro Hybrid Cloud Node Pool: unified management of multi-region cloud and on-premises resources with enterprise-grade stability and security.
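As a quick illustration of the ACS granularity and ratio rules above, the following is a minimal sketch. The 0.5 vCPU / 1 GiB granularity and 1:1 to 1:8 range come from the figures quoted in the article; the validator function itself is our own illustration, not an ACS API.

```python
# Hypothetical sketch: validate a resource request against the documented
# 0.5 vCPU granularity and 1:1-1:8 CPU-to-memory ratio range. The constants
# mirror the article's figures; the function name is illustrative.

def is_valid_request(vcpu: float, mem_gib: float) -> bool:
    """Return True if (vcpu, mem_gib) fits the stated flexibility rules."""
    if vcpu <= 0 or mem_gib <= 0:
        return False
    # vCPU must be a multiple of the 0.5 vCPU granularity.
    if round(vcpu / 0.5) * 0.5 != vcpu:
        return False
    # Memory-to-CPU ratio must fall within 1:1 to 1:8 (GiB per vCPU).
    ratio = mem_gib / vcpu
    return 1.0 <= ratio <= 8.0

print(is_valid_request(0.5, 1))   # True: minimum granularity, 1:2 ratio
print(is_valid_request(2, 20))    # False: 1:10 ratio exceeds 1:8
print(is_valid_request(0.3, 1))   # False: below 0.5 vCPU granularity
```

A real control plane would enforce such limits at admission time; this only shows the arithmetic the quoted figures imply.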

AI Application Runtime: Trust and Control

The runtime provides a full-stack trusted AI environment with hardware-level encryption, isolated execution, and compliance support for finance and personal data. It features a trusted software supply chain (signing, SBOM, SLSA) and a trusted runtime (hardware isolation for CPU/GPU, remote attestation).
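To make the supply-chain checks concrete, here is a hypothetical admission-style gate covering the three properties the article names (signature, SBOM, SLSA provenance). Every function name and data shape here is illustrative only, not an actual ACK or ACS API.

```python
# Hypothetical sketch of a supply-chain admission gate. The checks mirror the
# trusted-supply-chain features named in the article; nothing here is a real
# Alibaba Cloud API.

from dataclasses import dataclass

@dataclass
class ImageMetadata:
    signature_valid: bool  # e.g. verified against a trusted signing key
    has_sbom: bool         # a software bill of materials is attached
    slsa_level: int        # provenance level claimed by the builder

def admit(image: ImageMetadata, min_slsa_level: int = 2) -> tuple[bool, str]:
    """Admit the image only if all supply-chain checks pass."""
    if not image.signature_valid:
        return False, "rejected: unsigned or signature verification failed"
    if not image.has_sbom:
        return False, "rejected: no SBOM attached"
    if image.slsa_level < min_slsa_level:
        return False, f"rejected: SLSA level {image.slsa_level} < {min_slsa_level}"
    return True, "admitted"

ok, reason = admit(ImageMetadata(signature_valid=True, has_sbom=True, slsa_level=3))
print(ok, reason)  # True admitted
```

In practice such a gate would sit in an admission webhook, rejecting workloads before they ever reach the trusted runtime.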

AI Application Operations & Scheduling: Efficiency and Stability

GPU fault detection and self-healing isolate faulty GPUs and nodes and recover them automatically.
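The detect-isolate-recover loop can be sketched as follows. Health probes and node actions are stubbed with invented data; this is a sketch of the pattern the article describes, not an ACK implementation.

```python
# Hypothetical sketch of a GPU fault-detection/self-healing loop: detect an
# unhealthy GPU, cordon the node so nothing new schedules onto it, and flag
# it for recovery. All names and data shapes are illustrative.

def check_gpu_health(node: dict) -> bool:
    """Stub probe: a node is healthy if none of its GPUs report errors."""
    return all(not gpu["xid_error"] for gpu in node["gpus"])

def reconcile(nodes: list[dict]) -> list[str]:
    """One pass of the loop: cordon unhealthy nodes, return actions taken."""
    actions = []
    for node in nodes:
        if node["cordoned"]:
            continue  # already isolated; recovery is in progress
        if not check_gpu_health(node):
            node["cordoned"] = True  # stop scheduling onto this node
            actions.append(f"cordon {node['name']} and trigger GPU reset")
    return actions

nodes = [
    {"name": "gpu-node-1", "cordoned": False, "gpus": [{"xid_error": False}]},
    {"name": "gpu-node-2", "cordoned": False, "gpus": [{"xid_error": True}]},
]
print(reconcile(nodes))  # ['cordon gpu-node-2 and trigger GPU reset']
```

A production controller would run this continuously and uncordon nodes once recovery is verified.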

Online real-time GPU profiling requires no code changes and provides timeline data, bottleneck analysis, and flame-graph visualization.
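As background on the flame-graph output mentioned here: profilers typically fold raw timeline samples into the "collapsed stack" format, where identical call stacks are merged and their sample counts summed. The sample data below is invented for illustration.

```python
# Sketch of folding stack samples into the collapsed-stack format that
# flame-graph tools consume: one line per unique stack, frames joined by
# semicolons, followed by the sample count.

from collections import Counter

def fold_stacks(samples: list[list[str]]) -> list[str]:
    """Collapse raw stack samples into 'frame;frame;... count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]

samples = [
    ["main", "train_step", "matmul_kernel"],
    ["main", "train_step", "matmul_kernel"],
    ["main", "data_loader", "decode"],
]
for line in fold_stacks(samples):
    print(line)
# main;data_loader;decode 1
# main;train_step;matmul_kernel 2
```

Wider boxes in the resulting flame graph correspond to stacks with higher counts, which is how bottlenecks become visible at a glance.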

The AI inference suite includes ACK Gateway, model-aware routing, role-based groups, auto-scaling, and cold-start acceleration.

Overall, the full‑stack upgrade positions containers as the native runtime for AI, delivering simplified infrastructure, secure and controllable environments, and robust, scalable operations to accelerate enterprise AI development.

Tags: GPU scheduling, AI infrastructure, container computing, secure AI runtime