How Kimi’s AI Agent Scales on Alibaba Cloud – Architecture, Elastic Sandbox, and Cost Optimisation
The article analyses how Kimi’s AI Agent workloads are deployed on Alibaba Cloud using ACK and the ACS Agent Sandbox, detailing the challenges of massive concurrency, rapid sandbox start‑up, state continuity, cost‑effective scaling, and the security and scheduling mechanisms that enable production‑grade performance.
Background
Kimi has turned its Agent capabilities into concrete products such as “Deep Research”, “Agentic PPT”, “OK Computer” and data‑analysis agents. During peak periods the consumer‑facing Agent service handles tens of thousands of concurrent requests, each requiring its own isolated compute resources to preserve user experience. Model training adds further demand: reinforcement learning and data synthesis need large numbers of isolated compute environments that are started and stopped frequently.
Challenges
Challenge 1 – Instant‑response sandbox: Traditional VMs, and containers that must wait for a new node or a large image pull, can take minutes to start, which is unacceptable for bursty, latency‑sensitive Agent traffic. The sandbox must also provide strong isolation because agents execute unverified code generated by large models.
Challenge 2 – State continuity and scheduling pressure: Long‑running agents need session‑level state persistence and fast recovery after a pause. Large numbers of concurrent users also put heavy scheduling pressure on the cluster.
Challenge 3 – Cost‑effective massive concurrency: Provisioning enough resources for peak load leaves most of them idle off‑peak; the system must allocate resources elastically on demand while keeping costs low.
Solution Architecture
Alibaba Cloud and Kimi co‑designed an end‑to‑end Agent Infra built around ACK (Alibaba Cloud Container Service for Kubernetes) node‑pool elasticity and the ACS Agent Sandbox (a MicroVM‑based sandbox).
ACK node‑pool elasticity: Node pools span multiple AZs and instance types (general‑purpose, compute‑optimized, storage‑optimized). Real‑time load drives automatic instance‑type selection, avoiding single‑AZ resource shortages and improving utilization. Custom images and data‑disk snapshots reduce node‑initialisation time by more than 60% and bring cold‑start latency from minutes to seconds.
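The idea of spreading a scale‑out across zones and instance types can be sketched in a few lines of Python. This is only an illustration of the fallback logic, not ACK's node‑pool algorithm: the `Offer` type, the zone and instance‑type names, and the stock figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    zone: str           # hypothetical zone name
    instance_type: str  # hypothetical instance-type name
    vcpus: int          # vCPUs per instance
    in_stock: int       # instances currently available (made-up inventory)


def plan_scale_out(offers, needed_vcpus):
    """Fill the vCPU demand from offers in preference order, skipping sold-out ones.

    Returns (plan, unmet_vcpus), where plan is a list of
    (zone, instance_type, instance_count) tuples.
    """
    plan = []
    remaining = needed_vcpus
    for offer in offers:
        if remaining <= 0:
            break
        if offer.in_stock == 0:
            continue  # this zone/type is exhausted, fall through to the next
        wanted = -(-remaining // offer.vcpus)  # ceiling division
        count = min(offer.in_stock, wanted)
        plan.append((offer.zone, offer.instance_type, count))
        remaining -= count * offer.vcpus
    return plan, max(remaining, 0)


offers = [
    Offer("zone-a", "general-8c", vcpus=8, in_stock=0),
    Offer("zone-b", "general-8c", vcpus=8, in_stock=3),
    Offer("zone-b", "compute-16c", vcpus=16, in_stock=10),
]
plan, unmet = plan_scale_out(offers, needed_vcpus=64)
```

With the preferred zone sold out, the demand is met from the second zone using two instance types, which is the shortage‑avoidance behaviour the paragraph describes.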
ACS Agent Sandbox provides:
MicroVM technology that cuts virtualization overhead by ~90 % and enables thousands of sandbox instances to start within seconds.
Image‑cache via cloud‑disk snapshots to avoid repeated image pulls.
Quota hot‑update that temporarily raises CPU and memory quotas several‑fold during the first seconds of start‑up, shrinking Python sandbox launch time by >60 % while keeping costs under control.
State‑preserving “sleep‑wake‑clone” mechanism: sandbox memory and disk are persisted on sleep, allowing second‑level wake‑up and instant cloning for reinforcement‑learning branch exploration.
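The sleep, wake, and clone operations in the last item can be modelled with a toy in‑memory snapshot store. This is a conceptual sketch only: the real mechanism persists MicroVM memory and disk, whereas here both are plain dictionaries, and `Sandbox`, `SnapshotStore`, and all method names are hypothetical.

```python
import copy
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    RUNNING = "running"
    SLEEPING = "sleeping"


@dataclass
class Sandbox:
    """Toy stand-in for a sandbox; memory and disk are plain dicts."""
    name: str
    memory: dict = field(default_factory=dict)
    disk: dict = field(default_factory=dict)
    state: State = State.RUNNING


class SnapshotStore:
    """Hypothetical store keeping one persisted snapshot per sandbox name."""

    def __init__(self):
        self._snapshots = {}

    def sleep(self, sandbox):
        # Persist memory and disk, then release the live state.
        self._snapshots[sandbox.name] = (
            copy.deepcopy(sandbox.memory),
            copy.deepcopy(sandbox.disk),
        )
        sandbox.memory, sandbox.disk = {}, {}
        sandbox.state = State.SLEEPING

    def wake(self, sandbox):
        # Restore exactly what was persisted at sleep time.
        memory, disk = self._snapshots[sandbox.name]
        sandbox.memory = copy.deepcopy(memory)
        sandbox.disk = copy.deepcopy(disk)
        sandbox.state = State.RUNNING

    def clone(self, source_name, new_name):
        # Each clone gets an independent copy, so branches cannot interfere.
        memory, disk = self._snapshots[source_name]
        return Sandbox(new_name, copy.deepcopy(memory), copy.deepcopy(disk))


store = SnapshotStore()
base = Sandbox("episode-0", memory={"step": 3}, disk={"/tmp/out.txt": "partial"})
store.sleep(base)

# Two exploration branches start from the same persisted state.
branch_a = store.clone("episode-0", "branch-a")
branch_b = store.clone("episode-0", "branch-b")
branch_a.memory["step"] = 4

store.wake(base)
```

Because each clone is a deep copy of the snapshot, advancing one branch leaves the other branch and the original untouched, which is the property reinforcement‑learning branch exploration relies on.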
ResourcePolicy‑driven tiered scheduling: ACK’s ResourcePolicy creates a two‑tier compute pool, with a baseline of always‑on nodes for normal traffic and a Serverless pool (AMD EPYC‑based elastic instances) for bursty Agent tasks. When the pending‑pod queue exceeds a threshold (e.g., 500 pods) or pods have waited longer than 30 s, excess pods overflow to the Serverless pool, giving a several‑fold capacity increase at lower unit cost.
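The overflow rule can be written as a small decision function. This is a sketch of the rule as stated above, not the ResourcePolicy API: `OverflowRule`, `choose_pool`, and the pool names are hypothetical, and the defaults simply reuse the example thresholds of 500 pods and 30 s.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverflowRule:
    max_queue_length: int = 500     # pending pods tolerated on the baseline pool
    max_wait_seconds: float = 30.0  # longest tolerated wait for a pending pod


def choose_pool(queue_length, oldest_wait_seconds, rule=OverflowRule()):
    """Return which pool the next pending pod should be placed on."""
    overloaded = (
        queue_length > rule.max_queue_length
        or oldest_wait_seconds > rule.max_wait_seconds
    )
    return "serverless" if overloaded else "baseline"


normal = choose_pool(queue_length=120, oldest_wait_seconds=2.0)
long_queue = choose_pool(queue_length=501, oldest_wait_seconds=2.0)
slow_start = choose_pool(queue_length=10, oldest_wait_seconds=31.0)
```

Either condition alone triggers the overflow, so a short queue of pods that have been stuck for a long time still spills to the elastic pool.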
Scheduler and API‑Server optimisation: ACK tunes scheduler queue depth, caches scheduling decisions for similar pods, and parallelises dispatch, delivering several times the throughput of upstream Kubernetes. Control‑plane components (etcd, the API server, kube‑controller‑manager (KCM), etc.) are deployed across multiple AZs and scale elastically to sustain tens of thousands of pod creations per minute.
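Caching decisions for similar pods can be illustrated with a toy scheduler that remembers which node last fit a given pod shape and retries it before scanning every node. The source does not describe ACK's actual cache, so everything here is an assumption: the `CachingScheduler` class, the signature made of CPU, memory, and node selector, and the first‑fit placement.

```python
def pod_signature(pod):
    """Pods with equal requests and node selector share one signature."""
    selector = tuple(sorted(pod.get("node_selector", {}).items()))
    return (pod["cpu"], pod["memory_mib"], selector)


class CachingScheduler:
    """Toy first-fit scheduler with a per-signature decision cache."""

    def __init__(self, nodes):
        self.nodes = nodes    # name -> {"cpu", "memory_mib", "labels"}
        self._cache = {}      # signature -> node that fit last time
        self.full_scans = 0   # how often every node had to be examined

    def _fits(self, node, pod):
        selector = pod.get("node_selector", {})
        return (
            node["cpu"] >= pod["cpu"]
            and node["memory_mib"] >= pod["memory_mib"]
            and all(node["labels"].get(k) == v for k, v in selector.items())
        )

    def schedule(self, pod):
        sig = pod_signature(pod)
        cached = self._cache.get(sig)
        if cached is not None and self._fits(self.nodes[cached], pod):
            chosen = cached  # cache hit: no scan needed
        else:
            self.full_scans += 1
            chosen = next(
                (name for name, node in self.nodes.items() if self._fits(node, pod)),
                None,
            )
            if chosen is None:
                self._cache.pop(sig, None)
                return None  # unschedulable for now
            self._cache[sig] = chosen
        # Reserve the resources on the chosen node.
        self.nodes[chosen]["cpu"] -= pod["cpu"]
        self.nodes[chosen]["memory_mib"] -= pod["memory_mib"]
        return chosen


nodes = {
    "node-a": {"cpu": 4, "memory_mib": 8192, "labels": {}},
    "node-b": {"cpu": 4, "memory_mib": 8192, "labels": {}},
}
scheduler = CachingScheduler(nodes)
sandbox_pod = {"cpu": 1, "memory_mib": 1024}
placements = [scheduler.schedule(dict(sandbox_pod)) for _ in range(6)]
```

Agent sandboxes tend to be near‑identical pods, so six placements here need only two full scans: one at the start and one when the cached node fills up.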
Security isolation: Each sandbox runs in a MicroVM providing hardware‑level isolation. NetworkPolicy, Fluid, and namespace isolation enforce pod‑level network and storage segregation. NAS storage is allocated per‑agent with ACL/POSIX controls, achieving logical isolation on shared physical storage.
The Lindorm multi‑model database supplies the search‑and‑memory layer for agents, combining full‑text and vector recall, with the two result lists merged by Reciprocal Rank Fusion (RRF), and cutting storage cost by 30‑50 % through built‑in compression.
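Reciprocal Rank Fusion itself is simple enough to show directly: each document scores the sum of 1 / (k + rank) over every list it appears in. The sketch below is the generic algorithm, not Lindorm's implementation. The constant k = 60 is the value commonly used in the literature and is an assumption here, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant; larger values flatten the gap between ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


full_text_hits = ["doc-a", "doc-b", "doc-c"]  # hypothetical keyword results
vector_hits = ["doc-c", "doc-a", "doc-d"]     # hypothetical embedding results
fused = reciprocal_rank_fusion([full_text_hits, vector_hits])
```

Documents found by both retrievers ("doc-a" and "doc-c") rise above those found by only one, without requiring the full‑text and vector scores to be on a comparable scale.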
Key Benefits
Massive elastic sandbox launch: thousands of sandboxes start within seconds, supporting more than 10,000 concurrent agents.
State continuity and instant cloning accelerate RL training and long‑running tasks.
Cost reduction through tiered compute, hot‑update bursts, and sandbox sleep‑mode.
Enhanced stability: scheduler and API‑Server optimisations sustain high pod‑creation rates without latency spikes.
Strong multi‑tenant security via MicroVM isolation, NetworkPolicy, and per‑agent storage ACLs.
Overall, the Kimi‑Alibaba Cloud collaboration delivers a high‑performance, low‑cost, and secure foundation that powers production‑grade AI Agents, enabling features such as “Deep Research” and “OK Computer” to serve tens of thousands of users with sub‑second responsiveness.
