How Kimi Scaled AI Agents with Alibaba Cloud’s Elastic Sandbox Architecture
Kimi built a high‑performance, low‑cost AI Agent infrastructure by combining Alibaba Cloud ACK node pools with the ACS Agent Sandbox. The design addresses instant sandbox response, state continuity, massive concurrency, cost efficiency, security isolation, and search‑memory integration for production‑grade agents.
Kimi previously launched several AI Agent capabilities such as Deep Research, Agentic PPT, OK Computer, and Data Analysis, which required handling tens of thousands of concurrent user requests and massive, isolated compute resources during both online service and model training phases.
Key Challenges
Challenge 1: Instant sandbox response – Sandboxes must start within seconds and strongly isolate unverified code, without the multi‑minute startup times of traditional VMs or containers.
Challenge 2: State continuity and scheduling pressure – Long‑running agents need pause/resume capabilities, and the system must schedule hundreds of thousands of pods without resource contention.
Challenge 3: Cost‑effective massive concurrency – Provisioning resources for peak loads leads to waste; elastic, on‑demand scheduling is required to keep costs low.
Solution Architecture
Kimi partnered with Alibaba Cloud, using Alibaba Cloud Container Service for Kubernetes (ACK) node pools and the ACS Agent Sandbox as the core of an end‑to‑end Agent Infra platform.
ACK node pools provide instant elasticity across multiple AZs and instance types, with custom images and data‑disk snapshots that cut node‑initialization time by over 60%.
ENI pre‑allocation via the Terway network plugin eliminates network‑ready delays, enabling rapid pod startup.
ACS Agent Sandbox Features
Built on lightweight MicroVM technology, reducing virtualization overhead by ~90% and starting sandboxes within seconds.
Resource pre‑scheduling and image‑cache snapshots accelerate instance creation; burst quota allows temporary CPU/Memory scaling during startup, cutting Python sandbox launch time by >60%.
State‑preserving sleep/wake mechanism stores memory and disk data, enabling instant restoration and cloning for reinforcement‑learning (RL) branch exploration.
Checkpoint cloning creates thousands of identical sandbox instances in seconds, eliminating repeated initialization for Monte‑Carlo Tree Search.
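The cloning idea can be illustrated with a toy, in‑process analogy. This is not the ACS Agent Sandbox API: ToySandbox and clone_from_checkpoint are hypothetical names, and copy.deepcopy merely stands in for a MicroVM memory and disk snapshot. The sketch only shows why branching from a checkpoint avoids repeating initialization for every branch.

```python
import copy


class ToySandbox:
    """Hypothetical stand-in for a sandbox; not a real ACS object."""

    init_count = 0  # how many times the expensive setup has run

    def __init__(self):
        # Imagine image pulls, dependency installs, and warm-up here.
        ToySandbox.init_count += 1
        self.files = {"/workspace/state.txt": "ready"}
        self.history = []

    def step(self, action):
        """Record an action taken inside this sandbox."""
        self.history.append(action)


def clone_from_checkpoint(checkpoint, n):
    """Return n independent copies of an already-initialized sandbox.

    deepcopy duplicates the object's state without calling __init__,
    so the expensive setup is not repeated for each clone.
    """
    return [copy.deepcopy(checkpoint) for _ in range(n)]
```

In a tree search, each clone would explore a different action from the same starting state. Because the copies are independent, a step taken in one branch does not leak into the checkpoint or into sibling branches.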
Mixed Compute Scheduling
An ACK ResourcePolicy defines a tiered scheduling strategy: a reserved baseline node pool handles normal load, and excess pods overflow to a Serverless pool (ACS Agent Sandbox) when the pending queue exceeds a threshold (e.g., 500 pods) or wait time exceeds 30 s. This balances cost, elasticity, and stability.
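The overflow rule can be sketched as a small decision function. This is a minimal illustration, not the ResourcePolicy resource itself: OverflowPolicy, choose_pool, and the parameter names are hypothetical, and only the two thresholds (500 pods, 30 s) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverflowPolicy:
    # Thresholds quoted in the text; the field names are assumptions.
    max_queue_length: int = 500
    max_wait_seconds: float = 30.0


def choose_pool(policy, free_baseline_slots, queue_length, oldest_wait_seconds):
    """Decide where the next pending pod should go.

    Returns "baseline" while the reserved pool has room, "serverless"
    once either overflow threshold is crossed, and "wait" otherwise.
    """
    if free_baseline_slots > 0:
        return "baseline"
    if (queue_length > policy.max_queue_length
            or oldest_wait_seconds > policy.max_wait_seconds):
        return "serverless"
    return "wait"


policy = OverflowPolicy()
# Baseline is full and the queue is past 500 pods, so the pod overflows.
decision = choose_pool(policy, free_baseline_slots=0,
                       queue_length=650, oldest_wait_seconds=4.0)
```

Preferring the baseline pool whenever it has capacity keeps the cheaper reserved nodes busy; the Serverless tier only absorbs the excess once a backlog has built up.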
Scheduler and API Server Optimizations
Parameter tuning increases queue depth and per‑pod processing speed, supporting hundreds of pod scheduling decisions per second in clusters on the order of ten thousand nodes.
Pod‑affinity caching and parallel dispatch reduce duplicate scheduling overhead.
Control‑plane components (etcd, API Server, kube-controller-manager (KCM), Scheduler) are deployed across multiple AZs with end‑to‑end parameter optimizations for rapid scaling.
Security Isolation
MicroVM provides hardware‑level isolation for each agent task.
NetworkPolicy enforces namespace and port isolation; Terway enhancements ensure policy scalability.
Per‑agent storage volumes or sub‑directories with ACL/POSIX permissions guarantee data isolation on shared NAS.
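As a rough sketch of the sub‑directory approach, the helper below creates one owner‑only directory per agent under a shared root. The function name, ID format, and directory layout are assumptions for illustration; the article does not describe Kimi's actual layout. Owner‑only permissions isolate agents only if each agent runs under its own UID.

```python
import os
import re
from pathlib import Path

# Hypothetical ID format: letters, digits, underscore, hyphen.
_AGENT_ID = re.compile(r"[A-Za-z0-9_-]+")


def agent_workdir(root: Path, agent_id: str) -> Path:
    """Create (or reuse) a private directory for one agent under root."""
    # Reject IDs that could escape the root, such as "../other-agent".
    if not _AGENT_ID.fullmatch(agent_id):
        raise ValueError(f"invalid agent id: {agent_id!r}")
    path = root / agent_id
    path.mkdir(mode=0o700, exist_ok=True)
    # mkdir's mode is filtered by the process umask, so set it explicitly.
    os.chmod(path, 0o700)
    return path
```

Mode 0o700 grants read, write, and traverse rights to the owning user only, which is the POSIX‑permission half of the scheme; ACLs on the NAS would be configured separately.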
Search and Memory Backend
Kimi uses Alibaba Cloud Lindorm, a multi‑model database that integrates wide‑table, search, vector, and AI engines. It offers dual recall that fuses full‑text and vector results with Reciprocal Rank Fusion (RRF), deep compression (30‑50% storage savings), and seamless data flow without custom sync pipelines.
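RRF‑based dual recall can be illustrated with the standard Reciprocal Rank Fusion formula, where each document scores the sum of 1 / (k + rank) over the result lists it appears in. The sketch below is generic Python, not Lindorm's API: rrf_fuse and the document IDs are hypothetical, and k = 60 is the conventional default from the RRF literature, not a value the article confirms for Lindorm.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    rankings: iterable of lists, each ordered best-first.
    Returns document IDs sorted by fused score, best first;
    ties are broken by ID so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


full_text_hits = ["a", "b", "c"]  # hypothetical keyword-search results
vector_hits = ["b", "d", "a"]     # hypothetical vector-search results
print(rrf_fuse([full_text_hits, vector_hits]))  # ['b', 'a', 'd', 'c']
```

Because RRF uses only rank positions, it needs no score normalization between the full‑text and vector engines, which is one reason it is a common choice for hybrid retrieval.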
Results
The combined ACK + ACS solution delivers stable, developer‑friendly infrastructure that creates tens of thousands of sandbox instances per minute, halves startup latency, and dramatically lowers total cost of ownership. It supports production‑grade AI agents with continuous state, fast cloning for RL, robust security, and scalable search/memory services.
