How Alibaba’s Tair KVCache Manager Revolutionizes Enterprise‑Level LLM Cache Management
This article details the architecture and implementation of Tair KVCache Manager, an enterprise-grade service that centralizes KVCache metadata, decouples inference engines from storage, and provides elastic scaling, multi-tenant isolation, high availability, and performance-optimized cache management for large-scale LLM inference workloads.
Background
Agentic AI workloads generate long‑lived, high‑concurrency inference sessions with multi‑turn interactions. Traditional single‑node KVCache designs cannot keep up with the resulting increase in KVCache miss rates and non‑linear growth of Prefill load, especially when contexts span hundreds of thousands of tokens.
Design Goals
The system must provide:
Accurate capacity assessment and dynamic elastic scaling.
Multi‑tenant isolation and high‑availability guarantees.
Version‑aware management for seamless model upgrades.
Cost‑effective operation at PB‑scale storage.
System Architecture
Tair KVCache Manager (Tair KVCM) is a centralized C++ service exposing HTTP and gRPC interfaces. It separates the control plane (metadata, quota, eviction) from the data plane (actual KVCache storage): inference engines read and write KVCache data directly against backend storage, while the manager handles only the metadata.
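To make that split concrete, here is a minimal C++ sketch of the engine-side read path under this separation. The interfaces (KVCacheManagerClient, StorageBackend) and the CacheLocation fields are illustrative assumptions, not the actual Tair KVCM API; only the GetCacheLocation name comes from the metadata APIs listed below.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Simplified location record (see the data-plane concepts below).
struct CacheLocation {
    std::string storage_uri;   // e.g. "3fs://cluster-a/inst42/blk_9f3c" (assumed)
    uint64_t    size_bytes;
};

class KVCacheManagerClient {   // control plane: metadata, quota, eviction
public:
    // Ask the manager where (if anywhere) the blocks for this prefix are stored.
    virtual std::optional<CacheLocation>
    GetCacheLocation(const std::string& instance_id,
                     const std::vector<uint64_t>& block_hashes) = 0;
    virtual ~KVCacheManagerClient() = default;
};

class StorageBackend {         // data plane: the engine talks to it directly
public:
    virtual std::vector<char> Read(const CacheLocation& loc) = 0;
    virtual ~StorageBackend() = default;
};

// Engine-side read path: one metadata round trip, then direct storage I/O.
std::optional<std::vector<char>>
TryLoadPrefix(KVCacheManagerClient& manager, StorageBackend& storage,
              const std::string& instance_id,
              const std::vector<uint64_t>& block_hashes) {
    auto loc = manager.GetCacheLocation(instance_id, block_hashes);
    if (!loc) return std::nullopt;   // cache miss: fall back to Prefill
    return storage.Read(*loc);       // hit: the payload bypasses the manager
}
```

The point of the design is visible in TryLoadPrefix: the manager is consulted once for metadata, and the much larger cache payload never flows through it.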
Control‑Plane Concepts
Storage: a backend (e.g., NFS, 3FS, TairMemPool, Mooncake) with its own connection parameters; multiple Instance Groups may share a Storage.
Instance Group: a logical group that shares a quota and a list of Storages; used to represent a business team, a model family, or a set of long-tail models.
Instance: a KVCache instance bound to a single model configuration (e.g., fp8/bf16, block size) and belonging to exactly one Instance Group.
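A minimal sketch of how these three control-plane objects might relate; all field names (quota_bytes, storage_ids, block_size_tokens, and so on) are assumptions for illustration, not the real Tair KVCM schema.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Storage {                       // one backend, shareable across groups
    std::string id;
    std::string type;                  // "NFS", "3FS", "TairMemPool", "Mooncake"
    std::string connection_params;     // backend-specific connection string
};

struct InstanceGroup {                 // quota + storage list for one tenant/team
    std::string              id;
    uint64_t                 quota_bytes;   // total quota shared by members
    std::vector<std::string> storage_ids;   // Storages this group may use
};

struct Instance {                      // one model configuration's KVCache
    std::string id;
    std::string group_id;              // belongs to exactly one InstanceGroup
    std::string dtype;                 // e.g. "fp8" or "bf16"
    uint32_t    block_size_tokens;     // tokens per KVCache block
};
```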
Data‑Plane Concepts
Block: a fixed-length token segment; each block carries prefix-dependency information.
CacheLocation: a storage location for a Block; a Block may have multiple CacheLocations across different storage types.
LocationSpec: one part of a CacheLocation, expressed as a URI plus a size and a spec name, so that mixed-attention scenarios can store only the parts they need.
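The same treatment for the data-plane objects; again, the field names (prefix_hash, spec_name, size_bytes) are illustrative assumptions, and the prefix-dependency information is modeled here as a single hash of the preceding chain.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct LocationSpec {                  // one addressable part of a location
    std::string spec_name;             // e.g. "full_attn" vs "sliding_window"
    std::string uri;                   // where this part's bytes live
    uint64_t    size_bytes;
};

struct CacheLocation {                 // one copy of a Block on one storage type
    std::string               storage_id;
    std::vector<LocationSpec> specs;   // mixed attention: keep only needed parts
};

struct Block {                         // fixed-length token segment
    uint64_t                   block_hash;   // identity of this token segment
    uint64_t                   prefix_hash;  // hash of the preceding prefix chain
    std::vector<CacheLocation> locations;    // may exist on several storages
};
```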
Management APIs
CRUD APIs for Storage, Instance Group, and Instance objects.
Account APIs for permission control.
Metrics APIs for observability.
Metadata APIs such as RegisterInstance, GetCacheLocation, StartWriteCache, FinishWriteCache, RemoveCache, and TrimCache. These support KV‑style, prefix‑style, and sliding‑window queries.
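The paired StartWriteCache/FinishWriteCache calls suggest a two-phase write: register intent with the manager, write the bytes directly to storage, then commit the metadata. The sketch below assumes this flow; the WriteTicket type, the signatures, and the rollback behavior are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical return value of StartWriteCache: a reservation the engine
// completes (or abandons) once the direct storage write finishes.
struct WriteTicket {
    std::string write_id;
    std::string target_uri;       // where the manager wants the bytes placed
};

// Assumed client-side declarations mirroring the documented metadata APIs.
WriteTicket StartWriteCache(const std::string& instance_id,
                            uint64_t block_hash, uint64_t size_bytes);
bool        WriteToStorage(const std::string& uri,
                           const std::vector<char>& bytes);   // data plane
void        FinishWriteCache(const std::string& write_id, bool success);

void SaveBlock(const std::string& instance_id, uint64_t block_hash,
               const std::vector<char>& bytes) {
    // Phase 1: manager checks quota, picks a storage, records in-flight state.
    WriteTicket t = StartWriteCache(instance_id, block_hash, bytes.size());
    // Data plane: the engine writes directly, without proxying through KVCM.
    bool ok = WriteToStorage(t.target_uri, bytes);
    // Phase 2: commit (or roll back) the metadata, so readers never see
    // half-written blocks via GetCacheLocation.
    FinishWriteCache(t.write_id, ok);
}
```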
Capacity Management (Reclaimer & Executor)
Each Instance Group can configure a total quota and per‑storage‑type quotas (e.g., 100 TB total, 1 TB TairMemPool, 99 TB 3FS). Water‑mark thresholds trigger eviction when usage exceeds the configured level. The Reclaimer runs asynchronously, supports LRU, LFU, TTL policies, and can be tuned per Instance Group. Quota and water‑mark values are updatable at runtime.
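A sketch of one asynchronous Reclaimer pass under the example quotas above, assuming a high/low water-mark pair and LRU victim selection; the threshold values and data structures are illustrative, and LFU or TTL policies would swap in a different victim ordering.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

struct StorageQuota {
    uint64_t quota_bytes;        // e.g. 1 TB for TairMemPool, 99 TB for 3FS
    uint64_t used_bytes;
    double   high_watermark;     // start evicting above this fraction, e.g. 0.90
    double   low_watermark;      // evict down to this fraction,        e.g. 0.80
};

// One reclaimer pass for one storage type: when usage crosses the high
// water mark, evict LRU blocks until usage falls to the low water mark.
void ReclaimLRU(StorageQuota& q,
                std::list<uint64_t>& lru_order,                 // front = coldest
                std::unordered_map<uint64_t, uint64_t>& sizes)  // hash -> bytes
{
    if (q.used_bytes <= q.quota_bytes * q.high_watermark) return;
    const uint64_t target =
        static_cast<uint64_t>(q.quota_bytes * q.low_watermark);
    while (q.used_bytes > target && !lru_order.empty()) {
        uint64_t victim = lru_order.front();   // least recently used block
        lru_order.pop_front();
        q.used_bytes -= sizes[victim];
        sizes.erase(victim);                   // plus a RemoveCache on metadata
    }
}
```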
Optimizer Module
The Optimizer replays real trace data against a radix-tree prefix index, evaluates cache hit rates under different capacities and eviction policies, and maps cache-induced Prefill reduction to GPU compute savings. It helps find the storage configuration that best balances cost and performance.
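A heavily simplified sketch of the what-if simulation at the Optimizer's core: requests are replayed as chains of prefix block hashes (a flat stand-in for the radix-tree index), and the cache is a plain LRU bounded by a candidate capacity. The resulting hit rate approximates the fraction of Prefill compute a given configuration would save.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

// Replay one trace against an LRU cache of `capacity_blocks` blocks and
// return the block hit rate, a proxy for Prefill compute saved.
double SimulateHitRate(const std::vector<std::vector<uint64_t>>& trace,
                       size_t capacity_blocks) {
    std::list<uint64_t> lru;                                  // front = coldest
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos;
    uint64_t hits = 0, total = 0;
    for (const auto& request : trace) {        // each request: prefix block chain
        for (uint64_t block_hash : request) {  // hash encodes its full prefix
            ++total;
            auto it = pos.find(block_hash);
            if (it != pos.end()) {                            // cache hit
                ++hits;
                lru.splice(lru.end(), lru, it->second);       // mark most recent
            } else {                                          // miss: admit block
                if (lru.size() == capacity_blocks) {          // evict coldest
                    pos.erase(lru.front());
                    lru.pop_front();
                }
                pos[block_hash] = lru.insert(lru.end(), block_hash);
            }
        }
    }
    return total ? static_cast<double>(hits) / total : 0.0;
}
```

Sweeping capacity_blocks across candidate sizes (and swapping in other eviction policies) yields the hit-rate-versus-cost curve from which the cheapest acceptable configuration can be read off.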
Future Work
Extend support to multimodal KVCache scenarios such as VLCache.
Adapt to private‑cloud and super‑node environments with specialized eviction algorithms.
Broaden integration with mainstream inference engines and additional storage backends.
Develop hierarchical eviction strategies tailored to LLM access patterns.
Strengthen enterprise‑grade features for large‑scale deployments.
References
RTP‑LLM GitHub repository: https://github.com/alibaba/rtp-llm
Qwen3‑Coder model on HuggingFace: https://huggingface.co/RTP-LLM/Qwen3-Coder-30B-A3B-Instruct-RTPurbo
DeepSeek Prefill throughput report: https://zhuanlan.zhihu.com/p/1905713771005063765
LLM usage traces: https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon
VLCache paper: https://arxiv.org/abs/2512.12977