How Alibaba’s Tair KVCache Manager Revolutionizes Enterprise‑Level LLM Cache Management

This article details the architecture and implementation of Tair KVCache Manager, an enterprise‑grade service that centralizes KVCache metadata, decouples inference engines from storage, and provides elastic scaling, multi‑tenant isolation, high availability, and performance‑optimized cache management for large‑scale LLM inference workloads.


Background

Agentic AI workloads generate long‑lived, high‑concurrency inference sessions with multi‑turn interactions. Traditional single‑node KVCache designs cannot keep up with the resulting increase in KVCache miss rates and non‑linear growth of Prefill load, especially when contexts span hundreds of thousands of tokens.

Design Goals

The system must provide:

Accurate capacity assessment and dynamic elastic scaling.

Multi‑tenant isolation and high‑availability guarantees.

Version‑aware management for seamless model upgrades.

Cost‑effective operation at PB‑scale storage.

System Architecture

Tair KVCache Manager (Tair KVCM) is a centralized C++ service exposing HTTP and gRPC interfaces. It separates the control plane (metadata, quota, eviction) from the data plane (actual KVCache storage), allowing inference engines to read/write directly to backend storage while the manager handles metadata.

Control‑Plane Concepts

Storage: a backend (e.g., NFS, 3FS, TairMemPool, Mooncake) with its own connection parameters; multiple Instance Groups may share a Storage.

Instance Group: a logical group that shares a quota and a list of Storages; used to represent a business team, a model family, or a set of long‑tail models.

Instance: a KVCache instance bound to a single model configuration (e.g., fp8/bf16, block size) and belonging to exactly one Instance Group.
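The relationships among these three objects can be sketched as a small data model. This is an illustrative Python sketch only; the class and field names (`conn_params`, `model_config`, `quota_bytes`, and so on) are assumptions, not the actual Tair KVCM schema.

```python
from dataclasses import dataclass, field

@dataclass
class Storage:
    name: str                                    # e.g. "3fs-prod"
    backend: str                                 # "NFS", "3FS", "TairMemPool", "Mooncake"
    conn_params: dict = field(default_factory=dict)

@dataclass
class Instance:
    instance_id: str
    model_config: str                            # e.g. "fp8-block64"

@dataclass
class InstanceGroup:
    name: str
    quota_bytes: int                             # shared by all instances in the group
    storages: list                               # Storages this group may use
    instances: list = field(default_factory=list)

# A Storage may be shared by multiple Instance Groups;
# each Instance belongs to exactly one group.
nfs = Storage("nfs-shared", "NFS")
team_a = InstanceGroup("team-a", quota_bytes=100 << 40, storages=[nfs])
team_a.instances.append(Instance("inst-1", "fp8-block64"))
```

The key design point is that quota is attached to the Instance Group, not to individual Instances, so a team running many long‑tail models shares one budget.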

Data‑Plane Concepts

Block: a fixed‑length token segment; each block carries prefix‑dependency information.

CacheLocation: a storage location for a Block; a Block may have multiple CacheLocations across different storage types.

LocationSpec: a component of a CacheLocation, expressed as a URI with a size and a spec name, enabling mixed‑attention scenarios to store only the required parts.
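The data‑plane hierarchy (Block → CacheLocation → LocationSpec) can likewise be sketched. Field names such as `token_hash`, `prefix_hash`, and `spec_name` are illustrative assumptions; only the containment structure follows the description above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LocationSpec:
    uri: str        # where this part lives, e.g. "3fs://cluster/path/blk0.k"
    size: int       # bytes
    spec_name: str  # e.g. "key" or "value", so mixed-attention models fetch only needed parts

@dataclass
class CacheLocation:
    storage_type: str                            # e.g. "3FS" or "TairMemPool"
    specs: List[LocationSpec] = field(default_factory=list)

@dataclass
class Block:
    token_hash: str                              # hash over this block's tokens
    prefix_hash: str                             # hash of the preceding block chain
    locations: List[CacheLocation] = field(default_factory=list)

# One Block cached on two storage tiers would simply carry two CacheLocations.
blk = Block(token_hash="abc", prefix_hash="root")
blk.locations.append(
    CacheLocation("3FS", [LocationSpec("3fs://c/blk.k", 4096, "key")])
)
```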

Management APIs

CRUD APIs for Storage, Instance Group, and Instance objects.

Account APIs for permission control.

Metrics APIs for observability.

Metadata APIs such as RegisterInstance, GetCacheLocation, StartWriteCache, FinishWriteCache, RemoveCache, and TrimCache. These support KV‑style, prefix‑style, and sliding‑window queries.
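The write path implied by these APIs is a two‑phase protocol: the engine reserves a location via StartWriteCache, writes the data directly to backend storage, and commits the metadata via FinishWriteCache. The following in‑memory mock is a flow sketch only; the real service is a C++ process behind HTTP/gRPC, and the method signatures here are assumptions.

```python
class MockKVCM:
    """Minimal in-memory stand-in for the metadata write path."""

    def __init__(self):
        self.committed = {}   # block_key -> location, visible to readers
        self.pending = {}     # block_key -> reserved location, not yet visible

    def get_cache_location(self, block_key):
        # Returns a committed location on hit, None on miss.
        return self.committed.get(block_key)

    def start_write_cache(self, block_key, location):
        # Reserve a location; the engine then writes data directly to storage.
        self.pending[block_key] = location
        return location

    def finish_write_cache(self, block_key):
        # Commit only after the engine confirms the storage write succeeded,
        # so readers never see a location with incomplete data.
        self.committed[block_key] = self.pending.pop(block_key)

kvcm = MockKVCM()
assert kvcm.get_cache_location("blk-0") is None            # miss
kvcm.start_write_cache("blk-0", "3fs://c/blk-0")
kvcm.finish_write_cache("blk-0")
assert kvcm.get_cache_location("blk-0") == "3fs://c/blk-0" # hit
```

Because the manager only handles metadata, the bulk KVCache bytes never pass through it, which is what keeps the control plane cheap to scale.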

Capacity Management (Reclaimer & Executor)

Each Instance Group can be configured with a total quota and per‑storage‑type quotas (e.g., 100 TB total: 1 TB on TairMemPool, 99 TB on 3FS). Watermark thresholds trigger eviction when usage exceeds the configured level. The Reclaimer runs asynchronously, supports LRU, LFU, and TTL policies, and can be tuned per Instance Group. Quota and watermark values can be updated at runtime.
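The watermark mechanism can be sketched as follows: exceed a high watermark, reclaim down to a low watermark. This is a synchronous toy with made‑up thresholds (the real Reclaimer runs asynchronously, and the parameter names are assumptions); LRU is shown as one of the supported policies.

```python
from collections import OrderedDict

class Reclaimer:
    """Toy watermark-triggered LRU reclaimer for one Instance Group."""

    def __init__(self, quota_bytes, high_watermark=0.9, low_watermark=0.8):
        self.high = high_watermark * quota_bytes   # start evicting above this
        self.low = low_watermark * quota_bytes     # evict down to this
        self.blocks = OrderedDict()                # block -> size, in LRU order
        self.used = 0

    def touch(self, block, size):
        # Record an access; insert the block if it is new.
        if block in self.blocks:
            self.blocks.move_to_end(block)
        else:
            self.blocks[block] = size
            self.used += size
            if self.used > self.high:
                self._reclaim()

    def _reclaim(self):
        # Evict least-recently-used blocks until usage drops to the low watermark.
        while self.used > self.low and self.blocks:
            _, size = self.blocks.popitem(last=False)
            self.used -= size

r = Reclaimer(quota_bytes=100)
for i in range(12):
    r.touch(f"blk-{i}", 10)   # 120 bytes inserted against a 90-byte high watermark
# Usage has been reclaimed back down to the 80-byte low watermark.
```

The gap between high and low watermarks is what lets eviction run in batches instead of thrashing on every insert.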

Optimizer Module

The Optimizer simulates real trace data using a radix‑tree prefix index, evaluates cache hit rates under different capacities and eviction policies, and maps cache‑induced Prefill reduction to GPU compute savings. It helps find the optimal storage configuration that balances cost and performance.
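The core of such a simulation can be illustrated with a prefix index replaying a trace: count how many leading blocks of each request are already cached under a given capacity. The trace format, FIFO eviction stand‑in, and numbers below are all made up for illustration; the real Optimizer uses a radix tree and the configured eviction policies.

```python
def simulate_hit_rate(traces, capacity_blocks):
    """traces: list of requests, each a tuple of block hashes in prefix order."""
    cache = {}      # block-chain prefix -> cached flag
    order = []      # insertion order (FIFO eviction as a simple stand-in)
    hits = total = 0
    for seq in traces:
        prefix = ()
        for blk in seq:
            prefix = prefix + (blk,)   # a block only hits if its whole prefix chain did
            total += 1
            if prefix in cache:
                hits += 1
            else:
                if len(order) >= capacity_blocks:
                    del cache[order.pop(0)]   # evict the oldest prefix
                cache[prefix] = True
                order.append(prefix)
    return hits / total if total else 0.0

# Three sessions sharing a common two-block system-prompt prefix ("s1", "s2"):
traces = [("s1", "s2", "a"), ("s1", "s2", "b"), ("s1", "s2", "c")]
rate = simulate_hit_rate(traces, capacity_blocks=16)   # 4 hits out of 9 lookups
```

Multiplying the hit rate by the Prefill tokens it replaces is what lets the Optimizer translate cache capacity into GPU compute savings when comparing configurations.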

Future Work

Extend support to multimodal KVCache scenarios such as VLCache.

Adapt to private‑cloud and super‑node environments with specialized eviction algorithms.

Broaden integration with mainstream inference engines and additional storage backends.

Develop hierarchical eviction strategies tailored to LLM access patterns.

Strengthen enterprise‑grade features for large‑scale deployments.

References

RTP‑LLM GitHub repository: https://github.com/alibaba/rtp-llm

Qwen3‑Coder model on HuggingFace: https://huggingface.co/RTP-LLM/Qwen3-Coder-30B-A3B-Instruct-RTPurbo

DeepSeek Prefill throughput report: https://zhuanlan.zhihu.com/p/1905713771005063765

LLM usage traces: https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon

VLCache paper: https://arxiv.org/abs/2512.12977

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.