How Mooncake's KVCache Boosts Large-Model Inference Efficiency and Cuts Costs

Mooncake, an open-source large-model inference platform, introduces a KVCache-centric architecture that improves throughput, cuts per-token latency by roughly 20%, and lowers inference costs. It integrates with frameworks such as SGLang and vLLM and leverages Alibaba Cloud's eRDMA and GPUDirect technologies for scalable, high-performance deployments.


Background

In June 2024 the team behind the Kimi large-model assistant and Tsinghua University's MADSys lab introduced the Mooncake inference architecture, which places a shared KVCache at the core of the serving system. The design improves throughput and reduces the cost of long-context inference. In November 2024 the project was open-sourced by a consortium that includes Alibaba Cloud, Qujing Technology, Ant Group and 9#AISoft, with the goal of standardising cache pooling and resource decoupling for large-model serving.

KVCache‑Centric Architecture

Mooncake treats the KVCache as a distributed storage layer that can be reused across multiple inference instances. The architecture consists of:

Transfer Engine: provides full-link zero-copy communication, supports up to eight 400 Gbps NICs, is topology-aware, and includes fault tolerance, load balancing and multi-protocol handling.

KVCache Store: occupies idle GPU memory and leverages inter-GPU bandwidth to minimise response latency while lowering hardware cost. Control-path overhead is reduced by using Alibaba's open-source RPC framework coro_rpc.

Both components are designed to work on eRDMA‑based networks and are compatible with GPUDirect, enabling high‑performance data movement without extra copies.
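
To make the reuse idea concrete, the sketch below models a shared KVCache pool in plain Python: prefill output is stored under a hash of the prompt prefix, and later requests that share that prefix fetch the cached blocks instead of recomputing them. This is an illustrative stand-in only; the names (KVCachePool, prefill, serve_request) are hypothetical and do not reflect Mooncake's actual API, and a real pool would span remote memory reached via the Transfer Engine rather than an in-process dict.

import hashlib
from typing import Dict, List, Optional

# Hypothetical stand-in for a distributed KVCache pool. In Mooncake the pool
# spans memory across instances and is reached through the Transfer Engine;
# a local dict is enough here to show the reuse pattern.
class KVCachePool:
    def __init__(self) -> None:
        self._blocks: Dict[str, List[float]] = {}

    @staticmethod
    def key_for(prefix_tokens: List[int]) -> str:
        # Content-address the cache by the token prefix so identical prefixes
        # (system prompts, few-shot examples) map to the same entry.
        return hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[List[float]]:
        return self._blocks.get(key)

    def put(self, key: str, kv_blocks: List[float]) -> None:
        self._blocks[key] = kv_blocks


def prefill(prefix_tokens: List[int]) -> List[float]:
    # Placeholder for the expensive attention prefill pass.
    return [float(t) for t in prefix_tokens]


def serve_request(pool: KVCachePool, prefix_tokens: List[int]) -> List[float]:
    key = KVCachePool.key_for(prefix_tokens)
    cached = pool.get(key)
    if cached is not None:
        return cached              # cache hit: skip recomputing the prefill
    kv = prefill(prefix_tokens)    # cache miss: compute once...
    pool.put(key, kv)              # ...then publish for other requests
    return kv


if __name__ == "__main__":
    pool = KVCachePool()
    shared_prompt = [1, 2, 3, 4]
    serve_request(pool, shared_prompt)   # miss: prefill computed
    serve_request(pool, shared_prompt)   # hit: prefill skipped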

Integration with Existing Inference Frameworks

Mooncake publishes .whl packages for pip installation and Docker images for containerised deployment. It integrates with popular open-source inference stacks such as SGLang and vLLM, exposing a prefill/decode (PD)-disaggregated architecture in which the prefill and decode phases of inference run on separate instances and hand off KVCache through the shared store. When the Transfer Engine is enabled, the stack achieves near-zero-copy communication via GPUDirect RDMA (GDR), supporting EP + DP + TP + PD deployment scenarios for models like DeepSeek.
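
The sketch below illustrates that PD hand-off under simplifying assumptions: a prefill worker builds the KV cache for a prompt and publishes it to a shared store, and a decode worker later fetches it instead of recomputing the prefill. The function names (run_prefill_worker, run_decode_worker) and the in-process dict are hypothetical placeholders, not the actual SGLang/vLLM integration.

from typing import Dict, List

# Shared KVCache store stand-in (in a real deployment this would be the
# pooled store reached over eRDMA/GPUDirect, not an in-process dict).
SHARED_STORE: Dict[str, List[float]] = {}


def run_prefill_worker(request_id: str, prompt_tokens: List[int]) -> str:
    """Prefill phase: build the KV cache for the prompt and publish it."""
    kv_blocks = [float(t) for t in prompt_tokens]  # placeholder for real prefill
    SHARED_STORE[request_id] = kv_blocks           # hand-off to the decode side
    return request_id


def run_decode_worker(request_id: str, max_new_tokens: int) -> List[int]:
    """Decode phase: fetch the published KV cache and generate tokens."""
    kv_blocks = SHARED_STORE[request_id]           # fetched, not recomputed
    generated: List[int] = []
    for step in range(max_new_tokens):
        # Placeholder decode step; a real worker would attend over kv_blocks.
        generated.append(len(kv_blocks) + step)
    return generated


if __name__ == "__main__":
    rid = run_prefill_worker("req-1", prompt_tokens=[10, 20, 30])
    print(run_decode_worker(rid, max_new_tokens=4))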

Performance Results

Benchmarks reported by the Mooncake team show:

Approximately 20% reduction in time-per-output-token (TPOT) latency, bringing inference cost down to roughly $0.20 per million tokens.

When combined with the LMCache plugin, average response time under cache-hit conditions drops by 69.1% and throughput increases by 191%.

These gains stem from KVCache reuse across requests, zero‑copy data paths, and the ability to pool idle GPU memory.
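
As a sanity check on what those percentages mean in practice, the snippet below applies the reported cache-hit figures to an assumed baseline; the 2.0 s response time and 100 req/s throughput are illustrative numbers, not measurements from the Mooncake team.

# Apply the reported cache-hit improvements to an assumed baseline.
baseline_response_s = 2.0      # assumed baseline average response time
baseline_throughput = 100.0    # assumed baseline requests per second

response_drop = 0.691          # 69.1% reduction reported with LMCache
throughput_gain = 1.91         # 191% increase reported with LMCache

new_response_s = baseline_response_s * (1 - response_drop)
new_throughput = baseline_throughput * (1 + throughput_gain)

print(f"response time: {baseline_response_s:.2f}s -> {new_response_s:.2f}s")
print(f"throughput:    {baseline_throughput:.0f} -> {new_throughput:.0f} req/s")
# response time: 2.00s -> 0.62s
# throughput:    100 -> 291 req/s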

Deployment and Ecosystem Adoption

Mooncake’s Docker images and pip packages simplify large‑scale deployment on cloud environments that provide Alibaba’s eRDMA network path. The project has become the default inference solution for the SGLang community and is referenced by the Dynamo ecosystem’s Nixl transport framework.

Community Impact

On GitHub the repository https://github.com/kvcache-ai/mooncake has accumulated over 3,000 stars and more than twenty active contributors. Code from the project is being merged into other open-source inference projects, and it has attracted interest from engineers at Tencent, Meituan, iFlytek and other enterprises.

Roadmap

Future work includes:

Release of Mooncake Store v2, which will enable KVCache sharing across multiple LLM instances.

Support for additional inference frameworks such as LMDeploy and TensorRT‑LLM.

Further performance enhancements through plugins like LMCache and deeper integration with high‑performance networking stacks.

Key Repository

GitHub: https://github.com/kvcache-ai/mooncake