Mooncake: Open-Source KVCache-Centric Architecture Boosting Large-Model Inference

Mooncake is an open-source, KVCache-centric inference architecture that Alibaba Cloud has co-built with Tsinghua University's MADSys lab. It raises large-model inference throughput and lowers cost by decoupling resources, standardizing cache pooling, and integrating with frameworks such as vLLM, and it has drawn broad industry interest.

Alibaba Cloud Developer

In June 2024, the Chinese large-model application Kimi and Tsinghua University's MADSys lab (Machine Learning, AI, Big Data Systems Lab) jointly released Mooncake, a KVCache-centric inference architecture. Mooncake separates the prefill and decode stages (PD separation) around a shared KVCache and trades storage for computation by reusing cached results instead of recomputing them. This significantly raises the inference throughput of the Kimi smart assistant while lowering cost, and it has attracted wide industry attention.
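To make PD separation concrete, here is a minimal toy sketch, not Mooncake's actual code or API. It assumes a prefill node that processes the whole prompt once and publishes the resulting KVCache, and a decode node that picks that cache up and generates tokens one at a time. All names (`KVCacheStore`, `PrefillNode`, `DecodeNode`, `fake_kv`) are hypothetical, and the "model" is a placeholder arithmetic rule.

```python
from dataclasses import dataclass, field


@dataclass
class KVCacheStore:
    """Hypothetical stand-in for a pooled KVCache layer (a plain dict here)."""
    entries: dict = field(default_factory=dict)

    def put(self, request_id, kv):
        self.entries[request_id] = kv

    def take(self, request_id):
        # pop: the decode node takes ownership of the handed-off cache.
        return self.entries.pop(request_id)


def fake_kv(token):
    """Placeholder for the per-token key/value tensors a real model produces."""
    return (token * 2, token * 3)


class PrefillNode:
    """Processes the full prompt in one pass and publishes its KVCache."""

    def __init__(self, store):
        self.store = store

    def run(self, request_id, prompt):
        self.store.put(request_id, [fake_kv(t) for t in prompt])


class DecodeNode:
    """Generates tokens one by one, extending the cache it received."""

    def __init__(self, store):
        self.store = store

    def run(self, request_id, max_new_tokens):
        kv = self.store.take(request_id)
        output = []
        for _ in range(max_new_tokens):
            next_token = len(kv) % 7  # toy rule standing in for a model step
            output.append(next_token)
            kv.append(fake_kv(next_token))
        return output


store = KVCacheStore()
PrefillNode(store).run("req-1", [5, 8, 13])
tokens = DecodeNode(store).run("req-1", max_new_tokens=4)
print(tokens)  # [3, 4, 5, 6]
```

The point of the split is that the two stages have different resource profiles, so each can be scaled on its own pool of machines, with the KVCache as the only thing that has to move between them.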

Mooncake architecture diagram

Under the Alibaba Innovative Research (AIR) program, Alibaba Cloud and Tsinghua University have explored practical industrial applications of large-model resource pooling and produced a number of technical results. To accelerate large-model inference, and in particular to standardize the cache-pooling layer shared by inference instances, the two parties jointly built the Mooncake project. It abstracts the underlying cache-pooling interface and provides an efficient, distributed, resource-decoupled architecture optimized for large-model scenarios.

As an AI infrastructure provider, Alibaba Cloud contributed code to key Mooncake components, including the Transfer Engine, peer-to-peer storage, and high-performance memory storage. At the inference-framework level, it integrated Mooncake with the widely used vLLM framework, which markedly improved inference performance and gives other frameworks a reference implementation. The Transfer Engine runs on Alibaba Cloud's self-developed eRDMA network, and CXL support is planned, so that the system can be deployed in the cloud quickly and at scale.

Professor Zhang Mingxing of Tsinghua's MADSys lab noted that Mooncake makes full use of the CPU, memory, and SSD resources in AI infrastructure and speeds up the handling of inference requests. Through resource decoupling, it also lets different inference instances share cached data, which reduces waste. The open-source release aims to bring industry, academia, and research together to accelerate the development of large-model inference systems.
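The cache sharing described here can be illustrated with a small sketch, written under stated assumptions rather than taken from Mooncake's implementation. It assumes the KVCache is stored in fixed-size blocks keyed by a hash of the whole token prefix, that a small memory tier spills to a larger SSD tier instead of discarding blocks, and that several inference instances read and write one pool. `SharedCachePool`, `InferenceInstance`, `block_keys`, and `BLOCK_TOKENS` are hypothetical names.

```python
import hashlib
from collections import OrderedDict

BLOCK_TOKENS = 4  # hypothetical block size, kept tiny for the example


def block_keys(tokens):
    """One key per full block; each key covers the entire prefix up to it."""
    keys, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_TOKENS
    for start in range(0, full, BLOCK_TOKENS):
        h.update(repr(tokens[start:start + BLOCK_TOKENS]).encode())
        keys.append(h.hexdigest())
    return keys


class SharedCachePool:
    """Toy two-tier pool: a small 'memory' tier backed by an 'ssd' tier."""

    def __init__(self, memory_blocks):
        self.memory_blocks = memory_blocks
        self.memory = OrderedDict()  # key -> block, least recently used first
        self.ssd = {}

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.ssd:
            block = self.ssd.pop(key)
            self.put(key, block)  # promote back into the memory tier
            return block
        return None

    def put(self, key, block):
        self.memory[key] = block
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_blocks:
            old_key, old_block = self.memory.popitem(last=False)
            self.ssd[old_key] = old_block  # demote instead of dropping


class InferenceInstance:
    def __init__(self, name, pool):
        self.name, self.pool = name, pool
        self.blocks_computed = 0
        self.blocks_reused = 0

    def prefill(self, tokens):
        for key in block_keys(tokens):
            if self.pool.get(key) is not None:
                self.blocks_reused += 1
            else:
                # Placeholder string stands in for real KV tensors.
                self.pool.put(key, f"kv-from-{self.name}")
                self.blocks_computed += 1


pool = SharedCachePool(memory_blocks=2)
a, b = InferenceInstance("a", pool), InferenceInstance("b", pool)
system_prompt = list(range(8))  # two blocks shared by both requests
a.prefill(system_prompt + [100, 101, 102, 103])
b.prefill(system_prompt + [200, 201, 202, 203])
print(b.blocks_reused, b.blocks_computed)  # 2 1
```

In this run, instance `b` reuses the two system-prompt blocks that instance `a` computed, even though one of them had already been demoted to the SSD tier, and only computes the block that is unique to its own request.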

Looking ahead, Alibaba Cloud will deepen its participation in Mooncake and work with more enterprises, institutions, and universities to explore more efficient and advanced architectures for model inference systems, so that large-model technology can benefit a wide range of industries.
