Mooncake: A Separated Architecture for Large‑Language‑Model Inference
The article presents Mooncake, a split‑architecture inference platform for the Kimi LLM assistant, detailing its three elastic resource pools, the rationale for using Time‑Between‑Tokens over TPOT, and design choices for Prefill, KVCache, and Decode stages to improve latency and throughput.
Mooncake is the underlying inference platform of the Kimi intelligent assistant created by Moonshot, and this article serves as a condensed technical report that discusses several design choices that have not yet reached consensus in the community.
The architecture follows a classic separated design, dividing a homogeneous GPU cluster into three independently scalable resource pools: a Prefill Pool that handles user inputs and optimizes Time‑to‑First‑Token (TTFT), a KVCache Pool that provides a global prefix cache, and a Decode Pool that performs autoregressive token generation, thereby improving both TTFT and Time‑Between‑Tokens (TBT) performance.
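The three-pool flow described above can be sketched in a few lines. This is a hypothetical illustration: the class and function names (`KVCachePool`, `prefill`, `decode`, `serve`) are invented for this sketch and are not Mooncake's actual API; the prefill and decode stages are reduced to stubs so the routing logic is visible.

```python
from dataclasses import dataclass, field

def prefill(tokens):
    # Stand-in for the Prefill Pool: pretend the KV cache
    # is just the token tuple itself.
    return tuple(tokens)

def decode(kv):
    # Stand-in for the Decode Pool: emit a fixed number of tokens.
    for i in range(3):
        yield f"tok{i}"

@dataclass
class KVCachePool:
    """Global prefix cache shared by prefill and decode nodes."""
    store: dict = field(default_factory=dict)

    def lookup(self, prefix):
        return self.store.get(prefix)

    def put(self, prefix, kv_blocks):
        self.store[prefix] = kv_blocks

def serve(prompt_tokens, cache):
    """Route one request through the three pools."""
    prefix = tuple(prompt_tokens)
    kv = cache.lookup(prefix)          # KVCache Pool: reuse a cached prefix
    if kv is None:
        kv = prefill(prompt_tokens)    # Prefill Pool: optimize TTFT
        cache.put(prefix, kv)
    yield from decode(kv)              # Decode Pool: autoregressive generation
```

A second request with the same prompt prefix would skip the prefill step entirely, which is the throughput win the global cache is after.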
The authors argue for reporting TBT rather than the more common Time‑Per‑Output‑Token (TPOT): TPOT averages latency over the entire response and can mask intermittent stalls, whereas TBT measures the gap between each pair of consecutive tokens and therefore better reflects the user experience of streaming interactions.
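The difference between the two metrics is easy to see numerically. The snippet below uses made-up timestamps for five streamed tokens with one mid-stream stall: TPOT looks acceptable because the stall is averaged away, while the maximum TBT exposes it.

```python
def token_gaps(timestamps):
    """Latency between consecutive emitted tokens (the TBT samples)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

# Arrival times (seconds) of 5 streamed tokens; one long stall in the middle.
ts = [0.00, 0.05, 0.10, 0.60, 0.65]

# TPOT: total generation time averaged over the output tokens.
tpot = (ts[-1] - ts[0]) / (len(ts) - 1)   # ~0.16 s per token: looks fine

# TBT: the per-gap latency the user actually perceives.
tbt_max = max(token_gaps(ts))             # ~0.50 s: the stall TPOT hides
```

The same average can hide arbitrarily long pauses, which is why the authors treat TBT as the better streaming-latency target.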
In the Prefill Pool discussion, they examine whether Prefill should be a separate node, introduce a VRAM‑occupation‑cost metric (KVCache size × residence time), and propose a multi‑node distributed chunked‑prefill strategy based on a simplified TeraPipe approach that reduces communication overhead.
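The VRAM-occupation-cost metric mentioned above is just KVCache size multiplied by residence time. A minimal sketch follows, assuming a standard transformer KV layout (K and V per layer, per KV head); the model dimensions are illustrative placeholders, not Kimi's actual configuration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """KV cache size for one sequence: K and V tensors, fp16 by default."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem

def vram_occupation_cost(num_tokens, residence_seconds, **model_dims):
    """The article's metric: KVCache size x time it occupies VRAM."""
    return kv_cache_bytes(num_tokens, **model_dims) * residence_seconds

# Illustrative dimensions (NOT the real model's): 32 layers, 8 KV heads, d=128.
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128)
size = kv_cache_bytes(4096, **dims)                    # 512 MiB for 4K tokens
cost = vram_occupation_cost(4096, residence_seconds=1.5, **dims)
```

The metric makes the trade-off explicit: a long prompt that lingers in VRAM is more expensive than an equally long prompt that is evicted quickly, which motivates scheduling prefill to minimize residence time, not just compute.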
The KVCache Pool section describes a global KVCache scheduling mechanism across machines, a heuristic hotspot‑identification and replication method, and how this global cache improves reuse and overall throughput compared with single‑node prefix caches.
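A hotspot-replication heuristic of the kind described can be sketched as follows. This is an assumption-laden illustration, not Mooncake's actual policy: the hit-count threshold, the `HotspotReplicator` class, and the node-selection logic are all invented for the sketch.

```python
from collections import Counter

class HotspotReplicator:
    """Count prefix hits and replicate hot prefixes to extra cache nodes."""

    def __init__(self, hot_threshold=3):
        self.hits = Counter()              # prefix -> hit count
        self.hot_threshold = hot_threshold # hits before a prefix counts as hot
        self.replicas = {}                 # prefix -> set of nodes holding it

    def record_hit(self, prefix, home_node):
        """Note a cache hit on the node that originally stored the prefix."""
        self.hits[prefix] += 1
        self.replicas.setdefault(prefix, {home_node})

    def maybe_replicate(self, prefix, candidate_node):
        """Copy a hot prefix to another node to spread read load."""
        if self.hits[prefix] >= self.hot_threshold:
            self.replicas[prefix].add(candidate_node)
        return self.replicas[prefix]
```

Replicating only prefixes that cross the threshold keeps cold entries single-copy, so the extra VRAM spent on replicas is concentrated where it actually raises cache-hit throughput.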
The Decode Pool follows common industry practice (e.g., vLLM) and outlines future work such as splitting the decode stage into separate attention and linear pools, discussing hardware constraints and potential performance gains.
Overall, the article highlights the engineering trade‑offs of a KVCache‑centric, separated inference architecture and provides a link to the full paper for further details.
Detailed paper, author list, and references for the concepts discussed in this article:
https://github.com/kvcache-ai/MooncakeArchitect