Mooncake: A Separated Architecture for Large‑Language‑Model Inference
The article presents Mooncake, a split‑architecture inference platform for the Kimi LLM assistant, detailing its three elastic resource pools, the rationale for using Time‑Between‑Tokens over TPOT, and design choices for Prefill, KVCache, and Decode stages to improve latency and throughput.
Mooncake is the underlying inference platform of the Kimi intelligent assistant created by Moonshot, and this article serves as a condensed technical report that discusses several design choices that have not yet reached consensus in the community.
The architecture follows a classic separated design, dividing a homogeneous GPU cluster into three independently scalable resource pools: a Prefill Pool that handles user inputs and optimizes Time‑to‑First‑Token (TTFT), a KVCache Pool that provides a global prefix cache, and a Decode Pool that performs autoregressive token generation, thereby improving both TTFT and Time‑Between‑Tokens (TBT) performance.
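The three-pool flow described above can be sketched in a few lines. This is a hypothetical illustration: the class and function names (`KVCachePool`, `prefill`, `decode`, `serve`) are invented for this sketch and are not Mooncake's actual API; the prefill and decode stages are reduced to stubs so the routing logic is visible.

```python
from dataclasses import dataclass, field

def prefill(tokens):
    # Stand-in for the Prefill Pool: pretend the KV cache
    # is just the token tuple itself.
    return tuple(tokens)

def decode(kv):
    # Stand-in for the Decode Pool: emit a fixed number of tokens.
    for i in range(3):
        yield f"tok{i}"

@dataclass
class KVCachePool:
    """Global prefix cache shared by prefill and decode nodes."""
    store: dict = field(default_factory=dict)

    def lookup(self, prefix):
        return self.store.get(prefix)

    def put(self, prefix, kv_blocks):
        self.store[prefix] = kv_blocks

def serve(prompt_tokens, cache):
    """Route one request through the three pools."""
    prefix = tuple(prompt_tokens)
    kv = cache.lookup(prefix)          # KVCache Pool: reuse a cached prefix
    if kv is None:
        kv = prefill(prompt_tokens)    # Prefill Pool: optimize TTFT
        cache.put(prefix, kv)
    yield from decode(kv)              # Decode Pool: autoregressive generation
```

A second request with the same prompt prefix would skip the prefill step entirely, which is the throughput win the global cache is after.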
The authors argue for reporting TBT rather than the more common Time‑Per‑Output‑Token (TPOT): TPOT averages latency over the entire response and can mask intermittent stalls, whereas TBT measures the gap between each pair of consecutive tokens and therefore better reflects the user experience of streaming interactions.
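The difference between the two metrics is easy to see numerically. The snippet below uses made-up timestamps for five streamed tokens with one mid-stream stall: TPOT looks acceptable because the stall is averaged away, while the maximum TBT exposes it.

```python
def token_gaps(timestamps):
    """Latency between consecutive emitted tokens (the TBT samples)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

# Arrival times (seconds) of 5 streamed tokens; one long stall in the middle.
ts = [0.00, 0.05, 0.10, 0.60, 0.65]

# TPOT: total generation time averaged over the output tokens.
tpot = (ts[-1] - ts[0]) / (len(ts) - 1)   # ~0.16 s per token: looks fine

# TBT: the per-gap latency the user actually perceives.
tbt_max = max(token_gaps(ts))             # ~0.50 s: the stall TPOT hides
```

The same average can hide arbitrarily long pauses, which is why the authors treat TBT as the better streaming-latency target.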
In the Prefill Pool discussion, they examine whether Prefill should be a separate node, introduce a VRAM‑occupation‑cost metric (KVCache size × residence time), and propose a multi‑node distributed chunked‑prefill strategy based on a simplified TeraPipe approach that reduces communication overhead.
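The VRAM-occupation-cost metric mentioned above is just KVCache size multiplied by residence time. A minimal sketch follows, assuming a standard transformer KV layout (K and V per layer, per KV head); the model dimensions are illustrative placeholders, not Kimi's actual configuration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """KV cache size for one sequence: K and V tensors, fp16 by default."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem

def vram_occupation_cost(num_tokens, residence_seconds, **model_dims):
    """The article's metric: KVCache size x time it occupies VRAM."""
    return kv_cache_bytes(num_tokens, **model_dims) * residence_seconds

# Illustrative dimensions (NOT the real model's): 32 layers, 8 KV heads, d=128.
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128)
size = kv_cache_bytes(4096, **dims)                    # 512 MiB for 4K tokens
cost = vram_occupation_cost(4096, residence_seconds=1.5, **dims)
```

The metric makes the trade-off explicit: a long prompt that lingers in VRAM is more expensive than an equally long prompt that is evicted quickly, which motivates scheduling prefill to minimize residence time, not just compute.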
The KVCache Pool section describes a global KVCache scheduling mechanism across machines, a heuristic hotspot‑identification and replication method, and how this global cache improves reuse and overall throughput compared with single‑node prefix caches.
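A hotspot-replication heuristic of the kind described can be sketched as follows. This is an assumption-laden illustration, not Mooncake's actual policy: the hit-count threshold, the `HotspotReplicator` class, and the node-selection logic are all invented for the sketch.

```python
from collections import Counter

class HotspotReplicator:
    """Count prefix hits and replicate hot prefixes to extra cache nodes."""

    def __init__(self, hot_threshold=3):
        self.hits = Counter()              # prefix -> hit count
        self.hot_threshold = hot_threshold # hits before a prefix counts as hot
        self.replicas = {}                 # prefix -> set of nodes holding it

    def record_hit(self, prefix, home_node):
        """Note a cache hit on the node that originally stored the prefix."""
        self.hits[prefix] += 1
        self.replicas.setdefault(prefix, {home_node})

    def maybe_replicate(self, prefix, candidate_node):
        """Copy a hot prefix to another node to spread read load."""
        if self.hits[prefix] >= self.hot_threshold:
            self.replicas[prefix].add(candidate_node)
        return self.replicas[prefix]
```

Replicating only prefixes that cross the threshold keeps cold entries single-copy, so the extra VRAM spent on replicas is concentrated where it actually raises cache-hit throughput.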
The Decode Pool follows common industry practice (e.g., vLLM) and outlines future work such as splitting the decode stage into separate attention and linear pools, discussing hardware constraints and potential performance gains.
Overall, the article highlights the engineering trade‑offs of a KVCache‑centric, separated inference architecture and provides a link to the full paper for further details.
Detailed paper, author list, and references for the concepts discussed in this article:
https://github.com/kvcache-ai/MooncakeArchitect