Jul 2, 2024 · Artificial Intelligence
Mooncake: A Separated Architecture for Large‑Language‑Model Inference
The article presents Mooncake, a disaggregated inference architecture serving the Kimi LLM assistant. It details the platform's three elastic resource pools, the rationale for adopting Time Between Tokens (TBT) rather than Time Per Output Token (TPOT) as the latency objective, and the design choices in the Prefill, KVCache, and Decode stages that improve both latency and throughput.
AI systems · KVCache · LLM inference
9 min read