Tag: KVCache

DataFunSummit
Dec 4, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference with the YiNian LLM Framework

This article presents the YiNian LLM framework, detailing how KVCache, prefill/decode separation, continuous batching, PagedAttention, and multi‑hardware scheduling are used to speed up large language model inference while managing GPU memory and latency.

AI acceleration · GPU · KVCache
20 min read

Alibaba Cloud Infrastructure
Nov 29, 2024 · Artificial Intelligence

Mooncake: Open-Source KVCache-Centric Large Model Inference Architecture Co-Developed by Alibaba Cloud and Tsinghua University

In June 2024, Alibaba Cloud and Tsinghua University's MADSys Lab announced the open‑source Mooncake architecture, a KVCache‑centric large‑model inference framework that boosts throughput, lowers cost, and standardizes resource‑pooling techniques for high‑performance AI inference across industry and academia.

AI infrastructure · Alibaba Cloud · KVCache
4 min read

Architect
Jul 2, 2024 · Artificial Intelligence

Mooncake: A Separated Architecture for Large‑Language‑Model Inference

The article presents Mooncake, a disaggregated inference architecture for the Kimi LLM assistant, detailing its three elastic resource pools, the rationale for adopting Time‑Between‑Tokens (TBT) rather than Time‑Per‑Output‑Token (TPOT) as the latency metric, and the design choices behind the Prefill, KVCache, and Decode stages that improve latency and throughput.

AI systems · KVCache · LLM inference
9 min read