Boosting LLM Inference: RoleBasedGroup & Mooncake for Stable, High‑Performance Service
Large language model inference faces memory pressure, but by externalizing KVCache with Mooncake and orchestrating roles via the Kubernetes‑native RoleBasedGroup (RBG), developers can achieve stable, high‑throughput, cost‑effective serving with seamless in‑place upgrades and topology‑aware performance.
Background
Large language model (LLM) inference services are becoming core infrastructure for enterprise AI applications. Production deployment requires a balance of performance, stability, and cost. As model sizes grow, KVCache memory consumption can exceed 70% of GPU HBM, making on‑device caching unsustainable for long contexts or high concurrency.
Key Challenges in LLM Inference
Rapid evolution of distributed inference architectures (Prefill/Decode separation, Attention‑FFN separation, KVCache offloading).
High sensitivity to hardware topology (GPU‑NVLink, PCIe, RDMA) and latency metrics such as TTFT and TPOT.
Strong coupling between roles (Prefill ↔ Decode) that makes version upgrades and rollbacks error‑prone.
Low operational efficiency: manual coordination for restarts, scaling, and fault recovery consumes up to 5% of daily operational time.
Significant resource under‑utilization: GPU utilization often stays below 30% due to static provisioning.
Mooncake: Distributed KVCache Store
Mooncake is an industry‑grade distributed KVCache engine designed to address this memory bottleneck. It provides a high‑throughput, low‑latency KVCache service via a dedicated cache cluster, supporting multi‑replica storage, striped transfers, and hotspot load balancing.
Mooncake Core Components
Master Service: Manages storage pools, metadata, and node lifecycles.
Store Service: Provides distributed cache storage with replication, striping, and LRU‑plus‑high‑watermark eviction.
Key Features
RDMA‑accelerated zero‑copy data access for high bandwidth.
Intelligent prefetch and direct GPU transfer to maximize I/O efficiency.
Native support for Prefill‑Decode (PD) separation, boosting token‑level throughput.
RoleBasedGroup (RBG): Kubernetes‑Native Role Orchestration
RBG introduces the concept of "role as a first‑class citizen" and provides a unified API to manage multiple cooperating roles (e.g., Prefill, Decode, Mooncake) as a single service. It implements the SCOPE capability framework:
Stable: Deterministic, topology‑aware operations using a minimal replacement domain.
Coordination: Declarative dependency definitions for deployment, upgrade, fault handling, and scaling.
Orchestration: Role‑aware service discovery and ordered startup.
Performance: Topology‑aware scheduling with GPU‑NVLink > PCIe > RDMA > VPC priority.
Extensible: Declarative API plus a plugin mechanism that accommodates future architectures without core code changes.
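To make the Orchestration capability concrete, a role can declare which other roles must be ready before it starts, so RBG brings the group up in dependency order. The sketch below is illustrative only; the `dependencies` field name and role names are assumptions and may differ from the actual RBG API:

```yaml
# Illustrative sketch: the `dependencies` field name is an assumption
# and may not match the actual RBG API. Replica counts are placeholders.
roles:
  - name: mooncake-master
    replicas: 1
  - name: prefill
    replicas: 3
    dependencies:          # start prefill only once the cache master is ready
      - mooncake-master
  - name: decode
    replicas: 6
    dependencies:          # decode consumes KVCache produced by prefill
      - prefill
```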
Stable Example

```yaml
roles:
  - name: prefill
    replicas: 3
    rolloutStrategy:
      type: InplaceIfPossible
      maxUnavailable: 1
```

Coordination Example
```yaml
coordination:
  - name: prefill-decode-co-update
    type: RollingUpdate
    roles:
      - prefill
      - decode
    strategy:
      maxUnavailable: 5%
      maxSkew: 1%
      partition: 20%
```

Deploying a PD‑Separated Service with Mooncake
The deployment consists of the following roles:
SGLang Router: Unified request entry point and traffic scheduler.
Prefill Serving Backend: Generates the initial KVCache for prompts.
Decode Serving Backend: Performs token‑by‑token generation using the cached KVCache.
Mooncake Master/Store: Provides external KVCache storage, enabling cache sharing across requests and removing the GPU memory ceiling.
Using the RBG YAML (see the repository link), the entire system can be launched with a single command. The YAML defines role replicas, rollout strategies, and the Mooncake image.
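A condensed skeleton of such a spec might look like the following. This is a sketch, not the authoritative manifest: the API group, role names, images, and replica counts are illustrative assumptions, so refer to the repository YAML for the real definition:

```yaml
# Condensed, illustrative skeleton; see the repository YAML for the
# authoritative version. Images and replica counts are placeholders.
apiVersion: workloads.x-k8s.io/v1alpha1   # assumed API group; check the installed CRD
kind: RoleBasedGroup
metadata:
  name: sglang-pd-with-mooncake-demo
spec:
  roles:
    - name: router            # SGLang Router: request entry and traffic scheduling
      replicas: 1
    - name: prefill           # generates the initial KVCache
      replicas: 3
      rolloutStrategy:
        type: InplaceIfPossible
    - name: decode            # token-by-token generation
      replicas: 3
      rolloutStrategy:
        type: InplaceIfPossible
    - name: mooncake          # external KVCache master/store
      replicas: 2
```

Applying a manifest like this with `kubectl apply -f` brings up all roles in one step, with RBG handling ordering and service discovery between them.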
```shell
kubectl patch rolebasedgroup sglang-pd-with-mooncake-demo \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/roles/1/template/spec/containers/0/image", "value": "lmsysorg/sglang:v0.5.6"}]'
```

After applying the patch, only the Mooncake containers restart, preserving pod IPs and topology information thanks to the in‑place upgrade.
Benchmark Results
Multi‑turn dialogue benchmarks show the impact of hierarchical caching:
Baseline (GPU only) : TTFT 5.91 s, P90 12.16 s, 6.58 k token/s.
L2 DRAM HiCache : KVCache hit rate 40.62 %, TTFT ↓36.2 % to 3.77 s, throughput ↑52.9 % to 10.05 k token/s.
L3 Mooncake: Hit rate improves further; TTFT drops 56.3 % versus the GPU‑only baseline to 2.58 s, P90 drops 42.7 % to 6.97 s, and throughput rises a further 49.4 % over the L2 configuration to 15.02 k token/s.
These results demonstrate that multi‑level caching dramatically reduces latency and increases token throughput, especially in long‑context, RAG, and AI‑Agent scenarios.
Seamless Upgrade with Mooncake Persistence
Traditional rolling updates cause KVCache loss, leading to P99 latency spikes and throughput cliffs. Mooncake now supports persisting KVCache metadata to shared memory or local NVMe (PR #1031). Combined with RBG’s in‑place upgrade, the cache survives container restarts, eliminating cache‑miss penalties.
During an upgrade from lmsysorg/sglang:v0.5.5 to v0.5.6, only the Mooncake Store pods were restarted once, and their network/topology remained unchanged, confirming successful state preservation.
Conclusion and Outlook
RBG redefines LLM inference orchestration by treating roles as first‑class citizens, enabling deterministic operations, coordinated upgrades, and topology‑aware scheduling.
Mooncake unlocks unlimited KVCache capacity, achieving up to 56 % TTFT reduction and sustained GPU utilization.
Hierarchical caching (GPU → DRAM → Mooncake) is essential for long‑context inference and cost‑effective scaling.
The combined RBG + Mooncake approach demonstrates that deep integration of high‑performance system design with cloud‑native operations is the key to production‑grade AI infrastructure.
Future work includes extending the declarative API for new routing layers, further optimizing RDMA pathways, and expanding community contributions.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
