Prefill-as-a-Service Boosts LLM Inference Throughput by 54%
A joint Moonshot AI and Tsinghua study shows that the Prefill-as-a-Service (PrfaaS) architecture, enabled by hybrid‑attention models that shrink KVCache size, can offload long Prefill work to a remote cluster and, with dual‑timescale scheduling, achieve a 54% throughput gain over homogeneous PD deployment and 32% over naive heterogeneous setups.
Prefill‑as‑a‑Service (PrfaaS)
PrfaaS moves the Prefill stage of large‑language‑model (LLM) inference to a remote high‑compute cluster, allowing the Prefill and Decode stages to run in different data‑centers.
Why cross‑data‑center?
In traditional Prefill‑Decode (PD) disaggregation, the Prefill and Decode stages share the same rack because the KVCache generated by Prefill must be streamed to the Decode instances: a single 32 K‑token request on a dense model (e.g., MiniMax‑M2.5) can require up to ~60 Gbps of transfer bandwidth. Ordinary Ethernet cannot sustain that rate, so heterogeneous inference has been limited by KVCache transfer costs.
Hybrid‑attention models reduce KVCache bandwidth
Hybrid‑attention architectures keep only a few full‑attention layers and replace the rest with linear attention (LA) or sliding‑window attention (SWA). LA layers maintain a fixed‑size state and SWA layers cap their cache at the window size, so the KVCache no longer grows linearly with input length, dramatically lowering the bandwidth needed to move it.
MiniMax‑M2.5 (dense, all GQA) – ~60 Gbps
Qwen3‑235B (dense, all MLA) – ~33 Gbps
Qwen3.5‑397B (3:1 LA:Full) – ~8 Gbps
MiMo‑V2‑Flash (5:1 SWA:Full) – ~4.7 Gbps
Ring‑2.5‑1T (7:1 LA:Full) – lower still, up to a 36× reduction compared with dense models
These reductions make ordinary Ethernet viable for cross‑cluster KVCache transfer.
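To make the scaling concrete, here is a rough back‑of‑envelope sketch of per‑request KVCache transfer bandwidth for a dense model versus a 3:1 hybrid. The layer counts, head configuration, state size, and transfer window are illustrative assumptions, not the actual configurations of the models listed above.

```python
# Back-of-envelope estimate of per-request KVCache transfer bandwidth for a
# dense model vs. a 3:1 hybrid-attention model. All model parameters here are
# illustrative assumptions, not the configs of the models named above.

def kvcache_bytes(tokens, full_attn_layers, kv_heads, head_dim,
                  dtype_bytes=2, other_layers=0, fixed_state_bytes_per_layer=0):
    """KVCache bytes produced by a single request.

    Full-attention layers store K and V per token, so they grow linearly with
    the token count; LA/SWA layers are modeled as a fixed per-layer state that
    is independent of sequence length.
    """
    growing = tokens * full_attn_layers * 2 * kv_heads * head_dim * dtype_bytes
    fixed = other_layers * fixed_state_bytes_per_layer
    return growing + fixed

def required_gbps(cache_bytes, transfer_window_s):
    """Bandwidth needed to ship the cache within a given time window."""
    return cache_bytes * 8 / transfer_window_s / 1e9

tokens = 32_000      # the article's 32K-token example request
window_s = 1.0       # assumed: overlap the transfer with ~1 s of Prefill time

# Hypothetical dense model: all 60 layers use full attention (fp16 KV).
dense = kvcache_bytes(tokens, full_attn_layers=60, kv_heads=8, head_dim=128)

# Hypothetical 3:1 hybrid: 15 full-attention layers, 45 layers holding a small
# fixed-size linear-attention state (assumed ~2 MiB per layer).
hybrid = kvcache_bytes(tokens, full_attn_layers=15, kv_heads=8, head_dim=128,
                       other_layers=45,
                       fixed_state_bytes_per_layer=2 * 1024 * 1024)

print(f"dense : {dense / 1e9:.1f} GB -> {required_gbps(dense, window_s):.0f} Gbps")
print(f"hybrid: {hybrid / 1e9:.1f} GB -> {required_gbps(hybrid, window_s):.0f} Gbps")
```

With these toy numbers the dense case lands near the ~60 Gbps figure quoted above and the hybrid case drops by roughly 4×; the exact figures depend on the real layer counts, head configurations, and how much of the Prefill time the transfer can overlap.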
Core design of PrfaaS
PrfaaS employs selective offloading: a request is sent to the remote Prefill cluster only if its incremental Prefill length (after prefix cache hits) exceeds a threshold t. Shorter requests remain in the local PD cluster.
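A minimal sketch of this routing rule, assuming a hypothetical prefix‑cache lookup that reports how many leading tokens are already cached; the names are illustrative, and only the threshold value comes from the paper's reported optimum (19.4 K tokens).

```python
# Minimal sketch of PrfaaS selective offloading. The Request type and the
# prefix-cache interface are hypothetical; only the threshold value reflects
# the paper's reported optimum.
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    prompt_tokens: int            # total prompt length in tokens

OFFLOAD_THRESHOLD = 19_400        # tokens; the article's searched optimum t

def incremental_prefill_len(req: Request, prefix_cache_hit_tokens: int) -> int:
    """Tokens that still need Prefill after reusing the prefix cache."""
    return max(req.prompt_tokens - prefix_cache_hit_tokens, 0)

def route(req: Request, prefix_cache_hit_tokens: int) -> str:
    """Offload only long incremental Prefill to the remote PrfaaS cluster."""
    if incremental_prefill_len(req, prefix_cache_hit_tokens) > OFFLOAD_THRESHOLD:
        return "remote_prefill_cluster"
    return "local_pd_cluster"

# Example: a 32K-token prompt with an 8K-token prefix-cache hit still has 24K
# tokens of incremental Prefill, so it is offloaded.
print(route(Request("r1", 32_000), prefix_cache_hit_tokens=8_000))
```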
Compute subsystem
Remote PrfaaS cluster with high‑compute GPUs (e.g., NVIDIA H200) dedicated to long‑context Prefill.
Local PD cluster with conventional GPUs (e.g., NVIDIA H20) handling short Prefill and all Decode.
Network subsystem
RDMA interconnect within each cluster.
Ordinary Ethernet (VPC peering or dedicated line) between clusters.
Storage subsystem
Hybrid prefix cache pool that manages two KVCache types:
Fixed‑size cache blocks for linear‑attention layers (exact‑match reusable).
Variable‑size cache blocks for full‑attention layers (prefix‑matchable).
Both block types share a common memory pool; a global KVCache manager tracks metadata and informs routing decisions.
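A sketch of what such a pool could look like, showing the two block types and a global manager that answers the router's question of how much prefix is reusable. Class names, fields, and the hashing scheme are hypothetical, not the paper's implementation.

```python
# Sketch of a hybrid prefix-cache pool: fixed-size linear-attention blocks
# (exact-match reuse only) and variable-size full-attention blocks (prefix
# reuse), tracked by one global manager. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class LinearAttnBlock:
    """Fixed-size recurrent state; reusable only on an exact prefix match."""
    prefix_hash: str
    state_bytes: int

@dataclass
class FullAttnBlock:
    """Variable-size KV block; reusable whenever it covers a prefix of the
    incoming prompt."""
    prefix_hash: str
    num_tokens: int
    kv_bytes: int

@dataclass
class HybridPrefixCachePool:
    """Global KVCache manager: one shared pool, two block types, and the
    metadata needed to tell the router how much prefix is reusable."""
    linear_blocks: dict = field(default_factory=dict)  # prefix_hash -> LinearAttnBlock
    full_blocks: dict = field(default_factory=dict)    # prefix_hash -> FullAttnBlock

    def prefix_hit_tokens(self, prompt_prefix_hashes):
        """Longest reusable cached prefix, in tokens.

        prompt_prefix_hashes[i] is assumed to be the hash of the first i+1
        block-sized chunks of the prompt.
        """
        best = 0
        for h in prompt_prefix_hashes:
            full = self.full_blocks.get(h)
            # A prefix is only usable if BOTH the full-attention KV and the
            # exact-match linear-attention state exist for it.
            if full is not None and h in self.linear_blocks:
                best = max(best, full.num_tokens)
        return best
```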
Dual‑time‑scale scheduling
Short‑term scheduling routes a request to the remote cluster when its incremental Prefill length > t. The scheduler also monitors the remote cluster’s egress link utilization and queue depth, preferring local processing when bandwidth is scarce and allowing cross‑cluster cache migration when bandwidth is abundant.
Long‑term scheduling adapts to traffic patterns by recomputing the optimal Prefill‑to‑Decode instance ratio Np/Nd and updating the routing threshold t as workload characteristics shift.
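The two timescales might be combined along the following lines. The monitoring hooks, thresholds, and retuning heuristics are placeholders standing in for the paper's actual bandwidth telemetry and parameter search.

```python
# Sketch of a dual-time-scale scheduler. Hooks (egress utilization, queue
# depth) and the retuning heuristics are placeholders, not the paper's code.
import statistics

class DualTimescaleScheduler:
    def __init__(self, threshold_tokens=19_400, link_capacity_gbps=100.0):
        self.t = threshold_tokens               # routing threshold, retuned slowly
        self.link_capacity = link_capacity_gbps
        self.recent_lengths = []                # observed incremental Prefill lengths

    # -- short-term: per-request, bandwidth- and queue-aware routing --
    def route(self, incremental_tokens, egress_gbps, remote_queue_depth):
        self.recent_lengths.append(incremental_tokens)
        bandwidth_scarce = egress_gbps > 0.8 * self.link_capacity
        remote_busy = remote_queue_depth > 32
        if incremental_tokens > self.t and not bandwidth_scarce and not remote_busy:
            return "remote_prefill"
        return "local_pd"

    # -- long-term: retune t and the Prefill/Decode instance split --
    def retune(self, total_local_instances):
        """Placeholder for the paper's workload-driven parameter search."""
        if not self.recent_lengths:
            return self.t, (1, total_local_instances - 1)
        median_len = statistics.median(self.recent_lengths)
        # Drift t toward the recent median so the offloaded share tracks the
        # traffic mix instead of a stale threshold.
        self.t = int(0.7 * self.t + 0.3 * median_len)
        # Heavier Prefill traffic -> allocate more local instances to Prefill.
        prefill_share = min(max(median_len / 40_000, 0.2), 0.6)
        n_p = max(1, round(prefill_share * total_local_instances))
        self.recent_lengths.clear()
        return self.t, (n_p, total_local_instances - n_p)
```

The short‑term decision runs per request, while retune would run on a much slower cadence, as traffic patterns shift.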
Experimental evaluation
The authors evaluated a 1‑trillion‑parameter hybrid model (Kimi Linear architecture, 3:1 KDA:MLA layer ratio) under realistic workloads.
Hardware configuration
PrfaaS cluster: 32 × H200 GPUs (high‑compute Prefill).
Local PD cluster: 64 × H20 GPUs, 800 Gbps RDMA.
Cross‑cluster link: ~100 Gbps VPC network.
Baseline: 96 × H20 GPUs in a homogeneous PD cluster.
Workload characteristics
Input length follows a log‑normal distribution, mean ≈ 27 K tokens, range 128–128 K.
Output length fixed at 1,024 tokens.
SLO: 40 tokens / s.
Parameter search identified the optimal routing threshold t = 19.4 K tokens. The local PD cluster runs 3 Prefill instances and 5 Decode instances, with roughly 50 % of long requests offloaded to the PrfaaS cluster.
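As a rough sanity check on that offload share, one can sample the stated log‑normal input‑length distribution and count requests exceeding t. The distribution's shape parameter is not given in the article, so the sigma below is an assumption, and prefix‑cache hits are ignored for simplicity.

```python
# Estimate the offload fraction implied by the stated workload: input lengths
# from a log-normal with mean ~27K tokens, clipped to 128..128K, routed
# remotely above t = 19.4K. SIGMA is an assumed value, not from the article.
import math
import random

MEAN_TOKENS = 27_000     # stated mean of the input-length distribution
SIGMA = 0.8              # assumed shape parameter of the log-normal
THRESHOLD = 19_400       # the searched routing threshold t

# Pick mu so the log-normal's mean equals MEAN_TOKENS:
# mean = exp(mu + sigma^2 / 2)  =>  mu = ln(mean) - sigma^2 / 2
mu = math.log(MEAN_TOKENS) - SIGMA ** 2 / 2

random.seed(0)
samples = [min(max(random.lognormvariate(mu, SIGMA), 128), 128_000)
           for _ in range(100_000)]
offloaded = sum(length > THRESHOLD for length in samples) / len(samples)
print(f"estimated share of requests routed to the PrfaaS cluster: {offloaded:.0%}")
```

With this assumed sigma roughly half of the sampled requests exceed the threshold, in line with the reported offload share; the true fraction depends on the actual distribution parameters and on prefix‑cache hit rates.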
Key results:
Throughput improvement relative to the homogeneous PD baseline: +54 % for PrfaaS, roughly 32 % better than a naïve heterogeneous scheme that sends all Prefill to the H200 GPUs.
P90 time‑to‑first‑token (TTFT) improves by 64 % over the homogeneous baseline.
Average cross‑cluster egress bandwidth for the PrfaaS cluster is 13 Gbps, i.e., 13 % of the 100 Gbps link, leaving ample headroom.
The naïve heterogeneous approach (all Prefill to H200) yields only a 16 % throughput gain over the homogeneous baseline, highlighting the importance of selective offloading and the dual‑time‑scale scheduling.
Implications
Hybrid‑attention models (Kimi Linear, Qwen3.5, MiMo‑V2‑Flash, Ring‑2.5‑1T) shrink KVCache enough to make cross‑data‑center inference practical.
Specialized hardware (e.g., NVIDIA Rubin CPX for Prefill, Groq LPU for Decode, Taalas HC1 for memory bandwidth) can be deployed independently, without forcing heterogeneous chips into a single RDMA cluster.
Even at ten‑thousand‑GPU scale, cross‑cluster bandwidth requirements remain in the Tbps range, enabling cost‑optimized placement of Prefill clusters in compute‑cheap locations and Decode clusters near end‑users.
Conclusion
Next‑generation hybrid‑attention LLMs produce KVCache small enough for Ethernet‑scale transfer, but practical cross‑data‑center inference requires a system that combines selective offloading, bandwidth‑aware short‑term scheduling, and traffic‑driven long‑term resource reallocation. The PrfaaS design demonstrates that coupling model architecture with such infrastructure yields a 54 % throughput gain over homogeneous PD deployments.