Prefill-as-a-Service Boosts LLM Inference Throughput by 54%

A joint Moonshot AI and Tsinghua study shows that the Prefill-as-a-Service (PrfaaS) architecture, enabled by hybrid‑attention models that shrink KVCache size, can offload long Prefill work to a remote cluster and, with dual‑timescale scheduling, achieve a 54% throughput gain over homogeneous PD deployment and 32% over naive heterogeneous setups.

Old Zhang's AI Learning

Prefill‑as‑a‑Service (PrfaaS)

PrfaaS moves the Prefill stage of large‑language‑model (LLM) inference to a remote high‑compute cluster, allowing the Prefill and Decode stages to run in different data‑centers.

Why cross‑data‑center?

In traditional Prefill‑Decode (PD) disaggregation, the Prefill and Decode stages share the same rack because the KVCache generated by Prefill can demand up to ~60 Gbps of transfer bandwidth for a single 32K‑token request on dense models (e.g., MiniMax‑M2.5). Ordinary Ethernet cannot sustain that rate, so heterogeneous inference has been limited by KVCache transfer costs.

Hybrid‑attention models reduce KVCache bandwidth

Hybrid‑attention architectures keep only a few full‑attention layers while replacing the rest with linear attention (LA) or sliding‑window attention (SWA). LA and SWA layers maintain a fixed‑size KVCache that does not grow with input length, dramatically lowering the bandwidth needed to transfer the KVCache.

MiniMax‑M2.5 (dense, all GQA) – ~60 Gbps

Qwen3‑235B (dense, all MLA) – ~33 Gbps

Qwen3.5‑397B (3:1 LA:Full) – ~8 Gbps

MiMo‑V2‑Flash (5:1 SWA:Full) – ~4.7 Gbps

Ring‑2.5‑1T (7:1 LA:Full) – much lower, up to 36× reduction compared with dense models

These reductions make ordinary Ethernet viable for cross‑cluster KVCache transfer.
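The per‑request rates above can be sanity‑checked with back‑of‑envelope arithmetic: the full‑attention KVCache holds one K and one V tensor per layer, sized kv_heads × head_dim per token, and only the full‑attention layers grow with sequence length. The sketch below uses illustrative dimensions (not the actual configurations of the models listed) to show why cutting the number of growing layers cuts the transfer volume proportionally.

```python
def kvcache_gbits(full_layers: int, kv_heads: int, head_dim: int,
                  tokens: int, bytes_per_elem: int = 2) -> float:
    """Rough KVCache size in gigabits: 2 tensors (K and V) per growing layer,
    kv_heads * head_dim elements per token, bytes_per_elem for fp16/bf16."""
    bits = 2 * full_layers * kv_heads * head_dim * tokens * bytes_per_elem * 8
    return bits / 1e9

# Illustrative dimensions: a dense model where all 60 layers keep a growing
# KVCache vs. a 3:1 hybrid where only 15 full-attention layers do.
dense = kvcache_gbits(full_layers=60, kv_heads=8, head_dim=128, tokens=32_000)
hybrid = kvcache_gbits(full_layers=15, kv_heads=8, head_dim=128, tokens=32_000)
print(f"dense ~{dense:.1f} Gbit, hybrid ~{hybrid:.1f} Gbit per 32K request")
```

With these assumed dimensions, a 32K‑token request carries roughly 63 Gbit of KVCache on the dense model and a quarter of that on the 3:1 hybrid, so if Prefill finishes in about a second, the required link bandwidth drops from tens of Gbps to single digits.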

Core design of PrfaaS

PrfaaS employs selective offloading: a request is sent to the remote Prefill cluster only if its incremental Prefill length (the tokens still to be prefilled after prefix‑cache hits) exceeds a threshold t. Shorter requests remain in the local PD cluster.

Compute subsystem

Remote PrfaaS cluster with high‑compute GPUs (e.g., NVIDIA H200) dedicated to long‑context Prefill.

Local PD cluster with conventional GPUs (e.g., NVIDIA H20) handling short Prefill and all Decode.

Network subsystem

RDMA interconnect within each cluster.

Ordinary Ethernet (VPC peering or dedicated line) between clusters.

Storage subsystem

Hybrid prefix cache pool that manages two KVCache types:

Fixed‑size cache blocks for linear‑attention layers (exact‑match reusable).

Variable‑size cache blocks for full‑attention layers (prefix‑matchable).

Both block types share a common memory pool; a global KVCache manager tracks metadata and informs routing decisions.
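The distinction between the two block types can be sketched as metadata over a shared pool: full‑attention blocks are keyed by cumulative block‑hash chains so that any shared prefix can be reused, while linear‑attention state is a fixed‑size summary that is only valid on an exact prefix match. All names and the block size below are hypothetical, not from the paper.

```python
from dataclasses import dataclass, field

BLOCK = 256  # tokens per full-attention cache block (illustrative)

@dataclass
class HybridPrefixCache:
    """Metadata for both KVCache kinds over one shared memory pool."""
    la_exact: set = field(default_factory=set)    # exact-prefix hashes (fixed-size LA state)
    fa_blocks: set = field(default_factory=set)   # cumulative block-hash chains (prefix-matchable)

    def insert(self, tokens: tuple) -> None:
        chain = ()
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            chain += (hash(tokens[i:i + BLOCK]),)
            self.fa_blocks.add(chain)            # every prefix of blocks is reusable
        self.la_exact.add(hash(tokens))          # LA state reusable only on exact match

    def prefix_hit(self, tokens: tuple) -> int:
        """Longest cached prefix, in tokens, using full-attention blocks."""
        chain, hit = (), 0
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            chain += (hash(tokens[i:i + BLOCK]),)
            if chain in self.fa_blocks:
                hit = i + BLOCK
            else:
                break
        return hit
```

The prefix‑hit count feeds directly into routing: it is what determines a request's incremental Prefill length.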

Dual‑time‑scale scheduling

Short‑term scheduling routes a request to the remote cluster when its incremental Prefill length > t. The scheduler also monitors the remote cluster’s egress link utilization and queue depth, preferring local processing when bandwidth is scarce and allowing cross‑cluster cache migration when bandwidth is abundant.
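The short‑term policy reduces to a per‑request decision combining the length threshold with the link and queue signals. A minimal sketch, where the caps are assumed constants (the paper describes the policy, not these values):

```python
def short_term_decision(incremental_tokens: int, threshold_t: int,
                        egress_util: float, queue_depth: int) -> str:
    """Route one request given the length threshold and remote-link signals.
    UTIL_CAP, QUEUE_CAP, and MIGRATE_UTIL are illustrative constants."""
    UTIL_CAP, QUEUE_CAP, MIGRATE_UTIL = 0.8, 32, 0.3
    if incremental_tokens <= threshold_t:
        return "local"                 # short request: local PD cluster handles it
    if egress_util > UTIL_CAP or queue_depth > QUEUE_CAP:
        return "local"                 # bandwidth scarce: prefer local processing
    if egress_util < MIGRATE_UTIL:
        return "remote+migrate_cache"  # bandwidth abundant: also migrate KVCache
    return "remote"
```

For example, a 30K‑token incremental Prefill with a half‑utilized egress link would be offloaded, while the same request would stay local once the link approaches saturation.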

Long‑term scheduling adapts to traffic patterns by recomputing the optimal Prefill‑to‑Decode instance ratio Np/Nd and updating the routing threshold t as workload characteristics shift.
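One plausible way to recompute Np/Nd is to balance per‑request Prefill and Decode demand under the observed traffic mix. The sketch below assumes simple per‑instance token throughputs; it is an illustration of the idea, not the paper's actual optimizer.

```python
def rebalance(mean_input_tokens: float, mean_output_tokens: float,
              prefill_tps_per_instance: float,
              decode_tps_per_instance: float) -> float:
    """Target Np/Nd ratio: match per-request Prefill demand (instance-seconds
    of Prefill per request) against Decode demand (illustrative model)."""
    prefill_demand = mean_input_tokens / prefill_tps_per_instance
    decode_demand = mean_output_tokens / decode_tps_per_instance
    return prefill_demand / decode_demand

# E.g. ~27K-token inputs, 1,024-token outputs, and assumed per-instance
# throughputs of 100K Prefill tokens/s vs 2K Decode tokens/s:
ratio = rebalance(27_000, 1_024, 100_000, 2_000)  # ~0.53, i.e. roughly 1 Prefill per 2 Decode
```

As the workload's input/output mix shifts, the same calculation yields a new ratio, and the routing threshold t is re‑searched alongside it.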

Experimental evaluation

The authors evaluated a 1‑trillion‑parameter hybrid model (Kimi Linear architecture, 3:1 KDA:MLA layer ratio) under realistic workloads.

Hardware configuration

PrfaaS cluster: 32 × H200 GPUs (high‑compute Prefill).

Local PD cluster: 64 × H20 GPUs, 800 Gbps RDMA.

Cross‑cluster link: ~100 Gbps VPC network.

Baseline: 96 × H20 GPUs in a homogeneous PD cluster.

Workload characteristics

Input length follows a log‑normal distribution, mean ≈ 27 K tokens, range 128–128 K.

Output length fixed at 1,024 tokens.

SLO: 40 tokens/s.

Parameter search identified the optimal routing threshold t = 19.4 K tokens. The local PD cluster runs 3 Prefill instances and 5 Decode instances, with roughly 50 % of long requests offloaded to the PrfaaS cluster.

Key results:

Throughput improves by 54% over the homogeneous PD baseline, which is also a 32% gain over a naïve heterogeneous scheme that sends all Prefill to the H200 GPUs.

P90 time‑to‑first‑token (TTFT) latency improves by 64% over the homogeneous baseline.

Average cross‑cluster egress bandwidth from the PrfaaS cluster is 13 Gbps, i.e., 13% of the 100 Gbps link, leaving ample headroom.

The naïve heterogeneous approach (all Prefill to H200) yields only a 16% throughput gain over the homogeneous baseline, highlighting the importance of selective offloading and dual‑time‑scale scheduling.

Implications

Hybrid‑attention models (Kimi Linear, Qwen3.5, MiMo‑V2‑Flash, Ring‑2.5‑1T) shrink KVCache enough to make cross‑data‑center inference practical.

Specialized hardware (e.g., NVIDIA Rubin CPX for Prefill, Groq LPU for Decode, Taalas HC1 for memory bandwidth) can be deployed independently, without forcing heterogeneous chips into a single RDMA cluster.

Even at ten‑thousand‑GPU scale, cross‑cluster bandwidth requirements remain in the Tbps range, enabling cost‑optimized placement of Prefill clusters in compute‑cheap locations and Decode clusters near end‑users.

Conclusion

Next‑generation hybrid‑attention LLMs produce KVCache small enough for Ethernet‑scale transfer, but practical cross‑data‑center inference requires a system that combines selective offloading, bandwidth‑aware short‑term scheduling, and traffic‑driven long‑term resource reallocation. The PrfaaS design demonstrates that coupling model architecture with such infrastructure yields a 54 % throughput gain over homogeneous PD deployments.

Tags: Distributed inference, Scheduling, LLM inference, Hybrid attention, KVCache optimization, Prefill-as-a-Service
Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
