Can Linear Attention Enable Prefill-as-a-Service for Cross‑Datacenter Heterogeneous PD Separation?
The article analyzes why the massive KVCache bandwidth required by heterogeneous prefill/decode (PD) separation cannot be solved at the system level, proposes a Prefill‑as‑a‑Service architecture that leverages linear‑attention models to cut KVCache generation, and validates the design with a 1‑trillion‑parameter Kimi Linear deployment that achieves 54% higher throughput and 64% lower P90 TTFT across a 100 Gbps inter‑datacenter link.
Problem Statement
Heterogeneous prefill/decode (PD) separation faces two fundamental obstacles: the massive bandwidth needed to move the KVCache produced during the prefill stage, and the operational complexity of mixing heterogeneous devices within a single datacenter. System‑level optimizations alone cannot bridge the bandwidth gap.
Prefill‑as‑a‑Service (PaaS) Architecture
The proposed architecture offloads the compute‑intensive prefill phase to a dedicated service. The service returns token IDs and the generated KVCache to downstream decoder instances, enabling decoders to run on separate, possibly lower‑memory GPUs.
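To make the interface concrete, here is a minimal sketch of what such a prefill service might expose. The type and field names (PrefillRequest, PrefillResult, kv_cache_blocks) are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass

@dataclass
class PrefillRequest:
    request_id: str
    prompt_token_ids: list[int]   # tokenized prompt to prefill

@dataclass
class PrefillResult:
    request_id: str
    first_token_id: int           # first generated token ID
    kv_cache_blocks: list[bytes]  # serialized KVCache pages shipped to the decoder

def prefill(req: PrefillRequest) -> PrefillResult:
    """Run the compute-heavy prefill on the PaaS cluster and return the
    token ID and KVCache a decode instance needs to continue generation."""
    raise NotImplementedError  # placeholder: the real service runs the model forward pass
```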
KVCache Transmission Bottleneck
Using the Minimax M2.5 GQA model as a reference, a single prefill instance can produce KVCache at up to 60 Gbps. Scaling to many concurrent instances quickly exceeds the capacity of typical cross‑datacenter links, even though modern RDMA networks can handle high intra‑datacenter speeds.
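A back-of-envelope calculation shows where a figure of this magnitude can come from. The model shape and prefill throughput below are assumed for illustration, not Minimax M2.5's published configuration:

```python
# Per-instance KVCache egress estimate (all parameters are assumptions).
num_layers    = 60      # transformer layers
num_kv_heads  = 8       # GQA: KV heads shared across query heads
head_dim      = 128
bytes_per_val = 2       # fp16/bf16

# K and V caches per token, summed across all layers
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_val
print(kv_bytes_per_token)            # 245760 bytes ≈ 240 KiB per token

prefill_tokens_per_sec = 30_000      # assumed prefill throughput
egress_gbps = kv_bytes_per_token * prefill_tokens_per_sec * 8 / 1e9
print(f"{egress_gbps:.0f} Gbps")     # ≈ 59 Gbps, the same order as the
                                     # ~60 Gbps figure cited above
```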
Linear‑Attention Co‑Design
Linear‑attention mechanisms maintain a fixed‑size recurrent state instead of a KVCache that grows linearly with sequence length; models built on them (e.g., Kimi Linear, MiMo, Ring) reduce per‑instance KVCache generation by more than tenfold in long‑text scenarios. This reduction makes a small number of dedicated inter‑datacenter fibers sufficient for KVCache transfer.
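A rough comparison illustrates the gap. The shapes below are assumptions, not any specific model's configuration; hybrid designs like Kimi Linear also retain some softmax layers, so their real reduction is smaller than this idealized ratio:

```python
# Fixed-size linear-attention state vs. a softmax KVCache that grows
# with context length (all shapes assumed for illustration).
seq_len       = 128_000          # long-context prompt
num_layers    = 60
num_kv_heads  = 8
head_dim      = 128
bytes_per_val = 2

# Standard attention: K and V stored for every token at every layer.
softmax_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_val * seq_len

# Pure linear attention: one head_dim x head_dim state matrix per head per
# layer, independent of sequence length.
linear_state_bytes = num_layers * num_kv_heads * head_dim * head_dim * bytes_per_val

print(softmax_cache_bytes / linear_state_bytes)  # ~2000x at this length; even
                                                 # a hybrid stack clears >10x
```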
Routing Policy for Heterogeneous PD
Because bursty request patterns and uneven sequence lengths can still cause congestion, the authors propose a length‑aware routing policy:
Short incremental prefill requests are processed locally.
Long prefill requests may be forwarded to the remote PaaS.
The policy also accounts for network congestion and KVCache reuse hit rates.
Detailed design choices are described in Section 3 of the cited paper (https://arxiv.org/html/2604.15039v1#S3); a minimal illustrative sketch of the policy follows.
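The sketch below shows one way the three signals (incremental length, congestion, cache reuse) might combine; the thresholds and signal names are assumptions, not the paper's actual values:

```python
# Hypothetical length-aware routing policy (illustrative only).
def route_prefill(prompt_len: int,
                  cached_prefix_len: int,
                  link_congested: bool,
                  local_threshold: int = 4_096) -> str:
    """Decide whether a prefill request runs locally or on the remote PaaS."""
    new_tokens = prompt_len - cached_prefix_len   # only the incremental work matters

    # Short incremental prefills: network round-trip outweighs compute savings.
    if new_tokens <= local_threshold:
        return "local"
    # High KVCache reuse keeps the incremental cost small; stay local.
    if cached_prefix_len / max(prompt_len, 1) > 0.8:
        return "local"
    # Back off to local compute when the inter-datacenter link is congested.
    if link_congested:
        return "local"
    # Long, mostly-cold prefills are worth shipping to the remote PaaS.
    return "remote_paas"
```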
Experimental Evaluation
The Kimi Linear architecture was scaled to a 1‑trillion‑parameter model (comparable to the K2 series) and deployed on a heterogeneous cluster consisting of NVIDIA H200 and H20 GPUs, connected by a dedicated 100 Gbps link.
Compared with a homogeneous datacenter configuration, the heterogeneous PaaS setup achieved:
54% higher overall serving throughput
64% lower P90 time‑to‑first‑token (TTFT)
In this configuration the PaaS cluster’s total egress bandwidth was approximately 13 Gbps, i.e., about 13% of the 100 Gbps inter‑cluster capacity.
A baseline that sent all prefill work to the high‑compute cluster without length‑aware routing yielded only a 1.16× throughput gain, far below the 1.54× improvement of the full solution.
Conclusion
The results demonstrate that co‑designing linear‑attention models with a Prefill‑as‑a‑Service architecture can make heterogeneous PD feasible across datacenter boundaries. The current study is a proof‑of‑concept; further rigorous analysis and optimization are required for production deployment.