Prefill-as-a-Service Boosts LLM Inference Throughput by 54%
A joint Moonshot AI and Tsinghua study shows that the Prefill-as-a-Service (PrfaaS) architecture, enabled by hybrid‑attention models that shrink KVCache size, can offload long Prefill work to a remote cluster and, with dual‑timescale scheduling, achieve a 54% throughput gain over homogeneous PD deployment and 32% over naive heterogeneous setups.
