How DualPath Eliminates Storage Bandwidth Bottlenecks in Agentic LLM Inference
This article analyzes DualPath, an architecture that redesigns KV‑Cache data paths to overcome storage‑NIC saturation in Prefill‑Decode LLM systems. It covers the theoretical model, the main engineering solutions, and offline and online benchmarks showing up to 2.25× higher throughput.
Background
Agentic large‑language‑model (LLM) workloads involve ultra‑long contexts (hundreds of thousands of tokens), KV‑Cache hit rates of at least 95 %, and short‑append generation. In this regime the dominant system bottleneck shifts from GPU computation to I/O: each round of generation must reload the full KV‑Cache of the conversation from storage before appending only a short continuation.
Problem with the conventional Prefill‑Decode architecture
Modern inference pipelines split the work between a Prefill Engine (PE) and a Decode Engine (DE). The PE performs the heavy prefill computation and continuously reads KV‑Cache from storage, fully saturating its storage NIC (≈100 % utilization). The DE, which only performs autoregressive token generation, leaves its NIC almost idle. This asymmetric NIC usage caps overall throughput and wastes available bandwidth.
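To see the scale of the asymmetry, a back‑of‑the‑envelope calculation helps. The sketch below is illustrative only: the NIC speed, per‑token KV footprint, context length, and append length are all assumed values, not figures from the paper.

```python
# Back-of-the-envelope illustration of the NIC asymmetry.
# All constants below are assumptions for illustration, not paper figures.
STORAGE_NIC_GBPS = 400 / 8          # assume a 400 Gb/s storage NIC -> 50 GB/s
KV_BYTES_PER_TOKEN = 70 * 1024      # assumed per-token KV-Cache footprint
CONTEXT_TOKENS = 128_000            # ultra-long agent context
HIT_RATE = 0.95                     # KV-Cache hit rate

# Every round forces the PE to reload the cached prefix from storage.
pe_read_gb = CONTEXT_TOKENS * HIT_RATE * KV_BYTES_PER_TOKEN / 1e9
pe_load_seconds = pe_read_gb / STORAGE_NIC_GBPS

# The DE only appends a short generation, so its storage traffic is tiny.
APPEND_TOKENS = 500
de_write_gb = APPEND_TOKENS * KV_BYTES_PER_TOKEN / 1e9

print(f"PE reads {pe_read_gb:.1f} GB/request ({pe_load_seconds:.2f} s on one NIC)")
print(f"DE writes {de_write_gb:.2f} GB/request -> its NIC sits nearly idle")
```

Even with generous NIC bandwidth, reloading a multi‑gigabyte prefix per round dwarfs the few megabytes the DE appends, which is why the PE NIC saturates while the DE NIC idles.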
DualPath design
DualPath introduces a second, independent data path that exploits the idle DE NIC bandwidth to load KV‑Cache and then transfers the data to the PE over a high‑speed RDMA link. The two paths are:
PE Read Path: Storage → PE Buffer → PE GPU → DE Buffer (traditional).
DE Read Path: Storage → DE Buffer → PE GPU (RDMA) → DE Buffer (new).
By aggregating the storage bandwidth of both engines, KV‑Cache loading is no longer limited by the PE's single storage NIC, and the single‑point bottleneck disappears.
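A minimal sketch of how KV‑Cache blocks might be partitioned across the two paths, assuming each engine contributes a storage NIC of known bandwidth; the names (`Block`, `plan_dual_path`) and the proportional split are illustrative, not the paper's actual scheduling policy.

```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int  # one KV-Cache block stored in the remote storage tier

def plan_dual_path(blocks: list[Block], pe_gbps: float, de_gbps: float):
    """Partition KV-Cache blocks across the two read paths in proportion
    to the storage bandwidth each engine can contribute."""
    pe_share = pe_gbps / (pe_gbps + de_gbps)
    cut = round(len(blocks) * pe_share)
    pe_path = blocks[:cut]    # Storage -> PE buffer -> PE GPU
    de_path = blocks[cut:]    # Storage -> DE buffer -(RDMA)-> PE GPU
    return pe_path, de_path

pe_blocks, de_blocks = plan_dual_path([Block(i) for i in range(64)], 50.0, 50.0)
print(len(pe_blocks), "via PE path;", len(de_blocks), "via DE path (RDMA relay)")
```

With equal NIC bandwidth on both sides the split is 50/50, effectively doubling the storage bandwidth available for KV‑Cache loading.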
Key technical challenges and solutions
Fine‑grained data transfer: Layer‑wise prefill combined with a mixed Layer‑Block/Full‑Block layout streams KV‑Cache while computation proceeds, reducing idle time (see the sketch after this list).
Traffic isolation: All GPU traffic passes through the compute NIC; InfiniBand virtual lanes (VLs) and QoS separate KV‑Cache traffic from model communication, preventing interference.
Dynamic load balancing: A two‑level scheduler (inter‑engine and intra‑engine) continuously balances NIC traffic against GPU compute, adapting to varying Prefill/Decode ratios.
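A minimal sketch of the layer‑wise prefill idea referenced above: prefetch the next layer's KV blocks while the current layer's attention runs, so storage I/O hides behind compute. `fetch_layer` and `attend_layer` are hypothetical stand‑ins for the storage read and the GPU kernel, and the layer count is assumed; the single‑worker executor models one in‑flight read.

```python
import concurrent.futures as cf

NUM_LAYERS = 61  # assumed layer count, for illustration only

def fetch_layer(layer: int) -> bytes:
    return b"kv"  # placeholder for a storage read of one layer's KV blocks

def attend_layer(layer: int, kv: bytes) -> None:
    pass  # placeholder for the attention kernel over that layer's KV

def layerwise_prefill() -> None:
    with cf.ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch_layer, 0)      # prime the pipeline
        for layer in range(NUM_LAYERS):
            kv = pending.result()                # wait for this layer's KV
            if layer + 1 < NUM_LAYERS:
                pending = io.submit(fetch_layer, layer + 1)  # overlap next read
            attend_layer(layer, kv)              # compute hides the I/O

layerwise_prefill()
```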
Theoretical analysis
Mathematical modeling of a typical node (8 GPUs, 1 storage NIC) shows that DualPath can operate without bottlenecks across a wide range of Prefill/Decode ratios; a toy version of the feasibility check appears after this list. The model predicts:
Full utilization of the storage NIC.
No DRAM or compute‑NIC limits.
Zero network congestion.
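The toy check below mirrors the shape of that argument, assuming per‑engine bandwidth figures that are purely illustrative; the paper's actual model and constants are not reproduced here.

```python
# Toy feasibility check across Prefill/Decode ratios.
# Every constant is an assumption for illustration, not a paper value.
STORAGE_NIC = 50.0    # GB/s storage NIC per engine node
RDMA_NIC = 100.0      # GB/s compute NIC (RDMA) per engine node
DRAM_BW = 400.0       # GB/s DRAM staging bandwidth per engine node

def dualpath_ok(kv_demand: float, n_pe: int, n_de: int) -> bool:
    """KV demand (GB/s) is satisfiable iff the aggregated storage NICs
    cover it and the DE-relayed share fits under RDMA and DRAM limits."""
    aggregate = (n_pe + n_de) * STORAGE_NIC
    # Traffic beyond what the PE storage NICs carry must relay via the DE path.
    de_relay = max(0.0, kv_demand - n_pe * STORAGE_NIC)
    return (kv_demand <= aggregate
            and de_relay <= n_de * RDMA_NIC
            and de_relay <= n_de * DRAM_BW)

for n_pe, n_de in [(2, 4), (1, 1), (1, 2)]:
    print(f"{n_pe}P:{n_de}D ->", dualpath_ok(kv_demand=120.0, n_pe=n_pe, n_de=n_de))
```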
Experimental evaluation
Offline inference (RL rollout)
Speed‑up over a baseline Prefill‑Decode system:
DeepSeek‑V3.2 660B (2 Prefill + 4 Decode) – 1.87× faster.
DeepSeek 27B (1 Prefill + 1 Decode) – 1.78× faster.
Qwen2.5‑32B (1 Prefill + 2 Decode) – comparable improvement.
Gains increase with larger batch sizes and longer contexts, and are most pronounced in short‑append generation scenarios typical of agents.
Online service
DeepSeek 27B – 1.67× higher throughput.
DeepSeek 660B – 2.25× higher throughput while keeping first‑token latency ≤ 4 s and per‑token latency ≤ 50 ms.
Large‑scale scaling
On a 1,152‑GPU cluster: offline job‑completion time of 3,201 s, close to linear scaling.
Online service: 22× throughput increase with stable latency.
Ablation study (DeepSeek 660B, 64 K context)
Layer‑wise prefill alone reduces job‑completion time by 17.21 %.
Adding Dual‑Path loading raises the cumulative reduction to 38.19 %.
Adding the two‑level scheduler raises the cumulative reduction to 45.62 %.
Storage‑NIC traffic imbalance improves from 1.53 under round‑robin scheduling to 1.18.
Attention‑layer max/avg execution‑time ratio drops to 1.06, reducing GPU idle bubbles.
Conclusion
DualPath redesigns KV‑Cache pathways, shifting the limiting factor from storage bandwidth back to compute. The approach provides a scalable foundation for long‑context, multi‑round agent systems.
Paper: DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference (https://arxiv.org/pdf/2602.21548)
