How DualPath Eliminates Storage Bandwidth Bottlenecks in Agentic LLM Inference

This article analyzes the DualPath architecture that redesigns KV‑Cache data paths to overcome storage‑NIC saturation in Prefill‑Decode LLM systems, presenting theoretical proofs, detailed engineering solutions, and extensive offline and online benchmarks that demonstrate up to 2.25× performance gains.

PaperAgent

Background

Agentic large‑language‑model (LLM) workloads require ultra‑long contexts (hundreds of thousands of tokens), KV‑Cache hit rates of at least 95 %, and short‑append generation. In this regime the dominant system bottleneck shifts from GPU computation to I/O: each generation step must load the full KV‑Cache from storage.

Problem with the conventional Prefill‑Decode architecture

Modern inference pipelines separate the work into a Prefill Engine (PE) and a Decode Engine (DE). The PE performs the heavy computation and continuously reads KV‑Cache from storage, fully saturating its storage NIC (≈100 % utilization). The DE, which only performs auto‑regressive token generation, leaves its NIC almost idle. This asymmetric NIC usage limits overall throughput and wastes available bandwidth.
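The asymmetry is easy to see with a back‑of‑envelope model. The bandwidth and traffic figures below are illustrative assumptions, not numbers from the paper:

```python
# Toy model of NIC utilization in a conventional Prefill-Decode setup.
# All numbers here are illustrative assumptions, not values from the paper.

STORAGE_NIC_GBPS = 200.0  # per-engine storage-NIC bandwidth (assumed)

def nic_utilization(kv_bytes_per_s: float, nic_gbps: float) -> float:
    """Fraction of NIC bandwidth consumed by KV-Cache traffic (capped at 1)."""
    return min(1.0, (kv_bytes_per_s * 8 / 1e9) / nic_gbps)

# The prefill engine streams the full KV-Cache every step; the decode
# engine only appends a few tokens, so its storage traffic is tiny.
pe_traffic = 30e9   # ~30 GB/s of KV-Cache reads on the PE (assumed)
de_traffic = 0.5e9  # ~0.5 GB/s of KV-Cache reads on the DE (assumed)

print(f"PE NIC utilization: {nic_utilization(pe_traffic, STORAGE_NIC_GBPS):.0%}")
print(f"DE NIC utilization: {nic_utilization(de_traffic, STORAGE_NIC_GBPS):.0%}")
```

With these assumed figures the PE NIC is pinned at 100 % while the DE NIC sits near 2 %, which is exactly the idle capacity DualPath goes after.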

DualPath design

DualPath introduces a second, independent data path that exploits the idle DE NIC bandwidth to load KV‑Cache and then transfers the data to the PE over a high‑speed RDMA link. The two paths are:

PE Read Path: Storage → PE Buffer → PE GPU → DE Buffer (traditional).

DE Read Path: Storage → DE Buffer → PE GPU (RDMA) → DE Buffer (new).

By aggregating the storage bandwidth of both engines, no single storage NIC remains a shared choke point, and the single‑point bottleneck disappears.
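One way to picture the aggregation is to split the KV‑Cache blocks across the two read paths in proportion to each path's storage bandwidth. This is a minimal sketch under that assumption; the function name, block counts, and bandwidth values are illustrative, and the paper's actual partitioning policy may differ:

```python
# Sketch: partition KV-Cache blocks between the PE and DE read paths in
# proportion to the storage bandwidth each path contributes. Names and
# numbers are illustrative assumptions, not the paper's algorithm.

def split_blocks(num_blocks: int, pe_gbps: float, de_gbps: float):
    """Return (blocks loaded via PE path, blocks loaded via DE path)."""
    total = pe_gbps + de_gbps
    via_pe = round(num_blocks * pe_gbps / total)
    return via_pe, num_blocks - via_pe

# With equal storage NICs on both engines, the load splits evenly,
# roughly doubling the aggregate read bandwidth.
print(split_blocks(128, pe_gbps=200.0, de_gbps=200.0))  # (64, 64)
```

With asymmetric NICs the split simply skews toward the faster path while keeping every block assigned to exactly one path.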

Key technical challenges and solutions

Fine‑grained data transfer: Layer‑wise prefill combined with a mixed Layer‑Block/Full‑Block layout enables streaming of KV‑Cache while computation proceeds, reducing idle time.

Traffic isolation: All GPU traffic passes through the compute NIC; InfiniBand virtual lanes (VL) and QoS separate KV‑Cache traffic from model communication, preventing interference.

Dynamic load balancing: A two‑level scheduler (inter‑engine and intra‑engine) continuously balances NIC traffic with GPU compute, adapting to varying Prefill/Decode ratios.
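The load‑balancing idea can be illustrated with a greedy assignment: send each KV‑Cache read to whichever NIC currently carries the least traffic. This is a hypothetical stand‑in for the paper's two‑level scheduler, not its actual algorithm, but it shows why adaptive assignment beats round‑robin when request sizes are skewed:

```python
# Illustrative greedy storage-NIC balancer. This is a simplified stand-in
# for the paper's two-level scheduler, not its actual algorithm.

def greedy_assign(request_sizes, num_nics=2):
    """Assign each request (in bytes) to the least-loaded NIC; return loads."""
    loads = [0] * num_nics
    for size in request_sizes:
        i = loads.index(min(loads))  # pick the currently least-loaded NIC
        loads[i] += size
    return loads

def balance_ratio(loads):
    """Max/min traffic ratio across NICs; 1.0 means perfectly balanced."""
    return max(loads) / min(loads)

# Skewed request sizes: round-robin lands all large reads on one NIC,
# while greedy assignment keeps the two NICs nearly even.
sizes = [8, 1, 8, 1, 8, 1, 8, 1]
round_robin = [sum(sizes[0::2]), sum(sizes[1::2])]  # [32, 4]
print(balance_ratio(round_robin))          # 8.0
print(balance_ratio(greedy_assign(sizes))) # 1.0
```

The same intuition underlies the reported improvement of the storage‑NIC balance ratio from 1.53 (round‑robin) to 1.18 in the ablation study.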

Theoretical analysis

Mathematical modeling of a typical node (8 GPUs, 1 storage NIC) shows that DualPath can operate without bottlenecks across a wide range of Prefill/Decode ratios. The model predicts:

Full utilization of the storage NIC.

No DRAM or compute‑NIC limits.

Zero network congestion.
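A simplified version of the bandwidth argument can be worked through numerically. The KV‑Cache size and NIC bandwidths below are assumptions chosen for illustration, not the paper's modeling parameters:

```python
# Back-of-envelope model of per-step KV-Cache load time with one storage
# path versus two aggregated paths. All figures are assumed for illustration.

def load_time_s(kv_bytes: float, path_gbps_list: list[float]) -> float:
    """Seconds to stream kv_bytes over the aggregate of the given paths."""
    agg_bytes_per_s = sum(path_gbps_list) * 1e9 / 8
    return kv_bytes / agg_bytes_per_s

kv = 40e9                               # 40 GB KV-Cache per step (assumed)
single = load_time_s(kv, [200.0])       # PE storage path only
dual = load_time_s(kv, [200.0, 200.0])  # PE + DE storage paths

print(f"single: {single:.2f}s  dual: {dual:.2f}s  speedup: {single / dual:.2f}x")
```

Under these assumptions the dual path halves the per‑step load time (a 2× ceiling with two equal NICs), consistent with the measured 1.67×–2.25× gains once scheduling overheads and compute overlap are accounted for.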

Experimental evaluation

Offline inference (RL rollout)

Speed‑up over a baseline Prefill‑Decode system:

DeepSeek‑V3.2 660B (2 Prefill + 4 Decode) – 1.87× faster.

DeepSeek 27B (1 Prefill + 1 Decode) – 1.78× faster.

Qwen2.5‑32B (1 Prefill + 2 Decode) – comparable improvement.

Gains increase with larger batch sizes and longer contexts, and are most pronounced in short‑append generation scenarios typical of agents.

Online service

DeepSeek 27B – 1.67× higher throughput.

DeepSeek 660B – 2.25× higher throughput while keeping first‑token latency ≤ 4 s and per‑token latency ≤ 50 ms.

Large‑scale scaling

1,152 GPU cluster: offline job‑completion time 3,201 s, close to linear scaling.

Online service: 22× throughput increase with stable latency.

Ablation study (DeepSeek 660B, 64 K context)

Layer‑wise prefill reduces job‑completion time by 17.21 %.

Adding Dual‑Path loading brings cumulative reduction to 38.19 %.

Introducing the two‑level scheduler adds up to 45.62 % cumulative reduction.

Storage‑NIC traffic balance improves from a round‑robin ratio of 1.53 to 1.18.

Attention‑layer max/avg execution‑time ratio drops to 1.06, reducing GPU idle bubbles.

Conclusion

DualPath redesigns KV‑Cache pathways, shifting the limiting factor from storage‑bandwidth to compute. The approach provides a scalable foundation for long‑context, multi‑round agent systems.

Figures (captions translated): agent trajectory illustration; existing bottleneck vs. DualPath; dual‑path loading illustration; inter‑engine PE scheduling; intra‑engine scheduling; offline inference performance comparison; online service latency metrics; large‑scale offline inference metrics; storage‑NIC load balancing.
https://arxiv.org/pdf/2602.21548
DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
Tags: Performance optimization, LLM inference, Storage bandwidth, DualPath
Written by

PaperAgent

Daily updates, analyzing cutting-edge AI research papers
