How DualPath Eliminates Storage Bandwidth Bottlenecks in Agentic LLM Inference
This article analyzes DualPath, an architecture that redesigns KV‑Cache data paths to overcome storage‑NIC saturation in Prefill‑Decode LLM systems. It covers the theoretical model, the main engineering solutions, and offline and online benchmarks showing up to 2.25× higher throughput.
Background
Agentic large‑language‑model (LLM) workloads involve ultra‑long contexts (hundreds of thousands of tokens), KV‑Cache hit rates of at least 95 %, and short‑append generation. In this regime the dominant system bottleneck shifts from GPU computation to I/O: each round of generation must reload the full KV‑Cache of the conversation from storage before appending only a short continuation.
Problem with the conventional Prefill‑Decode architecture
Modern inference pipelines split the work between a Prefill Engine (PE) and a Decode Engine (DE). The PE performs the heavy prefill computation and continuously reads KV‑Cache from storage, fully saturating its storage NIC (≈100 % utilization). The DE, which only performs autoregressive token generation, leaves its NIC almost idle. This asymmetric NIC usage caps overall throughput and wastes available bandwidth.
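To see the scale of the asymmetry, a back‑of‑the‑envelope calculation helps. The sketch below is illustrative only: the NIC speed, per‑token KV footprint, context length, and append length are all assumed values, not figures from the paper.

```python
# Back-of-the-envelope illustration of the NIC asymmetry.
# All constants below are assumptions for illustration, not paper figures.
STORAGE_NIC_GBPS = 400 / 8          # assume a 400 Gb/s storage NIC -> 50 GB/s
KV_BYTES_PER_TOKEN = 70 * 1024      # assumed per-token KV-Cache footprint
CONTEXT_TOKENS = 128_000            # ultra-long agent context
HIT_RATE = 0.95                     # KV-Cache hit rate

# Every round forces the PE to reload the cached prefix from storage.
pe_read_gb = CONTEXT_TOKENS * HIT_RATE * KV_BYTES_PER_TOKEN / 1e9
pe_load_seconds = pe_read_gb / STORAGE_NIC_GBPS

# The DE only appends a short generation, so its storage traffic is tiny.
APPEND_TOKENS = 500
de_write_gb = APPEND_TOKENS * KV_BYTES_PER_TOKEN / 1e9

print(f"PE reads {pe_read_gb:.1f} GB/request ({pe_load_seconds:.2f} s on one NIC)")
print(f"DE writes {de_write_gb:.2f} GB/request -> its NIC sits nearly idle")
```

Even with generous NIC bandwidth, reloading a multi‑gigabyte prefix per round dwarfs the few megabytes the DE appends, which is why the PE NIC saturates while the DE NIC idles.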
DualPath design
DualPath introduces a second, independent data path that exploits the idle DE NIC bandwidth to load KV‑Cache and then transfers the data to the PE over a high‑speed RDMA link. The two paths are:
PE Read Path: Storage → PE Buffer → PE GPU → DE Buffer (traditional).
DE Read Path: Storage → DE Buffer → PE GPU (RDMA) → DE Buffer (new).
By aggregating the storage bandwidth of both engines, KV‑Cache loading is no longer limited by the PE's single storage NIC, and the single‑point bottleneck disappears.
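A minimal sketch of how KV‑Cache blocks might be partitioned across the two paths, assuming each engine contributes a storage NIC of known bandwidth; the names (`Block`, `plan_dual_path`) and the proportional split are illustrative, not the paper's actual scheduling policy.

```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int  # one KV-Cache block stored in the remote storage tier

def plan_dual_path(blocks: list[Block], pe_gbps: float, de_gbps: float):
    """Partition KV-Cache blocks across the two read paths in proportion
    to the storage bandwidth each engine can contribute."""
    pe_share = pe_gbps / (pe_gbps + de_gbps)
    cut = round(len(blocks) * pe_share)
    pe_path = blocks[:cut]    # Storage -> PE buffer -> PE GPU
    de_path = blocks[cut:]    # Storage -> DE buffer -(RDMA)-> PE GPU
    return pe_path, de_path

pe_blocks, de_blocks = plan_dual_path([Block(i) for i in range(64)], 50.0, 50.0)
print(len(pe_blocks), "via PE path;", len(de_blocks), "via DE path (RDMA relay)")
```

With equal NIC bandwidth on both sides the split is 50/50, effectively doubling the storage bandwidth available for KV‑Cache loading.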
Key technical challenges and solutions
Fine‑grained data transfer: Layer‑wise prefill combined with a mixed Layer‑Block/Full‑Block layout streams KV‑Cache while computation proceeds, reducing idle time (see the sketch after this list).
Traffic isolation: All GPU traffic passes through the compute NIC; InfiniBand virtual lanes (VLs) and QoS separate KV‑Cache traffic from model communication, preventing interference.
Dynamic load balancing: A two‑level scheduler (inter‑engine and intra‑engine) continuously balances NIC traffic against GPU compute, adapting to varying Prefill/Decode ratios.
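A minimal sketch of the layer‑wise prefill idea referenced above: prefetch the next layer's KV blocks while the current layer's attention runs, so storage I/O hides behind compute. `fetch_layer` and `attend_layer` are hypothetical stand‑ins for the storage read and the GPU kernel, and the layer count is assumed; the single‑worker executor models one in‑flight read.

```python
import concurrent.futures as cf

NUM_LAYERS = 61  # assumed layer count, for illustration only

def fetch_layer(layer: int) -> bytes:
    return b"kv"  # placeholder for a storage read of one layer's KV blocks

def attend_layer(layer: int, kv: bytes) -> None:
    pass  # placeholder for the attention kernel over that layer's KV

def layerwise_prefill() -> None:
    with cf.ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch_layer, 0)      # prime the pipeline
        for layer in range(NUM_LAYERS):
            kv = pending.result()                # wait for this layer's KV
            if layer + 1 < NUM_LAYERS:
                pending = io.submit(fetch_layer, layer + 1)  # overlap next read
            attend_layer(layer, kv)              # compute hides the I/O

layerwise_prefill()
```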
Theoretical analysis
Mathematical modeling of a typical node (8 GPUs, 1 storage NIC) shows that DualPath can operate without bottlenecks across a wide range of Prefill/Decode ratios; a toy version of the feasibility check appears after this list. The model predicts:
Full utilization of the storage NIC.
No DRAM or compute‑NIC limits.
Zero network congestion.
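The toy check below mirrors the shape of that argument, assuming per‑engine bandwidth figures that are purely illustrative; the paper's actual model and constants are not reproduced here.

```python
# Toy feasibility check across Prefill/Decode ratios.
# Every constant is an assumption for illustration, not a paper value.
STORAGE_NIC = 50.0    # GB/s storage NIC per engine node
RDMA_NIC = 100.0      # GB/s compute NIC (RDMA) per engine node
DRAM_BW = 400.0       # GB/s DRAM staging bandwidth per engine node

def dualpath_ok(kv_demand: float, n_pe: int, n_de: int) -> bool:
    """KV demand (GB/s) is satisfiable iff the aggregated storage NICs
    cover it and the DE-relayed share fits under RDMA and DRAM limits."""
    aggregate = (n_pe + n_de) * STORAGE_NIC
    # Traffic beyond what the PE storage NICs carry must relay via the DE path.
    de_relay = max(0.0, kv_demand - n_pe * STORAGE_NIC)
    return (kv_demand <= aggregate
            and de_relay <= n_de * RDMA_NIC
            and de_relay <= n_de * DRAM_BW)

for n_pe, n_de in [(2, 4), (1, 1), (1, 2)]:
    print(f"{n_pe}P:{n_de}D ->", dualpath_ok(kv_demand=120.0, n_pe=n_pe, n_de=n_de))
```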
Experimental evaluation
Offline inference (RL rollout)
Speed‑up over a baseline Prefill‑Decode system:
DeepSeek‑V3.2 660B (2 Prefill + 4 Decode) – 1.87× faster.
DeepSeek 27B (1 Prefill + 1 Decode) – 1.78× faster.
Qwen2.5‑32B (1 Prefill + 2 Decode) – comparable improvement.
Gains increase with larger batch sizes and longer contexts, and are most pronounced in short‑append generation scenarios typical of agents.
Online service
DeepSeek 27B – 1.67× higher throughput.
DeepSeek 660B – 2.25× higher throughput while keeping first‑token latency ≤ 4 s and per‑token latency ≤ 50 ms.
Large‑scale scaling
On a 1,152‑GPU cluster: offline job‑completion time of 3,201 s, close to linear scaling.
Online service: 22× throughput increase with stable latency.
Ablation study (DeepSeek 660B, 64 K context)
Layer‑wise prefill alone reduces job‑completion time by 17.21 %.
Adding Dual‑Path loading raises the cumulative reduction to 38.19 %.
Adding the two‑level scheduler raises the cumulative reduction to 45.62 %.
Storage‑NIC traffic imbalance improves from 1.53 under round‑robin scheduling to 1.18.
Attention‑layer max/avg execution‑time ratio drops to 1.06, reducing GPU idle bubbles.
Conclusion
DualPath redesigns KV‑Cache pathways, shifting the limiting factor from storage bandwidth back to compute. The approach provides a scalable foundation for long‑context, multi‑round agent systems.
Paper: DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference (https://arxiv.org/pdf/2602.21548)
