How DualPath Revives Idle Network Cards to Break Long‑Context I/O Bottlenecks in DeepSeek V4
The article analyzes the KV‑Cache storage I/O bottleneck that limits agentic LLM inference, introduces the DualPath architecture with a storage‑to‑decode data path and RDMA‑based scheduling, and shows up to 1.87× offline and 1.96× online throughput gains on large‑scale GPU clusters.
Problem Statement
As large language models evolve into agentic systems, inference workloads combine extremely long contexts (tens of thousands of tokens) with frequent short updates. In the prevalent Prefill-Decode separation architecture, the KV-Cache is stored remotely and loaded by the Prefill node, so the Prefill node's storage NIC stays saturated while the Decode node's NIC sits idle. With KV-Cache hit rates exceeding 95%, GPU compute cores spend most of their time waiting for cache data, turning inference into an I/O-bound workload. For DeepSeek-V3.2, the cache-compute ratio reaches 22 GB/PFLOP, an unusually heavy bandwidth demand per unit of compute.
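To make that ratio concrete, here is a quick back-of-envelope check; the per-GPU compute throughput and NIC line rate below are illustrative assumptions, not figures from the article:

```python
CACHE_COMPUTE_RATIO = 22        # GB of KV-Cache per PFLOP (from the article)
EFFECTIVE_PFLOPS_PER_GPU = 1.0  # assumed sustained compute per Hopper GPU
NIC_GBIT_PER_S = 400            # assumed storage NIC line rate

demand_gb_s = CACHE_COMPUTE_RATIO * EFFECTIVE_PFLOPS_PER_GPU
supply_gb_s = NIC_GBIT_PER_S / 8  # gigabits -> gigabytes

print(f"KV-Cache demand per GPU:       {demand_gb_s:.0f} GB/s")
print(f"One 400 Gb/s NIC supplies:     {supply_gb_s:.0f} GB/s")
print(f"GPUs one storage NIC can feed: {supply_gb_s / demand_gb_s:.1f}")
```

At an assumed ~1 PFLOP/s of sustained compute, each GPU needs ~22 GB/s of KV-Cache bandwidth, so a single 400 Gb/s storage NIC (~50 GB/s) can feed only a couple of GPUs, while the Decode side's equally capable NICs sit unused.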
DualPath Architecture
DualPath adds a second data path that reads KV‑Cache directly into the Decode engine’s memory buffer and then forwards the data to the Prefill engine over high‑bandwidth compute NICs using the RDMA protocol. This storage‑to‑decode path, combined with a global scheduler that dynamically allocates traffic between the two paths, transforms the previously single‑point I/O bottleneck into a globally shared high‑throughput resource pool.
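A minimal sketch of the relay path, with an in-process queue standing in for the network; the helpers `read_from_storage` and `rdma_forward` are hypothetical placeholders, not the system's actual API:

```python
from queue import Queue
from threading import Thread

def read_from_storage(block_id: int) -> bytes:
    """Hypothetical stand-in for a remote KV-Cache storage read."""
    return bytes(4096)  # dummy 4 KB payload

def rdma_forward(dst: Queue, payload: bytes) -> None:
    """Hypothetical stand-in for an RDMA write over the compute NIC."""
    dst.put(payload)

def decode_relay(block_ids: list[int], prefill_inbox: Queue) -> None:
    """Storage-to-decode path: the Decode node stages each block in its
    own memory buffer, then forwards it to Prefill via the compute NIC."""
    for bid in block_ids:
        staged = read_from_storage(bid)      # lands in Decode's buffer
        rdma_forward(prefill_inbox, staged)  # second hop to Prefill

# The global scheduler splits one request's blocks across the two paths.
prefill_inbox: Queue = Queue()
direct_blocks, relayed_blocks = [0, 1, 2], [3, 4, 5]

relay = Thread(target=decode_relay, args=(relayed_blocks, prefill_inbox))
relay.start()
for bid in direct_blocks:  # classic storage-to-prefill path
    prefill_inbox.put(read_from_storage(bid))
relay.join()

print(f"Prefill assembled {prefill_inbox.qsize()} blocks via two paths")
```

The point of the second hop is that it rides the Decode node's otherwise idle NICs, so both paths drain the storage backlog in parallel instead of queueing behind the Prefill node's single storage NIC.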
Implementation Details
The system places the compute NIC at the center of traffic control. All host-to-device copies are handed to the compute NIC, which leverages hardware QoS (Virtual Lanes on InfiniBand, DSCP/TC on RoCE) to allocate ~99% of bandwidth to inference traffic and use the remainder for KV-Cache movement. The scheduler monitors two health metrics per node (local disk read queue length and pending token count) and prefers nodes with short queues and surplus compute capacity, preventing queue buildup on any single node.
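A sketch of that selection rule, assuming a simple weighted sum over the two metrics; the article does not specify the scoring formula, so the weights here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class NodeHealth:
    name: str
    disk_read_queue: int  # local disk read queue length
    pending_tokens: int   # tokens queued for processing on this node

def score(n: NodeHealth, w_disk: float = 1.0, w_tok: float = 0.001) -> float:
    """Lower is better: a short disk queue and few pending tokens mean
    surplus I/O and compute headroom. Weights are assumed, not published."""
    return w_disk * n.disk_read_queue + w_tok * n.pending_tokens

nodes = [
    NodeHealth("decode-0", disk_read_queue=2,  pending_tokens=4_000),
    NodeHealth("decode-1", disk_read_queue=15, pending_tokens=1_000),
    NodeHealth("decode-2", disk_read_queue=1,  pending_tokens=20_000),
]

# Route the next KV-Cache read through the healthiest node, so no
# single node's queue builds up.
best = min(nodes, key=score)
print(f"route next KV-Cache read via {best.name}")
```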
Within the Prefill engine, a hierarchical prefill strategy and a token-chunking mechanism split the forward pass into balanced blocks so that all GPUs in a group finish at nearly the same time. This cuts the Max/Avg attention-kernel execution-time ratio from 1.53 to 1.06 within the first 5% of execution, eliminating compute bubbles.
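The article does not spell out the chunking algorithm; the sketch below is one plausible reading, splitting sequences into fixed-size token chunks assigned greedily to the least-loaded GPU (the chunk size, interface, and use of token count as a cost proxy are all assumptions):

```python
def balanced_token_chunks(seq_lens: list[int], num_gpus: int,
                          chunk: int = 2_048):
    """Split each sequence into fixed-size token chunks, assigning each
    chunk to the currently least-loaded GPU so per-GPU token counts stay
    nearly equal and all GPUs finish the block at about the same time."""
    loads = [0] * num_gpus
    assignment = [[] for _ in range(num_gpus)]
    for seq_id, length in enumerate(seq_lens):
        for start in range(0, length, chunk):
            size = min(chunk, length - start)
            g = loads.index(min(loads))            # least-loaded GPU
            assignment[g].append((seq_id, start, size))
            loads[g] += size
    return assignment, loads

seq_lens = [48_000, 32_000, 8_000, 6_000, 4_000, 2_000]
_, loads = balanced_token_chunks(seq_lens, num_gpus=4)
avg = sum(loads) / len(loads)
print("tokens per GPU:", loads, f"Max/Avg = {max(loads) / avg:.2f}")
```

With ~2K-token chunks, per-GPU loads differ by at most one chunk, qualitatively matching the Max/Avg flattening reported above.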
Experimental Evaluation
Tests were run on a production-grade cluster of 1152 NVIDIA Hopper GPUs, evaluating DeepSeek-V3.2 660B, Qwen2.5-32B (a GQA architecture), and an internal DS 27B model. Offline rollout workloads with 32K–64K contexts completed tasks 1.87× faster, approaching the zero-I/O "Oracle" limit. Under online serving latency constraints (TTFT ≤ 4 s, TPOT ≤ 50 ms), average concurrent request throughput (APS) rose 1.96×. Load-balancing metrics improved as well: the Max/Avg storage-NIC traffic ratio dropped from 1.53 to 1.18, and the Max/Avg attention-kernel time ratio fell to 1.06.
Ablation studies showed that hierarchical prefill alone reduced task time by 17.21%; adding DualPath increased the reduction to 38.19%; the full dynamic scheduler yielded a total job-completion-time reduction of 45.62% versus the unoptimized baseline.
Scalability and Limits
Scaling experiments from 2P4D up to 48P96D demonstrated near-linear throughput growth: at 44P88D, throughput was 22× the 2P4D baseline, matching the 22× growth in instance count. However, on the smaller DS 27B model, both the Basic and DualPath configurations still showed noticeable TPOT overhead, indicating that cross-node Prefill-Decode transfer costs dominate inference at modest model sizes.
Conclusion
DualPath identifies storage bandwidth as the hard limit on agentic LLM inference and repurposes idle Decode-side NICs as a globally shared high-throughput pool, extending the performance envelope of data-center clusters for ultra-long-context, multi-turn tasks. The architecture's success suggests a strong foundation for future DeepSeek V4 models targeting multi-agent collaboration.