How Baidu Cloud Achieved 4µs Low-Latency PD Inference with HPN Network Optimizations
To meet the demanding network requirements of large‑scale PD‑separated inference, Baidu Cloud built an HPN cluster with 4 µs end‑to‑end latency and combined it with traffic management, adaptive routing, and custom Alltoall operators, yielding up to 20 % higher throughput and lower latency in both the Prefill and Decode stages.
Background
Large‑scale PD (Prefill‑Decode) separated inference systems place new demands on network bandwidth and latency compared with traditional centralized or small‑scale multi‑node inference. The introduction of many expert parallel (EP) workers multiplies Alltoall traffic, and KV‑Cache transfers between Prefill and Decode add further latency sensitivity.
Network Requirements of PD‑Separated Inference
Alltoall traffic grows dramatically as the number of EP ranks increases, directly affecting OTPS (output tokens per second), TPOT (time per output token), and user experience; a back‑of‑the‑envelope estimate follows this list.
KV‑Cache communication latency between Prefill and Decode becomes a critical bottleneck.
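To make the first point concrete, here is a rough sketch (not from the original write‑up) of how per‑layer MoE Alltoall volume behaves as EP size grows; the token count, hidden size, and top‑k values are illustrative assumptions, not Baidu's configuration.

```python
def alltoall_bytes_per_layer(tokens: int, hidden: int, top_k: int,
                             bytes_per_elem: int = 2) -> int:
    """Each token is dispatched to top_k experts and later combined back,
    so it crosses the network roughly twice per MoE layer (dispatch + combine)."""
    return 2 * tokens * top_k * hidden * bytes_per_elem


for ep in (8, 16, 32, 64, 128):
    # With more EP ranks, a larger share of dispatched tokens leaves the
    # local GPU/node: roughly (ep - 1) / ep of the volume is remote traffic.
    total = alltoall_bytes_per_layer(tokens=4096, hidden=7168, top_k=8)
    remote = total * (ep - 1) / ep
    print(f"EP={ep:3d}  remote Alltoall per MoE layer ~ {remote / 2**20:.0f} MiB")
```

As EP grows, nearly all of the dispatch volume becomes cross‑node traffic, which is why Alltoall latency dominates OTPS and TPOT at scale.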
Baidu Cloud’s Multi‑Layer Optimizations
1. Physical Network – 4µs HPN Cluster
Baidu built a high‑performance network (HPN) cluster that delivers sub‑4 µs end‑to‑end latency. Adaptive routing eliminates the hotspots caused by ECMP hash collisions and keeps paths stably at low latency.
2. Traffic Management
Separate queues for Alltoall traffic (high priority) and other traffic such as AllReduce (low priority); see the communicator‑separation sketch after this list.
Reserve larger buffers and higher bandwidth share for high‑priority queues.
Disable ECN on high‑priority queues and configure DCQCN to ignore micro‑bursts, mitigating incast‑induced slowdown.
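One common way to realize the queue separation above, sketched here under assumptions rather than taken from Baidu's stack, is to run Alltoall and AllReduce on distinct communicators so that each communicator's traffic can be steered into its own priority queue at the NIC/switch (the DSCP/PFC mapping itself is configured outside this script). The process‑group layout and function names below are illustrative.

```python
import torch
import torch.distributed as dist


def init_comm_groups():
    """Create one communicator per traffic class so each can be mapped
    to its own priority queue by the NIC/switch configuration."""
    dist.init_process_group(backend="nccl")
    world = list(range(dist.get_world_size()))
    ep_group = dist.new_group(ranks=world)   # expert-parallel Alltoall (high priority)
    tp_group = dist.new_group(ranks=world)   # tensor-parallel AllReduce (low priority)
    return ep_group, tp_group


def moe_dispatch(tokens: torch.Tensor, ep_group) -> torch.Tensor:
    # Latency-critical Alltoall: token dispatch to remote experts.
    out = torch.empty_like(tokens)
    dist.all_to_all_single(out, tokens, group=ep_group)
    return out


def tp_allreduce(activations: torch.Tensor, tp_group) -> torch.Tensor:
    # Bandwidth-heavy but less latency-sensitive traffic; lower queue priority.
    dist.all_reduce(activations, group=tp_group)
    return activations
```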
3. Communication Component Optimizations
Alltoall operator: a custom implementation reduces communication time by ~20 % and keeps operator latency under 5 µs.
Dynamic redundant expert scheduling keeps expert load balanced (max/avg token ratio < 1.2), preventing a "fast worker waits for slow worker" disparity across GPUs; a scheduling sketch appears after this list.
Dual‑stream design overlaps computation with communication, further boosting throughput; a stream‑overlap sketch appears after this list.
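A minimal sketch of the dual‑stream idea, assuming a plain PyTorch/NCCL setup rather than Baidu's custom operator: the MoE dispatch Alltoall runs on a side CUDA stream while computation that does not depend on it (here, a hypothetical shared expert) continues on the default stream.

```python
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()  # dedicated communication stream


def moe_layer(tokens, shared_expert, routed_experts, ep_group):
    dispatch_out = torch.empty_like(tokens)

    # Launch the dispatch Alltoall on the side stream.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_to_all_single(dispatch_out, tokens, group=ep_group)

    # Overlap: the shared-expert compute does not need the dispatched tokens,
    # so it runs on the default stream while the Alltoall is in flight.
    shared_out = shared_expert(tokens)

    # Re-join the streams before touching the Alltoall result.
    torch.cuda.current_stream().wait_stream(comm_stream)
    routed_out = routed_experts(dispatch_out)  # combine Alltoall omitted for brevity
    return shared_out + routed_out
```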
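And a small illustrative sketch of the load‑balance target itself (max/avg token ratio < 1.2), not Baidu's scheduler: given per‑expert token counts and spare expert slots, replicate the hottest experts until per‑replica load approaches the threshold.

```python
from collections import Counter


def plan_redundant_experts(tokens_per_expert: list[int],
                           spare_slots: int,
                           threshold: float = 1.2) -> dict[int, int]:
    """Return expert_id -> replica count (including the original copy)."""
    replicas = Counter({e: 1 for e in range(len(tokens_per_expert))})
    for _ in range(spare_slots):
        load = {e: tokens_per_expert[e] / replicas[e] for e in replicas}
        avg = sum(tokens_per_expert) / sum(replicas.values())
        hottest = max(load, key=load.get)
        if load[hottest] / avg < threshold:
            break  # already balanced enough
        replicas[hottest] += 1  # give the hottest expert one more copy
    return dict(replicas)


# Example: expert 0 receives far more tokens than the others and gets replicated.
print(plan_redundant_experts([900, 100, 120, 80], spare_slots=4))
```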
4. KV‑Cache Transfer via Elastic RDMA
KV‑Cache traffic is isolated on a dedicated data center network (DCN) and carried by a self‑developed high‑performance RDMA library. The library supports layered (layer‑by‑layer) transmission and batched transfers, allowing KV‑Cache movement to fully overlap with computation and reach full bandwidth utilization.
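Below is a hedged sketch of what layered, batched KV‑Cache transfer overlapping with compute can look like. The self‑developed RDMA library's API is not public, so erdma_send_async and erdma_wait are hypothetical placeholders for an asynchronous transfer and its completion wait.

```python
def prefill_and_stream_kv(model_layers, hidden, kv_cache,
                          erdma_send_async, erdma_wait):
    """Run prefill layer by layer; as soon as layer i's KV block is written,
    ship it to the Decode instance while layer i+1 is still computing."""
    pending = []
    for i, layer in enumerate(model_layers):
        hidden = layer(hidden, kv_cache[i])          # fills kv_cache[i]
        # Batch K and V of this layer into one transfer to amortize per-op cost.
        handle = erdma_send_async(layer_id=i,
                                  buffers=(kv_cache[i].k, kv_cache[i].v))
        pending.append(handle)
    for h in pending:
        erdma_wait(h)  # most transfers have already completed in flight
    return hidden
```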
Results
After applying the above optimizations, Baidu observed:
Alltoall communication latency reduced by ~5 %.
Overall inference throughput increased by more than 20 % for both Prefill and Decode stages.
KV‑Cache transfer time became fully overlapped with computation, eliminating it as a bottleneck.
Conclusion
The practice demonstrates that deep integration of network infrastructure, communication components, and workload characteristics is essential for high‑performance PD‑separated inference. By jointly optimizing physical topology, traffic management, and operator implementations, Baidu Cloud achieved sub‑4 µs latency and significant throughput gains, providing a reference architecture for large‑scale AI inference deployments.