How Baidu Cloud Achieved 4µs Low-Latency PD Inference with HPN Network Optimizations

To meet the demanding network requirements of large‑scale PD‑separated inference, Baidu Cloud built an HPN cluster with 4 µs end‑to‑end latency and layered traffic management, adaptive routing, and custom Alltoall operators on top of it, yielding up to 20 % higher throughput and lower latency in both the Prefill and Decode stages.

Baidu Geek Talk

Background

Large‑scale PD (Prefill‑Decode) separated inference systems place new demands on network bandwidth and latency compared with traditional centralized or small‑scale multi‑node inference. The introduction of many expert parallel (EP) workers multiplies Alltoall traffic, and KV‑Cache transfers between Prefill and Decode add further latency sensitivity.

Network Requirements of PD‑Separated Inference

Alltoall traffic grows sharply as the number of EP workers increases, directly affecting OTPS (output tokens per second), TPOT (time per output token), and user experience.
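To see why, consider a rough back‑of‑the‑envelope model of per‑worker Alltoall volume in an MoE dispatch. The function and all numbers below are illustrative assumptions, not figures from Baidu's deployment:

```python
def alltoall_bytes_per_worker(num_ep_workers, tokens, hidden_dim,
                              top_k, bytes_per_elem=2):
    """Rough per-worker Alltoall payload for one MoE dispatch pass.

    Each token is routed to top_k experts; with experts sharded across
    num_ep_workers EP ranks, roughly (n-1)/n of routed tokens leave the
    local GPU. Simplified model: ignores routing skew and the combine pass.
    """
    routed = tokens * top_k
    remote_fraction = (num_ep_workers - 1) / num_ep_workers
    return routed * remote_fraction * hidden_dim * bytes_per_elem

# Growing EP from 8 to 64 ranks pushes the remote fraction from 87.5%
# toward 98.4%, so nearly every dispatched token crosses the network.
```

Under this model, scaling out EP barely dilutes per‑worker traffic while multiplying the number of flows, which is what makes the fabric's latency and congestion behavior dominate.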

KV‑Cache communication latency between Prefill and Decode becomes a critical bottleneck.

Baidu Cloud’s Multi‑Layer Optimizations

1. Physical Network – 4µs HPN Cluster

Baidu built a high‑performance network (HPN) cluster that guarantees sub‑4 µs end‑to‑end latency. The cluster features adaptive routing to eliminate hash collisions and ensures stable low‑latency paths.

[Figure: HPN network architecture diagram]

2. Traffic Management

Separate queues for Alltoall traffic (high priority) and other traffic such as AllReduce (low priority).

Reserve larger buffers and higher bandwidth share for high‑priority queues.

Disable ECN on high‑priority queues and configure DCQCN to ignore micro‑bursts, mitigating incast‑induced slowdown.
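The queue separation can be modeled host‑side as a strict‑priority scheduler. Real deployments implement this in switch QoS configuration (e.g. DSCP‑to‑queue mapping), not application code; the class names and priority values here are assumptions for illustration:

```python
import heapq

# Lower value = served first; the mapping is illustrative.
PRIORITY = {"alltoall": 0, "allreduce": 1}

class StrictPriorityQueue:
    """Toy strict-priority scheduler: Alltoall packets always dequeue
    before AllReduce packets, FIFO within a class."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserving FIFO order per class

    def enqueue(self, traffic_class, packet):
        heapq.heappush(self._heap, (PRIORITY[traffic_class], self._seq, packet))
        self._seq += 1

    def dequeue(self):
        return heapq.heappop(self._heap)[2]
```

Enqueuing an AllReduce packet before an Alltoall packet still serves the Alltoall packet first, which is the behavior the queue separation is meant to guarantee under incast.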

3. Communication Component Optimizations

Alltoall operator: a custom implementation reduces communication time by ~20 % and keeps latency under 5 µs.

Dynamic redundant expert scheduling keeps expert load balanced (max/avg token ratio < 1.2), preventing “fast‑slow” GPU disparity.
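One way to read the max/avg constraint is as a greedy replication loop: measure per‑expert load and replicate the hottest expert until the ratio drops below the threshold. This is a simplified model of such scheduling, not Baidu's actual algorithm, and it assumes a replicated expert's load splits evenly across its replicas:

```python
def max_avg_ratio(loads):
    """Imbalance metric: max per-expert load over the mean."""
    return max(loads) / (sum(loads) / len(loads))

def plan_redundant_replicas(tokens_per_expert, threshold=1.2, budget=8):
    """Greedily replicate the hottest expert until max/avg <= threshold
    or the replica budget is exhausted. Returns replicas per expert."""
    replicas = [1] * len(tokens_per_expert)
    for _ in range(budget):
        load = [t / r for t, r in zip(tokens_per_expert, replicas)]
        if max_avg_ratio(load) <= threshold:
            break
        replicas[load.index(max(load))] += 1
    return replicas
```

With one expert four times hotter than the rest, the loop adds replicas only for that expert until the effective loads equalize.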

Dual‑stream design overlaps computation with communication, further boosting throughput.
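The dual‑stream idea can be sketched in plain Python with a worker thread standing in for the communication stream; a real implementation would use CUDA streams, and `communicate` and `compute` below are placeholder stand‑ins:

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined(chunks, communicate, compute):
    """Overlap communication of chunk i+1 with computation of chunk i.

    `communicate` runs on a worker thread (the "comm stream") while
    `compute` runs on the caller's thread (the "compute stream"), so
    the two phases of consecutive chunks overlap in time.
    """
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        inflight = comm_stream.submit(communicate, chunks[0])
        for nxt in chunks[1:]:
            arrived = inflight.result()                      # wait for comm of i
            inflight = comm_stream.submit(communicate, nxt)  # start comm of i+1
            results.append(compute(arrived))                 # compute i meanwhile
        results.append(compute(inflight.result()))
    return results
```

With communication and computation of roughly equal cost, this schedule hides nearly all communication time behind compute except for the first and last chunk.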

4. KV‑Cache Transfer via Elastic RDMA

KV‑Cache traffic is isolated on a dedicated DCN network using a self‑developed high‑performance RDMA library. The library supports layered transmission and batch transfers, allowing KV‑Cache transfers to fully overlap with computation and achieve full‑bandwidth utilization.
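A minimal sketch of the layered‑transmission idea, assuming `compute_layer` produces one layer's KV blob and `send_kv` stands in for the RDMA write; both are hypothetical placeholders, not the actual library API:

```python
import queue
import threading

def prefill_with_layered_kv_transfer(num_layers, compute_layer, send_kv):
    """Stream each layer's KV cache out while later layers still compute.

    A background sender thread drains a queue of finished layers, so the
    transfer of layer i overlaps the computation of layers i+1..n.
    """
    pending = queue.Queue()

    def sender():
        while True:
            item = pending.get()
            if item is None:  # sentinel: prefill finished
                return
            send_kv(*item)

    t = threading.Thread(target=sender)
    t.start()
    for layer in range(num_layers):
        kv = compute_layer(layer)   # compute path
        pending.put((layer, kv))    # hand off; send overlaps next layer
    pending.put(None)
    t.join()
```

Batching several small per‑layer tensors into one queue item before sending would amortize per‑transfer overhead, in the spirit of the batch‑transfer support mentioned above.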

[Figure: Adaptive routing diagram]

Results

After applying the above optimizations, Baidu observed:

Alltoall communication latency reduced by ~5 %.

Overall inference throughput increased by more than 20 % for both Prefill and Decode stages.

KV‑Cache transfer time became fully overlapped with computation, eliminating it as a bottleneck.

Conclusion

The practice demonstrates that deep integration of network infrastructure, communication components, and workload characteristics is essential for high‑performance PD‑separated inference. By jointly optimizing physical topology, traffic management, and operator implementations, Baidu Cloud achieved sub‑4 µs latency and significant throughput gains, providing a reference architecture for large‑scale AI inference deployments.

Tags: AI inference, Distributed Training, network operations, HPN, KV cache, Alltoall optimization, low latency networking
Written by

Baidu Geek Talk
