How Baidu Cloud Achieved 4µs Low-Latency PD Inference with HPN Network Optimizations

To meet the demanding network requirements of large‑scale PD‑separated inference, Baidu Cloud built an HPN cluster with 4 µs end‑to‑end latency and layered traffic management, adaptive routing, and custom Alltoall operators on top of it, yielding up to 20 % higher throughput and lower latency in both the Prefill and Decode stages.

Baidu Geek Talk

Background

Large‑scale PD (Prefill‑Decode) separated inference systems place new demands on network bandwidth and latency compared with traditional centralized or small‑scale multi‑node inference. The introduction of many expert parallel (EP) workers multiplies Alltoall traffic, and KV‑Cache transfers between Prefill and Decode add further latency sensitivity.

Network Requirements of PD‑Separated Inference

Alltoall traffic grows sharply as the number of EP workers increases, directly affecting OTPS (output tokens per second), TPOT (time per output token), and user experience.
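To see why, consider a rough back‑of‑the‑envelope model of per‑worker Alltoall volume in an MoE dispatch. The function and all numbers below are illustrative assumptions, not figures from Baidu's deployment:

```python
def alltoall_bytes_per_worker(num_ep_workers, tokens, hidden_dim,
                              top_k, bytes_per_elem=2):
    """Rough per-worker Alltoall payload for one MoE dispatch pass.

    Each token is routed to top_k experts; with experts sharded across
    num_ep_workers EP ranks, roughly (n-1)/n of routed tokens leave the
    local GPU. Simplified model: ignores routing skew and the combine pass.
    """
    routed = tokens * top_k
    remote_fraction = (num_ep_workers - 1) / num_ep_workers
    return routed * remote_fraction * hidden_dim * bytes_per_elem

# Growing EP from 8 to 64 ranks pushes the remote fraction from 87.5%
# toward 98.4%, so nearly every dispatched token crosses the network.
```

Under this model, scaling out EP barely dilutes per‑worker traffic while multiplying the number of flows, which is what makes the fabric's latency and congestion behavior dominate.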

KV‑Cache communication latency between Prefill and Decode becomes a critical bottleneck.

Baidu Cloud’s Multi‑Layer Optimizations

1. Physical Network – 4µs HPN Cluster

Baidu built a high‑performance network (HPN) cluster that guarantees sub‑4 µs end‑to‑end latency. The cluster features adaptive routing to eliminate hash collisions and ensures stable low‑latency paths.

[Figure: HPN network architecture diagram]

2. Traffic Management

Separate queues for Alltoall traffic (high priority) and other traffic such as AllReduce (low priority).

Reserve larger buffers and higher bandwidth share for high‑priority queues.

Disable ECN on high‑priority queues and configure DCQCN to ignore micro‑bursts, mitigating incast‑induced slowdown.
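The queue separation can be modeled host‑side as a strict‑priority scheduler. Real deployments implement this in switch QoS configuration (e.g. DSCP‑to‑queue mapping), not application code; the class names and priority values here are assumptions for illustration:

```python
import heapq

# Lower value = served first; the mapping is illustrative.
PRIORITY = {"alltoall": 0, "allreduce": 1}

class StrictPriorityQueue:
    """Toy strict-priority scheduler: Alltoall packets always dequeue
    before AllReduce packets, FIFO within a class."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserving FIFO order per class

    def enqueue(self, traffic_class, packet):
        heapq.heappush(self._heap, (PRIORITY[traffic_class], self._seq, packet))
        self._seq += 1

    def dequeue(self):
        return heapq.heappop(self._heap)[2]
```

Enqueuing an AllReduce packet before an Alltoall packet still serves the Alltoall packet first, which is the behavior the queue separation is meant to guarantee under incast.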

3. Communication Component Optimizations

Alltoall operator: a custom implementation reduces communication time by ~20 % and keeps latency under 5 µs.

Dynamic redundant expert scheduling keeps expert load balanced (max/avg token ratio < 1.2), preventing “fast‑slow” GPU disparity.
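One way to read the max/avg constraint is as a greedy replication loop: measure per‑expert load and replicate the hottest expert until the ratio drops below the threshold. This is a simplified model of such scheduling, not Baidu's actual algorithm, and it assumes a replicated expert's load splits evenly across its replicas:

```python
def max_avg_ratio(loads):
    """Imbalance metric: max per-expert load over the mean."""
    return max(loads) / (sum(loads) / len(loads))

def plan_redundant_replicas(tokens_per_expert, threshold=1.2, budget=8):
    """Greedily replicate the hottest expert until max/avg <= threshold
    or the replica budget is exhausted. Returns replicas per expert."""
    replicas = [1] * len(tokens_per_expert)
    for _ in range(budget):
        load = [t / r for t, r in zip(tokens_per_expert, replicas)]
        if max_avg_ratio(load) <= threshold:
            break
        replicas[load.index(max(load))] += 1
    return replicas
```

With one expert four times hotter than the rest, the loop adds replicas only for that expert until the effective loads equalize.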

Dual‑stream design overlaps computation with communication, further boosting throughput.
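The dual‑stream idea can be sketched in plain Python with a worker thread standing in for the communication stream; a real implementation would use CUDA streams, and `communicate` and `compute` below are placeholder stand‑ins:

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined(chunks, communicate, compute):
    """Overlap communication of chunk i+1 with computation of chunk i.

    `communicate` runs on a worker thread (the "comm stream") while
    `compute` runs on the caller's thread (the "compute stream"), so
    the two phases of consecutive chunks overlap in time.
    """
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        inflight = comm_stream.submit(communicate, chunks[0])
        for nxt in chunks[1:]:
            arrived = inflight.result()                      # wait for comm of i
            inflight = comm_stream.submit(communicate, nxt)  # start comm of i+1
            results.append(compute(arrived))                 # compute i meanwhile
        results.append(compute(inflight.result()))
    return results
```

With communication and computation of roughly equal cost, this schedule hides nearly all communication time behind compute except for the first and last chunk.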

4. KV‑Cache Transfer via Elastic RDMA

KV‑Cache traffic is isolated on a dedicated DCN network using a self‑developed high‑performance RDMA library. The library supports layered transmission and batch transfers, allowing KV‑Cache transfers to fully overlap with computation and achieve full‑bandwidth utilization.
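A minimal sketch of the layered‑transmission idea, assuming `compute_layer` produces one layer's KV blob and `send_kv` stands in for the RDMA write; both are hypothetical placeholders, not the actual library API:

```python
import queue
import threading

def prefill_with_layered_kv_transfer(num_layers, compute_layer, send_kv):
    """Stream each layer's KV cache out while later layers still compute.

    A background sender thread drains a queue of finished layers, so the
    transfer of layer i overlaps the computation of layers i+1..n.
    """
    pending = queue.Queue()

    def sender():
        while True:
            item = pending.get()
            if item is None:  # sentinel: prefill finished
                return
            send_kv(*item)

    t = threading.Thread(target=sender)
    t.start()
    for layer in range(num_layers):
        kv = compute_layer(layer)   # compute path
        pending.put((layer, kv))    # hand off; send overlaps next layer
    pending.put(None)
    t.join()
```

Batching several small per‑layer tensors into one queue item before sending would amortize per‑transfer overhead, in the spirit of the batch‑transfer support mentioned above.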

[Figure: Adaptive routing diagram]

Results

After applying the above optimizations, Baidu observed:

Alltoall communication latency reduced by ~5 %.

Overall inference throughput increased by more than 20 % for both Prefill and Decode stages.

KV‑Cache transfer time became fully overlapped with computation, eliminating it as a bottleneck.

Conclusion

The practice demonstrates that deep integration of network infrastructure, communication components, and workload characteristics is essential for high‑performance PD‑separated inference. By jointly optimizing physical topology, traffic management, and operator implementations, Baidu Cloud achieved sub‑4 µs latency and significant throughput gains, providing a reference architecture for large‑scale AI inference deployments.

Tags: AI inference, Distributed Training, network operations, HPN, KV cache, Alltoall optimization, low latency networking
Written by

Baidu Geek Talk
