How Baidu Cloud Achieved 4µs End-to-End Latency for Large-Scale PD Inference
Baidu Intelligent Cloud built a 4µs end-to-end low‑latency HPN cluster, optimized traffic management and communication operators, and introduced dynamic expert balancing to dramatically improve the performance of large‑scale PD‑separated inference services, showcasing the deep integration of network infrastructure with AI workloads.
Supporting PD‑Separated Inference with a 4µs End‑to‑End Low‑Latency HPN Cluster
To meet the demands of large‑scale PD‑separated inference, Baidu Intelligent Cloud built a 4µs end‑to‑end low‑latency HPN cluster, optimized traffic management, and refined communication operators, dramatically improving overall inference service performance.
1. Network Requirements for PD‑Separated Inference
Traditional inference services are deployed in a centralized fashion and have modest bandwidth needs. In large‑scale PD‑separated systems, expert parallelism (EP) scales up and All‑to‑All traffic grows dramatically, while KV‑Cache communication latency directly impacts service performance.
Massive EP All‑to‑All traffic places heavy demands on the network fabric and on communication operators (the sketch after this list shows how EP dispatch becomes All‑to‑All traffic).
KV‑Cache transfers between the Prefill and Decode stages add latency on the critical path.
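To make the traffic pattern concrete, here is a minimal sketch of how EP token dispatch becomes All‑to‑All communication, using PyTorch's `torch.distributed`. The shapes, group handling, and equal‑split assumption are illustrative simplifications, not details of Baidu's stack.

```python
# Minimal sketch: expert-parallel (EP) token dispatch expressed as All-to-All.
# Shapes and the equal-split assumption are illustrative, not Baidu's design.
import torch
import torch.distributed as dist

def ep_dispatch(tokens: torch.Tensor, ep_group=None) -> torch.Tensor:
    """Send each rank's routed tokens to the EP ranks hosting their experts.

    `tokens` is pre-sorted so the i-th contiguous chunk goes to EP rank i.
    Every rank exchanges a chunk with every other rank, so per-layer traffic
    grows with the EP degree -- the reason large-EP inference stresses the
    network fabric.
    """
    recv = torch.empty_like(tokens)
    dist.all_to_all_single(recv, tokens, group=ep_group)
    return recv
```

Note that each MoE layer performs an exchange like this twice (dispatch and combine), so the volume multiplies quickly as EP scales out.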
2. Solutions and Best Practices
2.1 Build HPN Network for All‑to‑All Traffic
Physical layer: Build a “4µs end‑to‑end low‑latency” HPN cluster with adaptive routing to eliminate ECMP hash conflicts (a toy simulation after this list illustrates the problem).
Traffic management: Queue All‑to‑All traffic separately, allocate more buffers and bandwidth to high‑priority queues, and disable ECN for those queues.
Communication components: Optimize All‑to‑All operators for a ~20% throughput gain and keep the expert load‑balance ratio below 1.2.
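The hash conflicts mentioned above arise when static ECMP hashes several large flows onto the same link. A toy simulation, with a made‑up flow set and link count, shows how static hashing can overload one link while adaptive (load‑aware) routing spreads flows evenly:

```python
# Toy simulation: static ECMP hashing vs. load-aware (adaptive) routing.
# Flow sizes and link count are made up purely for illustration.
import random

LINKS = 4
flows = [(f"flow-{i}", random.randint(50, 100)) for i in range(8)]

# Static ECMP: link choice depends only on a hash of the flow ID, so several
# elephant flows can collide on one link regardless of its current load.
ecmp_load = [0] * LINKS
for name, size in flows:
    ecmp_load[hash(name) % LINKS] += size

# Adaptive routing: place each flow on the currently least-loaded link.
adaptive_load = [0] * LINKS
for name, size in flows:
    adaptive_load[adaptive_load.index(min(adaptive_load))] += size

print("ECMP link loads:    ", ecmp_load)      # often badly skewed
print("Adaptive link loads:", adaptive_load)  # close to even
```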
2.2 Manage All‑to‑All and KV‑Cache Traffic
Prioritize All‑to‑All traffic in NIC queues and reserve bandwidth.
Isolate KV‑Cache traffic on a dedicated DCN network with elastic RDMA for full‑bandwidth transmission (a traffic‑marking sketch follows this list).
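One standard way to keep two traffic classes in separate NIC and switch queues is DSCP marking; switches then map DSCP values to priority queues with their own buffer and ECN settings. The sketch below uses the plain socket API, and the specific DSCP values are assumptions; an RDMA deployment would instead set the traffic class on the RDMA connection itself.

```python
# Sketch: mark All-to-All and KV-Cache flows with different DSCP values so
# the fabric can queue them separately. The DSCP values are illustrative
# assumptions; real RDMA traffic would set the traffic class on the
# connection/QP rather than on a TCP socket.
import socket

DSCP_ALL_TO_ALL = 46   # e.g. mapped to a high-priority queue with more buffer
DSCP_KV_CACHE   = 26   # e.g. carried on the separate DCN path

def mark_flow(sock: socket.socket, dscp: int) -> None:
    # IP_TOS carries the DSCP code point in its upper 6 bits.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)

a2a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mark_flow(a2a, DSCP_ALL_TO_ALL)

kv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mark_flow(kv, DSCP_KV_CACHE)
```

Disabling ECN for the high‑priority queues, as described above, is configured on the switch side and is not shown here.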
2.3 Enhance Inference Service Communication Efficiency
Improve All‑to‑All operator performance (≈20% faster) and overlap computation with communication.
Dynamic redundant expert encoding keeps EP load balanced (first sketch after this list).
Dual‑stream optimization further raises overall throughput by over 20% (second sketch after this list).
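The load‑balance figure of 1.2 reads most naturally as the ratio of the hottest expert's load to the mean load. Below is a hedged sketch of measuring that ratio and greedily replicating hot experts; both the metric definition and the greedy policy are our assumptions for illustration, not the actual “dynamic redundant expert encoding” algorithm.

```python
# Sketch: measure EP load imbalance and greedily replicate hot experts.
# The max/mean metric and the greedy policy are assumptions for illustration.

def imbalance_ratio(token_counts: list[int]) -> float:
    """max load / mean load; 1.0 means perfectly balanced."""
    mean = sum(token_counts) / len(token_counts)
    return max(token_counts) / mean

def pick_redundant_experts(token_counts: list[int], budget: int) -> list[int]:
    """Spend `budget` extra replicas, each time on the expert whose
    per-replica load is currently highest."""
    load = {e: c for e, c in enumerate(token_counts)}
    replicas = {e: 1 for e in load}
    chosen = []
    for _ in range(budget):
        hot = max(load, key=lambda e: load[e] / replicas[e])
        replicas[hot] += 1
        chosen.append(hot)
    return chosen
```

With this definition, “below 1.2” means the busiest expert handles at most 20% more tokens than the average expert.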
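Dual‑stream optimization typically means issuing communication on a side CUDA stream so that independent computation keeps the GPU busy in the meantime. Here is a minimal PyTorch sketch of the pattern; the placeholder compute and event choreography are illustrative, not Baidu's operator implementation.

```python
# Sketch: overlap All-to-All communication with independent computation
# using two CUDA streams. Kernels and shapes are placeholders.
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()

def dual_stream_step(x: torch.Tensor, a2a_in: torch.Tensor) -> torch.Tensor:
    a2a_out = torch.empty_like(a2a_in)
    done = torch.cuda.Event()

    # The side stream must see a2a_in fully written by the default stream.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_to_all_single(a2a_out, a2a_in)  # communication kernel
        done.record()

    # Independent compute proceeds on the default stream in parallel.
    y = torch.relu(x @ x.transpose(-1, -2))

    # Block only at the point where the communication result is consumed.
    torch.cuda.current_stream().wait_event(done)
    return y + a2a_out.mean()
```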
3. Summary
Baidu Intelligent Cloud’s large‑scale PD‑separated inference optimizations demonstrate the critical importance of tightly integrating network infrastructure, communication components, and AI workload characteristics to achieve low‑latency, high‑throughput AI services.
