How Baidu Cloud Achieved 4µs End-to-End Latency for Large-Scale PD Inference
Baidu Intelligent Cloud built a 4µs end-to-end low‑latency HPN cluster, optimized traffic management and communication operators, and introduced dynamic expert balancing to dramatically improve the performance of large‑scale PD‑separated inference services, showcasing the deep integration of network infrastructure with AI workloads.
Supporting PD‑Separated Inference with a 4µs End‑to‑End Low‑Latency HPN Cluster
To meet the demands of large‑scale PD‑separated inference, Baidu Intelligent Cloud built a 4µs end‑to‑end low‑latency HPN cluster, optimized traffic management, and refined communication operators, dramatically improving overall inference service performance.
1. Network Requirements for PD‑Separated Inference
Traditional inference services are deployed in a centralized fashion and have modest bandwidth needs. In large‑scale PD‑separated systems, expert parallelism (EP) scales up and All‑to‑All traffic grows dramatically, while KV‑Cache communication latency directly impacts service performance.
Massive EP All‑to‑All traffic places heavy demands on the network fabric and on communication operators (the sketch after this list shows how EP dispatch becomes All‑to‑All traffic).
KV‑Cache transfers between the Prefill and Decode stages add latency on the critical path.
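To make the traffic pattern concrete, here is a minimal sketch of how EP token dispatch becomes All‑to‑All communication, using PyTorch's `torch.distributed`. The shapes, group handling, and equal‑split assumption are illustrative simplifications, not details of Baidu's stack.

```python
# Minimal sketch: expert-parallel (EP) token dispatch expressed as All-to-All.
# Shapes and the equal-split assumption are illustrative, not Baidu's design.
import torch
import torch.distributed as dist

def ep_dispatch(tokens: torch.Tensor, ep_group=None) -> torch.Tensor:
    """Send each rank's routed tokens to the EP ranks hosting their experts.

    `tokens` is pre-sorted so the i-th contiguous chunk goes to EP rank i.
    Every rank exchanges a chunk with every other rank, so per-layer traffic
    grows with the EP degree -- the reason large-EP inference stresses the
    network fabric.
    """
    recv = torch.empty_like(tokens)
    dist.all_to_all_single(recv, tokens, group=ep_group)
    return recv
```

Note that each MoE layer performs an exchange like this twice (dispatch and combine), so the volume multiplies quickly as EP scales out.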
2. Solutions and Best Practices
2.1 Build HPN Network for All‑to‑All Traffic
Physical layer: Build a “4µs end‑to‑end low‑latency” HPN cluster with adaptive routing to eliminate ECMP hash conflicts (a toy simulation after this list illustrates the problem).
Traffic management: Queue All‑to‑All traffic separately, allocate more buffers and bandwidth to high‑priority queues, and disable ECN for those queues.
Communication components: Optimize All‑to‑All operators for a ~20% throughput gain and keep the expert load‑balance ratio below 1.2.
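The hash conflicts mentioned above arise when static ECMP hashes several large flows onto the same link. A toy simulation, with a made‑up flow set and link count, shows how static hashing can overload one link while adaptive (load‑aware) routing spreads flows evenly:

```python
# Toy simulation: static ECMP hashing vs. load-aware (adaptive) routing.
# Flow sizes and link count are made up purely for illustration.
import random

LINKS = 4
flows = [(f"flow-{i}", random.randint(50, 100)) for i in range(8)]

# Static ECMP: link choice depends only on a hash of the flow ID, so several
# elephant flows can collide on one link regardless of its current load.
ecmp_load = [0] * LINKS
for name, size in flows:
    ecmp_load[hash(name) % LINKS] += size

# Adaptive routing: place each flow on the currently least-loaded link.
adaptive_load = [0] * LINKS
for name, size in flows:
    adaptive_load[adaptive_load.index(min(adaptive_load))] += size

print("ECMP link loads:    ", ecmp_load)      # often badly skewed
print("Adaptive link loads:", adaptive_load)  # close to even
```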
2.2 Manage All‑to‑All and KV‑Cache Traffic
Prioritize All‑to‑All traffic in NIC queues and reserve bandwidth.
Isolate KV‑Cache traffic on a dedicated DCN network with elastic RDMA for full‑bandwidth transmission (a traffic‑marking sketch follows this list).
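One standard way to keep two traffic classes in separate NIC and switch queues is DSCP marking; switches then map DSCP values to priority queues with their own buffer and ECN settings. The sketch below uses the plain socket API, and the specific DSCP values are assumptions; an RDMA deployment would instead set the traffic class on the RDMA connection itself.

```python
# Sketch: mark All-to-All and KV-Cache flows with different DSCP values so
# the fabric can queue them separately. The DSCP values are illustrative
# assumptions; real RDMA traffic would set the traffic class on the
# connection/QP rather than on a TCP socket.
import socket

DSCP_ALL_TO_ALL = 46   # e.g. mapped to a high-priority queue with more buffer
DSCP_KV_CACHE   = 26   # e.g. carried on the separate DCN path

def mark_flow(sock: socket.socket, dscp: int) -> None:
    # IP_TOS carries the DSCP code point in its upper 6 bits.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)

a2a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mark_flow(a2a, DSCP_ALL_TO_ALL)

kv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mark_flow(kv, DSCP_KV_CACHE)
```

Disabling ECN for the high‑priority queues, as described above, is configured on the switch side and is not shown here.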
2.3 Enhance Inference Service Communication Efficiency
Improve All‑to‑All operator performance (≈20% faster) and overlap computation with communication.
Dynamic redundant expert encoding keeps EP load balanced (first sketch after this list).
Dual‑stream optimization further raises overall throughput by over 20% (second sketch after this list).
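The load‑balance figure of 1.2 reads most naturally as the ratio of the hottest expert's load to the mean load. Below is a hedged sketch of measuring that ratio and greedily replicating hot experts; both the metric definition and the greedy policy are our assumptions for illustration, not the actual “dynamic redundant expert encoding” algorithm.

```python
# Sketch: measure EP load imbalance and greedily replicate hot experts.
# The max/mean metric and the greedy policy are assumptions for illustration.

def imbalance_ratio(token_counts: list[int]) -> float:
    """max load / mean load; 1.0 means perfectly balanced."""
    mean = sum(token_counts) / len(token_counts)
    return max(token_counts) / mean

def pick_redundant_experts(token_counts: list[int], budget: int) -> list[int]:
    """Spend `budget` extra replicas, each time on the expert whose
    per-replica load is currently highest."""
    load = {e: c for e, c in enumerate(token_counts)}
    replicas = {e: 1 for e in load}
    chosen = []
    for _ in range(budget):
        hot = max(load, key=lambda e: load[e] / replicas[e])
        replicas[hot] += 1
        chosen.append(hot)
    return chosen
```

With this definition, “below 1.2” means the busiest expert handles at most 20% more tokens than the average expert.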
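Dual‑stream optimization typically means issuing communication on a side CUDA stream so that independent computation keeps the GPU busy in the meantime. Here is a minimal PyTorch sketch of the pattern; the placeholder compute and event choreography are illustrative, not Baidu's operator implementation.

```python
# Sketch: overlap All-to-All communication with independent computation
# using two CUDA streams. Kernels and shapes are placeholders.
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()

def dual_stream_step(x: torch.Tensor, a2a_in: torch.Tensor) -> torch.Tensor:
    a2a_out = torch.empty_like(a2a_in)
    done = torch.cuda.Event()

    # The side stream must see a2a_in fully written by the default stream.
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_to_all_single(a2a_out, a2a_in)  # communication kernel
        done.record()

    # Independent compute proceeds on the default stream in parallel.
    y = torch.relu(x @ x.transpose(-1, -2))

    # Block only at the point where the communication result is consumed.
    torch.cuda.current_stream().wait_event(done)
    return y + a2a_out.mean()
```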
3. Summary
Baidu Intelligent Cloud’s large‑scale PD‑separated inference optimizations demonstrate the critical importance of tightly integrating network infrastructure, communication components, and AI workload characteristics to achieve low‑latency, high‑throughput AI services.
