How to Build a Low‑Latency, Lossless RoCE Network for High‑Performance Data Centers
This article explains how to design a low‑overhead, high‑performance lossless RoCE network for data centers, covering RDMA basics, mainstream network options, QoS, lossless and congestion‑control designs, buffer management, deadlock analysis, and practical tuning to achieve sub‑100 µs latency and near‑full bandwidth utilization.
Low‑Overhead, High‑Performance Lossless Network Selection
E‑commerce, live streaming, and other latency‑sensitive services require ultra‑fast request‑response cycles. At the same time, rapid advances in compute and storage are driving HPC, distributed training clusters, and hyper‑converged infrastructure, making the network the primary performance bottleneck. To address this, we designed a low‑overhead, high‑performance RoCE network and built a low‑latency, lossless Ethernet data‑center fabric as the foundation for RDMA and future physical‑network upgrades.
Why RDMA?
Traditional intra‑datacenter traffic is handled by the kernel TCP/IP stack or by user‑space stacks such as DPDK, both of which spend significant CPU cycles on protocol processing. RDMA offloads the transport protocol to the NIC and lets applications read and write remote memory directly, removing protocol processing from the CPU and dramatically reducing latency.
RDMA Requirements for Lossless Networks
Minimal packet loss – RDMA retransmission is expensive (NICs typically use go‑back‑N), so even small loss rates add large tail latency.
Maximum throughput – fully utilized links.
Ultra‑low latency – even 100 µs is considered long.
Major RDMA Network Options
1. InfiniBand – A purpose‑built stack that redefines the physical, link, network, and transport layers; requires dedicated InfiniBand switches and NICs; highest cost, best performance.
2. iWARP – Layers RDMA over TCP/IP and relies on TCP for reliable delivery, so it tolerates loss, but it inherits TCP's processing overhead and delivers lower performance.
3. RoCEv2 – Encapsulates RDMA in UDP/IP over standard Ethernet (and is therefore routable); relies on PFC for losslessness and DCQCN for congestion control, turning commodity Ethernet into a lossless fabric.
Network Design Goals
Full bandwidth utilization under all traffic models.
Minimal buffer occupancy, and therefore minimal queuing latency.
No packet loss, even when buffers come under heavy pressure.
Three‑Step Design Approach
QoS Design – Define queues, scheduling and shaping actions.
Lossless Design – Use PFC to guarantee no loss under congestion.
Congestion‑Control Design – Deploy DCQCN to throttle sources when congestion is detected.
QoS Design Details
We classify traffic by DSCP, ToS, or CoS markings. At the IDC edge we match on packet fields and rewrite DSCP; inside the IDC we simply trust DSCP, keeping forwarding fast. Example policies (a mapping sketch follows the list):
ToR downlink and border uplink ports: capture specific packets and map to dedicated queues.
All other ports: trust DSCP and map to corresponding queues.
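The exact DSCP values and queue numbers are deployment‑specific; the sketch below assumes a hypothetical mapping in which RoCE data rides in a PFC‑protected queue and CNPs get a strict‑priority queue.

```python
# Minimal sketch of a DSCP -> egress-queue classification table.
# The DSCP values and queue numbers below are illustrative assumptions,
# not the values used in production.

DSCP_TO_QUEUE = {
    26: 3,   # RoCE data traffic -> lossless queue 3 (PFC-enabled)
    48: 6,   # CNPs (congestion notification packets) -> strict-priority queue 6
    0:  0,   # best-effort traffic -> default queue 0
}

def classify(dscp: int) -> int:
    """Return the egress queue for a packet, falling back to best effort."""
    return DSCP_TO_QUEUE.get(dscp, 0)

if __name__ == "__main__":
    for dscp in (26, 48, 7):
        print(f"DSCP {dscp} -> queue {classify(dscp)}")
```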
Lossless Design and Analysis
RoCE traffic runs in lossless queues protected by PFC. The Broadcom XGS series provides a Memory Management Unit (MMU) that tracks buffer cells per ingress priority group (PG) and per egress queue. Buffer accounting is divided into PG‑Guaranteed, PG‑Share, and Headroom waterlines on the ingress side, and Queue‑Guaranteed and Queue‑Share waterlines on the egress side. PG‑Guaranteed and Queue‑Guaranteed are reserved; PG‑Share and Queue‑Share draw from the shared pool, and the switch issues PFC pause frames when a PG's occupancy crosses its PG‑Share threshold. Headroom then absorbs the data still in flight until the upstream port stops sending.
Buffer Waterline Configuration
Key formulas:
(PG‑Share + PG‑Guaranteed + Headroom) × [number of ingress ports] ≤ Queue‑Share + Queue‑Guaranteed
(the egress side must be able to absorb everything the ingress side can admit, so PFC, which is driven by ingress accounting, always fires before an egress drop)
PG‑Share = [remaining shared buffer] × α (α is the dynamic‑threshold scaling factor).
Headroom is sized to absorb the data still in flight between the moment PFC is triggered and the moment the upstream port actually stops:
[Headroom] = (Tm1 + Tr1 + Tm2 + 2 × [cable length] / [signal propagation speed]) × [port rate] / [bits per 64‑byte packet]
where the T terms cover PFC frame generation and response delays at the two ends, the 2 × cable‑length term is the round‑trip propagation delay, and dividing by the bits in a 64‑byte packet expresses the result in worst‑case small packets, each of which occupies one buffer cell.
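As a rough illustration, the sketch below plugs assumed values into the headroom formula; the PFC delay terms, cable length, cell size, and port rate are hypothetical, not our production figures.

```python
# Hedged sketch: estimating PFC headroom per the formula above.
# All timing values, cell size, and cable length are illustrative assumptions.

PORT_RATE_BPS    = 25e9      # 25 Gbps port
CABLE_LENGTH_M   = 100       # metres of fibre/DAC (assumed)
SIGNAL_SPEED_MPS = 2e8       # roughly 2/3 of c in the medium
T_PFC_DELAYS_S   = 3e-6      # Tm1 + Tr1 + Tm2: PFC generation/response delays (assumed)
PACKET_BITS      = 64 * 8    # worst case: back-to-back 64-byte packets
CELL_BYTES       = 256       # assumed MMU cell size

round_trip_s   = 2 * CABLE_LENGTH_M / SIGNAL_SPEED_MPS
headroom_pkts  = (T_PFC_DELAYS_S + round_trip_s) * PORT_RATE_BPS / PACKET_BITS
headroom_bytes = headroom_pkts * CELL_BYTES   # each small packet occupies one cell

print(f"round-trip propagation: {round_trip_s * 1e6:.2f} us")
print(f"headroom: {headroom_pkts:.0f} packets (~{headroom_bytes / 1024:.0f} KiB of cells)")
```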
Deadlock Analysis and Mitigation
Deadlocks arise from circular buffer dependencies (CBD): a cycle of ports in which each waits for the next to drain while PFC keeps them all paused. In a stable Clos topology no such cycle exists, but during network convergence transient micro‑loops can create temporary CBD. We mitigate deadlocks by:
Designing convergence protocols to avoid micro‑loops.
Enabling switch deadlock detection and buffer release (accepting occasional loss).
Raising PG‑Share thresholds so PFC fires less often, and relying on DCQCN to suppress congestion at the source.
Congestion‑Control Design and Analysis
RoCE uses the DCQCN algorithm (see “Congestion Control for Large‑Scale RDMA Deployments”, SIGCOMM 2015). The RP (reaction point, the sending NIC) starts at line rate; when the switch marks packets with ECN, the NP (notification point, the receiving NIC) returns a CNP (congestion notification packet), and the RP cuts its sending rate. If no CNPs arrive for a while, the RP ramps its rate back up. A simplified sketch of the RP rate update follows.
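The sketch below is a simplified rendering of the RP rate‑update rules from the DCQCN paper; it collapses the paper's fast‑recovery, additive‑increase, and hyper‑increase phases into a single recovery step, and the constants (g, R_AI, line rate) are illustrative assumptions.

```python
# Simplified sketch of the DCQCN reaction-point (RP) rate update.
# Constants are illustrative; real NIC firmware exposes them as tunables.

G      = 1 / 256        # alpha averaging gain (assumed)
R_AI   = 40e6           # additive-increase step in bps (assumed)
R_LINE = 25e9           # line rate, 25 Gbps

class ReactionPoint:
    def __init__(self):
        self.rc = R_LINE      # current sending rate
        self.rt = R_LINE      # target rate
        self.alpha = 1.0      # congestion estimate

    def on_cnp(self):
        """A CNP arrived: remember the current rate and cut it."""
        self.rt = self.rc
        self.alpha = (1 - G) * self.alpha + G
        self.rc = self.rc * (1 - self.alpha / 2)

    def on_quiet_period(self):
        """No CNP during an update period: decay alpha, then recover rate."""
        self.alpha = (1 - G) * self.alpha
        self.rt = min(self.rt + R_AI, R_LINE)   # additive increase of the target
        self.rc = (self.rc + self.rt) / 2       # converge toward the target

rp = ReactionPoint()
rp.on_cnp()
print(f"after CNP: {rp.rc / 1e9:.2f} Gbps")
for _ in range(10):
    rp.on_quiet_period()
print(f"after recovery: {rp.rc / 1e9:.2f} Gbps")
```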
Key parameters on the CP (congestion point, the switch) are Kmin, Kmax, and Pmax of the WRED‑ECN marking curve. Below Kmin no packets are marked; between Kmin and Kmax the marking probability rises linearly from 0 to Pmax; above Kmax every packet is marked. Kmin therefore directly influences base queuing latency, while Kmax bounds the queue depth at which marking becomes certain (a sketch of the curve follows).
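To make the curve concrete, here is a small sketch of the marking probability as a function of queue depth; the Kmin, Kmax, and Pmax values are placeholders, not recommended settings.

```python
# Sketch of the WRED-ECN marking probability curve described above.
# Kmin/Kmax/Pmax values are illustrative assumptions.

K_MIN_KB = 100     # queue depth (KB) where marking starts (assumed)
K_MAX_KB = 400     # queue depth where marking saturates (assumed)
P_MAX    = 0.2     # marking probability at K_MAX (assumed)

def mark_probability(queue_kb: float) -> float:
    """ECN marking probability as a function of queue depth."""
    if queue_kb <= K_MIN_KB:
        return 0.0
    if queue_kb >= K_MAX_KB:
        return 1.0
    # linear ramp from 0 to P_MAX between Kmin and Kmax
    return P_MAX * (queue_kb - K_MIN_KB) / (K_MAX_KB - K_MIN_KB)

for depth in (50, 150, 300, 450):
    print(f"queue {depth:>3} KB -> mark probability {mark_probability(depth):.2f}")
```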
Practical Tuning and Experiments
We set concrete targets: server‑port throughput above 95 %, a PFC pause rate below 5 pps for 99 % of the time, and end‑to‑end latency of at most 80 µs (with 90 % of samples at or below 40 µs). By testing against more than 50 traffic models we identified parameter sets that meet all of these goals, demonstrating that DCQCN tuning is complex but achievable.
Conclusion
By implementing QoS, lossless Ethernet (PFC) and DCQCN congestion control, we equipped the physical network to support RoCE, enabling high‑performance services such as 1.2 M IOPS SSD storage and 25 Gbps intra‑datacenter bandwidth.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.
