Why RDMA Needs Lossless Networks: Layer‑2 vs Layer‑3 Traffic Classification & Flow Control
RDMA boosts data‑center performance by bypassing CPU‑intensive copies, but to realize its low‑latency promise you must build a lossless network using proper traffic classification—either Layer 2 PCP or Layer 3 DSCP—and combine PFC with ECN/DCQCN for precise congestion management.
Why RDMA is Needed
In the era of cloud computing and big data, growing business workloads demand ever‑higher storage I/O performance. Traditional TCP/IP incurs heavy CPU and memory overhead due to multiple data copies between system memory, CPU caches, and NIC buffers. RDMA (Remote Direct Memory Access) eliminates these copies by allowing user‑space applications to read and write remote memory directly, offloading the network stack to the NIC hardware and achieving high throughput, ultra‑low latency, and low CPU usage.
Differentiated Traffic Classification
To build a lossless network, traffic must first be classified so that appropriate flow‑control policies can be applied. Classification can be performed at Layer 2 using the VLAN PCP bits (Class of Service) or at Layer 3 using the IP DSCP field.
Layer 2 Traffic Classification
Layer 2 classification relies on the three-bit PCP (Priority Code Point) field in the 802.1Q VLAN tag, which provides eight priority classes. Every packet must therefore carry a VLAN tag, so the NIC must be configured with the appropriate VLAN and priority. Because PCP-based PFC depends on the VLAN tag, the priority can be lost when packets cross Layer 3 switches, which may strip or rewrite the tag as they route between subnets.
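To make the bit layout concrete, here is a minimal Python sketch of how the PCP sits inside the 4-byte 802.1Q tag. The function names and the example values (priority 3, VLAN 100) are illustrative assumptions, not settings taken from this article.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that identifies an 802.1Q tag


def build_vlan_tag(pcp: int, vid: int, dei: int = 0) -> bytes:
    """Pack a 4-byte 802.1Q tag: TPID (16 bits), then PCP (3), DEI (1), VID (12)."""
    if not 0 <= pcp <= 7:
        raise ValueError("PCP is a 3-bit field (0-7)")
    if dei not in (0, 1):
        raise ValueError("DEI is a single bit")
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VLAN ID is a 12-bit field (0-4095)")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)


def parse_vlan_tag(tag: bytes) -> dict:
    """Recover PCP, DEI and VLAN ID from a 4-byte 802.1Q tag."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return {"pcp": tci >> 13, "dei": (tci >> 12) & 1, "vid": tci & 0xFFF}


# Hypothetical example: RDMA traffic on priority 3 in VLAN 100.
tag = build_vlan_tag(pcp=3, vid=100)
fields = parse_vlan_tag(tag)
```

If a device along the path removes these four bytes, the PCP disappears with them, which is why this scheme is fragile across routed hops.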
Layer 3 Traffic Classification
Layer 3 classification uses the upper six bits of the IP header’s TOS byte, the DSCP field, which supports 64 distinct traffic classes. The remaining two bits carry Explicit Congestion Notification (ECN), an end-to-end congestion-signaling mechanism.
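The split of the TOS byte into DSCP and ECN can be sketched in a few lines of Python. The helper names are hypothetical, and DSCP 26 is only an example value; the ECN codepoint names follow RFC 3168.

```python
from enum import IntEnum


class Ecn(IntEnum):
    """The four ECN codepoints carried in the two low-order bits of the TOS byte."""
    NOT_ECT = 0b00  # transport is not ECN-capable
    ECT_1 = 0b01    # ECN-capable transport, codepoint ECT(1)
    ECT_0 = 0b10    # ECN-capable transport, codepoint ECT(0)
    CE = 0b11       # congestion experienced


def make_tos(dscp: int, ecn: Ecn) -> int:
    """Combine a 6-bit DSCP and a 2-bit ECN codepoint into one TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit field (0-63)")
    return (dscp << 2) | int(ecn)


def split_tos(tos: int) -> tuple:
    """Return (dscp, ecn) for a TOS byte."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS is a single byte")
    return tos >> 2, Ecn(tos & 0b11)


# Hypothetical example: DSCP 26 with the ECT(0) codepoint.
tos = make_tos(26, Ecn.ECT_0)
dscp, ecn = split_tos(tos)
```

Because both fields live in the IP header, they survive routing hops that would discard a VLAN tag.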
Choosing Layer 2 or Layer 3 Flow Control
When switches support DSCP, Layer 3 flow control is recommended because the DSCP value travels in the IP header and remains unchanged end-to-end, allowing consistent classification across multiple switches. Since RoCEv2 is encapsulated in UDP/IP and can be routed, DSCP-based flow control is the natural fit for it.
Building a Lossless Network
PFC Flow‑Control Based on DSCP or PCP
IEEE 802.1Qbb (Priority-based Flow Control, PFC) extends the IEEE 802.3x PAUSE mechanism by creating eight virtual channels on an Ethernet link, each with its own priority. Individual channels can be paused and resumed without affecting the others, which maps onto the eight hardware transmit queues of a SmartNIC.
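As an illustration of per-priority pausing, the following Python sketch assembles a PFC frame: a MAC Control frame (EtherType 0x8808, opcode 0x0101) carrying a priority-enable vector and eight pause timers, measured in quanta of 512 bit times. The function name, the source MAC and the chosen priority are assumptions made for this example; in practice the NIC or switch hardware generates these frames.

```python
import struct

PFC_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101
MIN_FRAME_LEN = 60  # minimum Ethernet frame size without the FCS


def build_pfc_frame(src_mac: bytes, pause_quanta: dict) -> bytes:
    """Build a PFC frame that pauses the given priorities.

    pause_quanta maps priority (0-7) to a pause time in quanta (0-65535).
    A time of 0 for an enabled priority resumes that priority.
    """
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    enable_vector = 0
    timers = [0] * 8
    for prio, quanta in pause_quanta.items():
        if not 0 <= prio <= 7:
            raise ValueError("priority must be 0-7")
        if not 0 <= quanta <= 0xFFFF:
            raise ValueError("pause time must fit in 16 bits")
        enable_vector |= 1 << prio  # one enable bit per priority
        timers[prio] = quanta
    body = struct.pack("!HHH8H", MAC_CONTROL_ETHERTYPE, PFC_OPCODE,
                       enable_vector, *timers)
    frame = PFC_DEST_MAC + src_mac + body
    return frame.ljust(MIN_FRAME_LEN, b"\x00")  # zero-pad to the minimum size


# Hypothetical example: pause only priority 3 for the maximum time.
frame = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})
```

Only the priorities whose enable bit is set are affected; traffic in the other seven classes keeps flowing.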
In a Layer 2 network, PFC distinguishes flows using the VLAN PCP bits; in a Layer 3 network, it can use either PCP or DSCP. Modern data centers typically use Layer 3 with DSCP because the DSCP value is preserved across the entire path.
While PFC can prevent packet loss by pausing traffic before switch buffers overflow, it suffers from inefficiencies such as head‑of‑line blocking and unfairness. The most effective control is at the source host, throttling the data injection rate.
Layer 3 Congestion Management (ECN and DCQCN)
DCQCN combines ECN marking on switches with rate control on NICs to provide end-to-end congestion control for RoCEv2. When a switch detects congestion, it marks the ECN bits in the IP header. The receiver then sends Congestion Notification Packets (CNPs) back to the sender, which reduces its transmission rate.
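ECN uses the two low-order bits of the IP header’s TOS byte, which sit next to the six DSCP bits. The field encodes four codepoints: 00 means the transport is not ECN-capable, 01 and 10 are the two ECT (ECN-Capable Transport) codepoints, and 11 is CE (Congestion Experienced).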
ECN Interaction Process
The sender transmits IP packets marked as ECN-capable (ECT, binary 10).
If a switch’s queue becomes congested, it changes the ECN field to 11 and forwards the packet.
The receiver processes the ECN‑marked packet and detects congestion.
The receiver generates a Congestion Notification Packet (CNP) with ECN set to 01, ensuring the packet is not dropped.
Switches forward the CNP normally.
The sender receives the CNP and applies a rate‑limiting algorithm to the affected flow.
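The sender-side reaction in the last step can be sketched as follows. This is a simplified Python model based on the published DCQCN description, not a NIC implementation: the class and method names are hypothetical, g = 1/256 is the value suggested in the DCQCN paper, and the timers, byte counters and additive-increase stages of the real algorithm are left out.

```python
class DcqcnSender:
    """Simplified DCQCN reaction point: rate cut on CNP, fast recovery afterwards."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 256):
        self.line_rate = line_rate_gbps
        self.current_rate = line_rate_gbps  # rate the flow is sending at now
        self.target_rate = line_rate_gbps   # rate to recover toward
        self.alpha = 1.0                    # running estimate of congestion severity
        self.g = g                          # gain used when updating alpha

    def on_cnp(self) -> None:
        """A CNP arrived: remember the old rate, cut the rate, raise alpha."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """No CNP seen during the last interval: let alpha decay."""
        self.alpha *= 1 - self.g

    def on_recovery_step(self) -> None:
        """Fast recovery: move halfway back toward the target rate."""
        self.current_rate = min(self.line_rate,
                                (self.current_rate + self.target_rate) / 2)


# Hypothetical example: a 100 Gbps flow receives one CNP, then recovers one step.
sender = DcqcnSender(line_rate_gbps=100.0)
sender.on_cnp()
rate_after_cnp = sender.current_rate
sender.on_recovery_step()
rate_after_recovery = sender.current_rate
```

The larger alpha is, the deeper each cut; a flow that keeps receiving CNPs is throttled more aggressively than one that sees an isolated mark.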
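Because a CNP must travel back through every device on the path, the feedback loop adds latency; if the sender cannot react quickly enough, buffers can still overflow and packets may be lost. It is therefore advisable to configure both ECN and PFC and to tune their buffer thresholds so that ECN marking starts before PFC is triggered. ECN then handles most congestion by slowing the sender, and PFC acts as a last-resort safety net when ECN reacts too slowly.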
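The threshold ordering can be expressed as a small check. The Python sketch below assumes a RED-style ECN marking profile (no marking below a minimum queue depth, always marking above a maximum, a linear ramp in between) and a single PFC XOFF threshold. The class name and all numeric values are hypothetical and are not tuning recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueThresholds:
    """Per-queue buffer thresholds (in KB) for ECN marking and PFC pause."""
    ecn_kmin_kb: int    # below this depth, never mark
    ecn_kmax_kb: int    # at or above this depth, always mark
    ecn_pmax: float     # marking probability reached just below kmax
    pfc_xoff_kb: int    # at or above this depth, send a PFC pause

    def __post_init__(self):
        if not 0 <= self.ecn_kmin_kb < self.ecn_kmax_kb:
            raise ValueError("need 0 <= ecn_kmin_kb < ecn_kmax_kb")
        if not 0 < self.ecn_pmax <= 1:
            raise ValueError("ecn_pmax must be in (0, 1]")
        if self.ecn_kmax_kb >= self.pfc_xoff_kb:
            raise ValueError("ECN must trigger before PFC: "
                             "need ecn_kmax_kb < pfc_xoff_kb")

    def ecn_mark_probability(self, depth_kb: float) -> float:
        """Probability that a packet is marked CE at this queue depth."""
        if depth_kb <= self.ecn_kmin_kb:
            return 0.0
        if depth_kb >= self.ecn_kmax_kb:
            return 1.0
        span = self.ecn_kmax_kb - self.ecn_kmin_kb
        return self.ecn_pmax * (depth_kb - self.ecn_kmin_kb) / span

    def pfc_pause_needed(self, depth_kb: float) -> bool:
        """True once the queue is deep enough that PFC must pause the sender."""
        return depth_kb >= self.pfc_xoff_kb


# Hypothetical thresholds: ECN ramps between 100 KB and 400 KB, PFC at 800 KB.
thresholds = QueueThresholds(ecn_kmin_kb=100, ecn_kmax_kb=400,
                             ecn_pmax=0.2, pfc_xoff_kb=800)
```

With these example numbers, a queue at 500 KB marks every packet but does not yet pause the link, leaving the sender room to slow down before PFC engages.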
Summary
By classifying network traffic with DSCP at Layer 3 and configuring PFC together with ECN, we gain precise control of RDMA flows and build a lossless network that avoids congestion-induced packet loss, thereby delivering RDMA’s high throughput, ultra-low latency, and low CPU overhead in practice.
Qingyun Technology Community
