Managing Congestion in Ethernet Storage Networks – Part 7: VXLAN and Lossless Traffic
This article explains how Ethernet storage networks handle congestion by detailing MAC learning methods, lossless traffic classification over VXLAN using DSCP and ECN, PFC flow control, congestion notification, troubleshooting steps, and preventive measures, supported by diagrams and industry references.
MAC Address Learning
A VTEP can learn the MAC addresses of devices attached to a remote VTEP in two common ways: multicast‑based flood‑and‑learn and MP‑BGP EVPN. Regardless of the method, the data path remains unchanged, so congestion management is unaffected.
Lossless Traffic over VXLAN
Lossless transport over VXLAN relies on the DSCP field in the IP header: a VTEP classifies traffic by DSCP and assigns it to a lossless queue. The Layer‑2 classification used earlier for PFC, based on the CoS field, is insufficient for VXLAN because the IEEE 802.1Q VLAN header is not retained inside the VXLAN tunnel, so the CoS value is lost.
VXLAN Encapsulation
During encapsulation (see Figure 7‑25), the ingress VTEP copies the DSCP value from the original IP header into the outer IP header of the VXLAN packet. Layer‑2 frames that lack an IP header have no DSCP value to copy, so the VTEP derives one from the frame's CoS value using the CoS‑to‑DSCP mapping table (Table 7‑1).
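The ingress decision can be sketched in a few lines of Python. This is only an illustration: Table 7‑1 is not reproduced here, so the `COS_TO_DSCP` table below is a hypothetical class‑selector mapping (DSCP = CoS × 8), and the function name is invented for the example.

```python
from typing import Optional

# Hypothetical CoS-to-DSCP table (class-selector style: DSCP = CoS * 8).
# This is NOT Table 7-1; substitute the mapping your platform really uses.
COS_TO_DSCP = {cos: cos * 8 for cos in range(8)}


def outer_dscp_on_encap(inner_dscp: Optional[int], cos: int = 0) -> int:
    """Return the DSCP value the ingress VTEP writes into the outer header.

    inner_dscp: DSCP of the original IP header, or None for a Layer-2
                frame that carries no IP header.
    cos:        802.1Q CoS value of the original frame (0-7).
    """
    if inner_dscp is not None:
        return inner_dscp          # IP packet: copy DSCP unchanged
    return COS_TO_DSCP[cos]        # non-IP frame: derive DSCP from CoS


# An IP packet marked CS3 (DSCP 24) keeps its marking.
print(outer_dscp_on_encap(24))            # 24
# A non-IP frame with CoS 3 maps to DSCP 24 under the assumed table.
print(outer_dscp_on_encap(None, cos=3))   # 24
```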
VXLAN Decapsulation
At egress, the VTEP copies the DSCP value from the outer header of the VXLAN packet into the decapsulated IP header. This "uniform" mode is the default on Cisco Nexus 9000 switches. In "pipe" mode, the DSCP value of the inner IP header is instead preserved in the decapsulated packet.
Congestion Notification over VXLAN
At the ingress VTEP, the ECN value from the original packet is copied to the outer header of the VXLAN packet. At the egress VTEP, the ECN value is always copied from the outer header back to the decapsulated IP header, irrespective of whether the switch operates in uniform or pipe mode.
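The egress behavior for both fields can be sketched on the ToS byte, where the upper six bits carry DSCP and the lower two bits carry ECN. This is a minimal illustration of the rules stated above, not switch code; the function name and the `copy_outer_dscp` flag are assumptions made for the example. The flag stands for the default mode that takes DSCP from the outer header; setting it to `False` models the mode that keeps the inner DSCP.

```python
def tos_after_decap(outer_tos: int, inner_tos: int,
                    copy_outer_dscp: bool = True) -> int:
    """Return the ToS byte of the decapsulated IP packet.

    copy_outer_dscp=True  models the default mode (DSCP from outer header).
    copy_outer_dscp=False models the mode that preserves the inner DSCP.
    ECN is taken from the outer header in both cases.
    """
    outer_dscp = outer_tos >> 2
    outer_ecn = outer_tos & 0b11
    inner_dscp = inner_tos >> 2

    dscp = outer_dscp if copy_outer_dscp else inner_dscp
    return (dscp << 2) | outer_ecn


CE = 0b11      # Congestion Experienced
ECT0 = 0b10    # ECN-Capable Transport (0)

outer = (24 << 2) | CE      # outer header: CS3, marked CE in the fabric
inner = (18 << 2) | ECT0    # inner header: AF21, still ECT(0)

print(tos_after_decap(outer, inner))                         # 99: CS3 + CE
print(tos_after_decap(outer, inner, copy_outer_dscp=False))  # 75: AF21 + CE
```

In both calls the CE mark set inside the fabric reaches the end host, which is what lets ECN‑based congestion notification work across the tunnel.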
Flow Control and Congestion Notification with VXLAN
Two considerations for lossless VXLAN traffic are presented:
Mandatory: hop‑by‑hop flow control using PFC to enforce lossless behavior.
Optional: ECN‑based congestion notification to inform end hosts when congestion is detected between the ingress and egress VTEPs.
In the example (Figure 7‑26), traffic marked with DSCP CS3 is mapped to a lossless queue. Ingress VTEP‑1 copies the DSCP value to the outer header, ensuring lossless handling through the VXLAN tunnel. At egress VTEP‑6, the DSCP value is copied back, so the CS3‑marked traffic is placed on lossless queues on all devices, matching the behavior of a non‑VXLAN environment.
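A small sketch shows why copying DSCP to the outer header is enough: every device on the path can classify on the header it sees and reach the same decision. The queue names and the intermediate hop name `spine` are hypothetical; only VTEP‑1, VTEP‑6, and the CS3 marking come from the example.

```python
CS3 = 24  # DSCP value of Class Selector 3 (binary 011000)

# Hypothetical queuing policy: only CS3 is mapped to the lossless queue.
LOSSLESS_DSCP = {CS3}


def select_queue(tos: int) -> str:
    """Classify a packet by the DSCP bits of the IP header a device sees."""
    dscp = tos >> 2
    return "lossless" if dscp in LOSSLESS_DSCP else "default"


def queues_along_path(inner_dscp: int, hops: list) -> dict:
    """Return the queue chosen at each hop for one packet."""
    # The ingress VTEP copies DSCP to the outer header, so transit devices
    # that only inspect the outer header still see the original marking.
    outer_tos = inner_dscp << 2
    return {hop: select_queue(outer_tos) for hop in hops}


print(queues_along_path(CS3, ["VTEP-1", "spine", "VTEP-6"]))
print(queues_along_path(0, ["VTEP-1", "spine", "VTEP-6"]))
```

The first call places the packet in the lossless queue at every hop; the second, for best‑effort traffic, lands in the default queue everywhere.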
Congestion Management in VXLAN
Four practical points are highlighted:
Understanding congestion: After enabling PFC, congestion can spread across the VXLAN fabric. When an egress VTEP (or leaf switch) queue fills, it sends pause frames, slowing all traffic in that class, regardless of VXLAN encapsulation.
Detecting congestion: Detection methods follow the earlier discussion and must consider the DSCP‑CoS mapping on each VTEP.
Troubleshooting congestion: Look for the IP addresses of the outer VXLAN header, which reveal the ingress and egress VTEPs. Focus on monitoring lossless‑class traffic and pause frames rather than total traffic volume.
Preventing congestion: ECN‑based mechanisms work end to end, regardless of the underlying VXLAN network. For example, RoCEv2 traffic can traverse VXLAN and still benefit from RoCEv2 Congestion Management (RCM).
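For the troubleshooting point, the following sketch pulls the ingress and egress VTEP addresses, the DSCP and ECN markings, and the VNI out of a captured VXLAN packet. It assumes an untagged outer Ethernet header and an outer IPv4 header; `build_demo_frame` and the addresses are hypothetical and exist only to produce test input.

```python
import socket
import struct

VXLAN_UDP_PORT = 4789  # IANA-registered destination port for VXLAN


def parse_outer_headers(frame: bytes) -> dict:
    """Extract VTEP addresses and QoS markings from a VXLAN packet."""
    if len(frame) < 14 + 20 + 8 + 8:
        raise ValueError("frame too short for Ethernet/IPv4/UDP/VXLAN")
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != 0x0800:
        raise ValueError("outer frame is not untagged IPv4")

    ip = 14                              # start of the outer IPv4 header
    ihl = (frame[ip] & 0x0F) * 4         # IPv4 header length in bytes
    tos = frame[ip + 1]
    if frame[ip + 9] != 17:              # IP protocol 17 = UDP
        raise ValueError("outer packet is not UDP")

    udp = ip + ihl
    (dst_port,) = struct.unpack_from("!H", frame, udp + 2)
    if dst_port != VXLAN_UDP_PORT:
        raise ValueError("UDP destination port is not VXLAN")

    vxlan = udp + 8                      # VXLAN header follows UDP
    return {
        "ingress_vtep": socket.inet_ntoa(frame[ip + 12:ip + 16]),
        "egress_vtep": socket.inet_ntoa(frame[ip + 16:ip + 20]),
        "dscp": tos >> 2,
        "ecn": tos & 0b11,
        "vni": int.from_bytes(frame[vxlan + 4:vxlan + 7], "big"),
    }


def build_demo_frame(src: str, dst: str, tos: int, vni: int) -> bytes:
    """Build outer headers only (hypothetical test input, zero checksums)."""
    eth = bytes(12) + struct.pack("!H", 0x0800)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, tos, 36, 0, 0, 64, 17, 0,
                     socket.inet_aton(src), socket.inet_aton(dst))
    udp = struct.pack("!HHHH", 49152, VXLAN_UDP_PORT, 16, 0)
    vxlan = struct.pack("!B3s3sB", 0x08, bytes(3), vni.to_bytes(3, "big"), 0)
    return eth + ip + udp + vxlan


demo = build_demo_frame("192.0.2.1", "192.0.2.6", (24 << 2) | 0b11, 10010)
print(parse_outer_headers(demo))
```

Grouping captured packets by the `(ingress_vtep, egress_vtep, dscp)` tuple is one way to narrow attention to lossless‑class flows between a specific pair of VTEPs.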
Summary of Ethernet Lossless Behavior
By default, Ethernet handles congestion by dropping frames (lossy Ethernet) and relies on upper‑layer protocols such as TCP to retransmit. Lossless Ethernet instead uses hop‑by‑hop flow control based on pause frames: link‑level flow control (LLFC) pauses all traffic on a link, whereas Priority‑based Flow Control (PFC) pauses only specific traffic classes. ETS provides minimum bandwidth guarantees, and DCBX simplifies configuration between endpoints and switches. Classification for PFC can use the VLAN PCP/CoS field at OSI Layer 2, which applies to FCoE, RoCE, and RoCEv2; RoCEv2 can also be classified at Layer 3 using IP DSCP.
Lossless Ethernet networks can suffer congestion similar to Fibre Channel due to slow drain, high link utilization, bit errors, or insufficient buffering. Metrics such as TxWait and RxWait are not yet exposed on Cisco Nexus 9000 switches, so pause‑frame counters and external monitoring platforms are recommended for detection.
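One way to turn raw pause‑frame counters into a congestion signal is to compare two samples and look at the rate of change. The sketch below assumes the counters have already been collected into dictionaries keyed by `(port, priority)`; the port names, counter values, and threshold are hypothetical, and counter wrap or reset is not handled.

```python
def pause_rates(prev: dict, curr: dict, interval_s: float) -> dict:
    """Return pause frames per second for each (port, priority) key.

    prev and curr are cumulative pause-frame counters sampled
    interval_s seconds apart. Keys missing from prev are skipped.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {
        key: (curr[key] - prev[key]) / interval_s
        for key in curr
        if key in prev
    }


def congested(rates: dict, threshold: float) -> list:
    """Return the keys whose pause rate exceeds the threshold, sorted."""
    return sorted(key for key, rate in rates.items() if rate > threshold)


# Hypothetical samples taken 10 seconds apart for priority 3.
prev = {("Eth1/1", 3): 1000, ("Eth1/2", 3): 50}
curr = {("Eth1/1", 3): 1500, ("Eth1/2", 3): 50}

rates = pause_rates(prev, curr, interval_s=10)
print(rates)
print(congested(rates, threshold=10))
```

A rising rate on a port that receives pause frames points toward the downstream device as the source of back pressure, which is the same reasoning used to trace slow drain.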
When mixed lossless and lossy traffic share a converged Ethernet fabric, monitoring port‑level and class‑level utilization is essential. A PFC watchdog can discard frames after a timeout, freeing buffers and aiding recovery.
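The watchdog idea can be sketched as a periodic check on each lossless queue: if the queue has been paused continuously for longer than a timeout, its frames are discarded so the buffer is released. This is a simplified model with invented names and timings, not the behavior of any particular switch; real implementations differ in how they detect the stuck state and how they restore the queue.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LosslessQueue:
    frames: deque = field(default_factory=deque)
    paused_since: Optional[float] = None  # time the current pause began
    dropped: int = 0                      # frames discarded by the watchdog


def watchdog_poll(queue: LosslessQueue, now: float, timeout: float) -> int:
    """Flush the queue if it has been paused for at least `timeout` seconds.

    Returns the number of frames discarded on this poll.
    """
    if queue.paused_since is None:
        return 0                          # not paused, nothing to do
    if now - queue.paused_since < timeout:
        return 0                          # paused, but not long enough yet
    flushed = len(queue.frames)
    queue.frames.clear()                  # free the buffer space
    queue.dropped += flushed
    return flushed


q = LosslessQueue(frames=deque(["f1", "f2", "f3"]), paused_since=0.0)
print(watchdog_poll(q, now=0.05, timeout=0.1))  # 0: still within the timeout
print(watchdog_poll(q, now=0.20, timeout=0.1))  # 3: queue flushed
```

Discarding frames makes the class lossy for the duration of the event, so the timeout is a trade‑off between protecting the rest of the fabric and preserving lossless delivery.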
Regardless of the preventive mechanism used, continuous monitoring, root‑cause analysis, and timely remediation remain critical. As Ethernet fabrics evolve, lessons from decades of Fibre Channel deployment should be applied to proactively prevent congestion.
References
FC‑BB‑5: http://fcoe.com/09-056v5.pdf
802.1 Data Center Bridging Task Group: http://www.ieee802.org/1/pages/dcbridges.html
802.1Qaz Enhanced Transmission Selection and DCBX: http://www.ieee802.org/1/pages/802.1az.html
802.1Qbb Priority‑based Flow Control: http://www.ieee802.org/1/pages/802.1bb.html
End‑to‑End QoS Network Design (Cisco Press)
Cisco Nexus 9000 Series NX‑OS Interfaces Configuration Guides
I/O Consolidation in the Data Center (Cisco Press)
Cisco Nexus 9000 Series NX‑OS Programmability Guide
NVMe‑oF Configuration with RoCEv2 – BRKDCN‑3282 (Cisco Live 2020)
Cisco White Paper: Priority Flow Control
RFC 3168 – The Addition of Explicit Congestion Notification (ECN) to IP
Building Data Centers with VXLAN BGP EVPN (Cisco Press)
VXLAN BGP EVPN Multi‑Site – BRKDCN‑2913 (Cisco Live 2022)
A Day in the Life of a VXLAN EVPN Multi‑Site Packet – BRKDCN‑2345 (Cisco Live 2022)
Implementation of PFC and RCM for RoCEv2 Simulation in OMNeT++
Revisiting Network Support for RDMA – SIGCOMM 2018
Cisco Nexus 9000 Series NX‑OS VXLAN Configuration Guide
Cisco IP Telephony Flash Cards: Weighted Random Early Detection (WRED): https://www.ciscopress.com/articles/article.asp?p=352991&seqNum=8
IANA Service Name and Transport Protocol Port Number Registry: https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.txt
SONiC PFC Watchdog Design: https://github.com/sonic-net/SONiC/wiki/PFC-Watchdog-Design
Cisco White Paper – RoCE Storage Implementation over NX‑OS VXLAN Fabrics
Cisco White Paper – Understanding FEC and Its Implementation in Cisco Optics
Cisco Troubleshooting Technote – Nexus 9000 Cloud Scale ASIC CRC Identification & Tracing Procedure
Cisco Troubleshooting Technote – Understand Cyclic Redundancy Check Errors on Nexus Switches
