Advanced Congestion Management Techniques for Lossless Ethernet Storage Networks

The article examines high‑level strategies for preventing and recovering from congestion in lossless Ethernet storage networks, including disconnecting faulty devices, early frame dropping, traffic isolation, endpoint notifications, rate limiting, pause‑timeout, PFC watchdog mechanisms, detailed Cisco configuration commands, and the benefits and limitations of each approach.

Linux Code Review Hub

Mar 6, 2024

Preventing Congestion in Lossless Ethernet Networks

High‑level methods used in Fibre Channel fabrics also apply to lossless Ethernet. The main techniques are disconnecting the offending device, early frame dropping, traffic isolation, notifying end devices, limiting traffic to congested ports, redesigning the network, and upgrading links.

Disconnecting Faulty Devices

A switch can monitor ingress pause frames on each port and disable any port that has been unable to transmit for several hundred milliseconds, which removes the source of congestion. This "big hammer" approach is widely used, but it takes the attached device offline.

Early Frame Dropping

Frames that remain in a switch beyond a timeout are dropped, freeing buffer space. Cisco Nexus 9000 switches use pause‑frame timeout and PFC watchdog features to implement this.

Traffic Isolation

Placing the offending device on dedicated VLANs and inter‑switch links (ISLs) isolates its traffic so that it cannot affect other devices. Multiple lossless classes can also be used, though how widely they are adopted in lossless Ethernet is still unclear.

Endpoint Congestion Notification

Switches can explicitly signal congestion to end devices using ECN (Explicit Congestion Notification). With RoCEv2, the source marks its packets as ECN‑capable transport (ECT), a congested switch rewrites that marking to CE (Congestion Experienced), and the destination reports the CE‑marked packets back to the source, which then reduces its transmission rate.
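
As a minimal sketch of the marking step, the snippet below manipulates the two ECN bits defined by RFC 3168, which occupy the low‑order bits of the IP TOS / Traffic Class byte. The function names are hypothetical and the code models only the bit rewrite, not any switch's actual queueing logic.

```python
# ECN codepoints from RFC 3168 (two low-order bits of the TOS byte).
NOT_ECT = 0b00  # sender is not ECN-capable
ECT_1 = 0b01    # ECN-capable transport, codepoint 1
ECT_0 = 0b10    # ECN-capable transport, codepoint 0
CE = 0b11       # congestion experienced
ECN_MASK = 0b11


def ecn_of(tos: int) -> int:
    """Extract the ECN codepoint from a TOS byte."""
    return tos & ECN_MASK


def mark_ce(tos: int) -> int:
    """Return the TOS byte as a congested hop would rewrite it.

    Packets that are not ECN-capable are left untouched, because
    they cannot carry a congestion mark.
    """
    if ecn_of(tos) == NOT_ECT:
        return tos
    return tos | CE


# Example: DSCP 26 with ECT(0) set by the source.
tos = (26 << 2) | ECT_0
marked = mark_ce(tos)
print(hex(tos), hex(marked))  # 0x6a 0x6b
```

The DSCP bits are preserved by the rewrite, so the packet stays in the same traffic class after it is marked.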

Rate Limiting

Configuring traffic‑rate limiters on end devices can curb congestion, though practical usage in lossless Ethernet is not yet well documented.
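
One common way to build a rate limiter is a token bucket. The sketch below is a generic illustration only: the class name, parameters, and values are assumptions, and it is not the limiter of any particular adapter or operating system.

```python
class TokenBucket:
    """Generic token-bucket limiter (illustrative, not vendor-specific)."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = 0.0            # timestamp of the previous call, in seconds

    def allow(self, size_bytes: int, now: float) -> bool:
        """Return True if a frame of size_bytes may be sent at time now."""
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
print(bucket.allow(1500, now=0.0))  # True: the initial burst is available
print(bucket.allow(1500, now=0.5))  # False: only 500 bytes have refilled
print(bucket.allow(1500, now=1.5))  # True: the bucket is full again
```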

Network Redesign and Upgrade

Segmenting a large fabric into smaller islands limits the spread of congestion. Upgrading link speeds or adding additional links can also alleviate persistent congestion.

Congestion Recovery by Dropping Frames

When a device cannot receive frames for an extended period, dropping those frames after a timeout releases buffer space. Two implementations are described:

Drop frames based on their age in the switch (Cisco MDS FCoE ports, configured with the system timeout fcoe pause-drop command).

Drop frames destined for a slow‑drain edge port when the pause duration exceeds the timeout.

Pause Timeout

The pause‑timeout granularity is 100 ms; the switch discards all egress traffic on a port that remains in Rx‑pause for the configured interval. The default timeout is 500 ms on Cisco MDS and configurable on Nexus devices (e.g., system default interface pause timeout 100 to 500 ms).
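
The behavior can be sketched as a small simulation. This is a simplified model under stated assumptions, not switch firmware: the class and method names are hypothetical, one `poll()` call stands for one 100 ms sample, and any sample without Rx‑pause resets the timer.

```python
from collections import deque

POLL_MS = 100  # granularity stated in the text


class EgressPort:
    """Toy model of pause-timeout on a single egress port."""

    def __init__(self, timeout_ms: int = 500):
        if timeout_ms <= 0 or timeout_ms % POLL_MS:
            raise ValueError("timeout must be a positive multiple of 100 ms")
        self.timeout_ms = timeout_ms
        self.paused_ms = 0    # length of the current continuous Rx-pause
        self.queue = deque()  # frames waiting to be transmitted
        self.dropped = 0

    @property
    def dropping(self) -> bool:
        return self.paused_ms >= self.timeout_ms

    def enqueue(self, frame) -> None:
        # While the timeout is in effect, new egress frames are discarded.
        if self.dropping:
            self.dropped += 1
        else:
            self.queue.append(frame)

    def poll(self, rx_paused: bool) -> None:
        # A sample without pause resets the timer and ends the drop state.
        self.paused_ms = self.paused_ms + POLL_MS if rx_paused else 0
        if self.dropping:
            self.dropped += len(self.queue)
            self.queue.clear()


port = EgressPort(timeout_ms=300)
port.enqueue("a")
port.enqueue("b")
for _ in range(3):
    port.poll(rx_paused=True)
print(port.dropped, len(port.queue))  # 2 0
```

Because the timer resets on any unpaused sample, a port that is paused on and off never reaches the timeout in this model.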

PFC Watchdog

The PFC watchdog works like pause‑timeout but only discards traffic that has been paused by PFC. It can:

Flap or shut down the port (destructive watchdog).

Generate only an alert (log watchdog).

Close the queue, dropping all packets in it and any new ingress packets for that lossless class.

Implementation details differ by platform:

Cisco MDS : No PFC watchdog; pause‑timeout is used for FCoE traffic.

Cisco Nexus 9000/3000 : PFC watchdog is available; configuration includes priority-flow-control watch-dog-interval 100, priority-flow-control watch-dog shutdown-multiplier 1, priority-flow-control watch-dog auto-restore multiplier 10, etc.

Cisco UCS : PFC watchdog enabled by default on newer UCS Manager versions.

Example counters (NX‑OS) show shutdowns, restores, total packets drained, and packets dropped (e.g., 2 197 357 321 egress packets dropped on Ethernet1/5). Counters can be cleared with clear queuing pfc-queue.
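
The interaction of the interval and the two multipliers can be sketched as a small state machine. The semantics here are an assumption for illustration: a queue is shut after it has been continuously paused for interval × shutdown multiplier, and restored after it has been free of pause for interval × auto‑restore multiplier. The class, its fields, and the counter names are hypothetical and do not reproduce NX‑OS output.

```python
class WatchdogQueue:
    """Toy model of a PFC watchdog acting on one lossless queue."""

    def __init__(self, interval_ms=100, shutdown_multiplier=1,
                 auto_restore_multiplier=10):
        self.interval_ms = interval_ms
        self.shutdown_after = interval_ms * shutdown_multiplier
        self.restore_after = interval_ms * auto_restore_multiplier
        self.stuck_ms = 0   # continuous time paused while the queue is open
        self.clear_ms = 0   # continuous time unpaused while the queue is shut
        self.shut = False
        self.shutdowns = 0
        self.restores = 0
        self.dropped = 0

    def tick(self, pfc_paused: bool, arriving_packets: int = 0) -> None:
        """Advance the model by one watchdog interval."""
        if not self.shut:
            self.stuck_ms = self.stuck_ms + self.interval_ms if pfc_paused else 0
            if self.stuck_ms >= self.shutdown_after:
                self.shut = True
                self.shutdowns += 1
                self.clear_ms = 0
        else:
            self.clear_ms = 0 if pfc_paused else self.clear_ms + self.interval_ms
            if self.clear_ms >= self.restore_after:
                self.shut = False
                self.restores += 1
                self.stuck_ms = 0
        if self.shut:
            # A shut queue discards everything destined for it.
            self.dropped += arriving_packets


q = WatchdogQueue()
q.tick(pfc_paused=True, arriving_packets=5)       # queue is shut
for _ in range(10):
    q.tick(pfc_paused=False, arriving_packets=1)  # restored on the 10th tick
print(q.shutdowns, q.restores, q.dropped)  # 1 1 14
```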

Granularity and Limitations

Both pause‑timeout and PFC watchdog rely on a 100 ms software poll, causing up to 99 ms delay before action. They are ineffective for very short pauses (<50 ms) and only act on continuous pause periods.

Congestion Notification in Routed Lossless Ethernet

RoCEv2 uses the ECN bits in the IP header. When a switch detects queue utilization above a threshold, it marks packets with CE. The destination, upon receiving CE‑marked packets, sends a Congestion Notification Packet (CNP) back to the source, which then reduces its rate. RFC 3168 defines ECN for IP and specifies how TCP uses it; RoCEv2 applies the same concept over UDP.
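
The loop of marking, notification, and rate reduction can be sketched end to end as follows. All names are hypothetical, and the halving in `source_on_cnp` is an assumed placeholder rather than a real rate‑reduction algorithm.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ecn_capable: bool = True
    ce: bool = False


def switch_forward(pkt: Packet, queue_depth: int, threshold: int) -> Packet:
    """Mark CE when the egress queue is deeper than the threshold."""
    if pkt.ecn_capable and queue_depth > threshold:
        pkt.ce = True
    return pkt


def destination_receive(pkt: Packet) -> bool:
    """Return True when a CNP should be sent back to the source."""
    return pkt.ce


def source_on_cnp(rate_gbps: float, factor: float = 0.5,
                  floor_gbps: float = 1.0) -> float:
    """Assumed reaction: multiplicative decrease with a minimum rate."""
    return max(rate_gbps * factor, floor_gbps)


def run(rate_gbps: float, queue_depths, threshold: int) -> float:
    """Send one packet per queue-depth sample and return the final rate."""
    for depth in queue_depths:
        pkt = switch_forward(Packet(), depth, threshold)
        if destination_receive(pkt):
            rate_gbps = source_on_cnp(rate_gbps)
    return rate_gbps


# Two of the four samples exceed the threshold, so the rate is halved twice.
print(run(100.0, [10, 80, 90, 20], threshold=50))  # 25.0
```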

RoCEv2 Congestion Management (RCM) Considerations

Mixed environments with RCM‑capable and non‑RCM devices can cause unfair rate reductions.

Rate‑reduction algorithms are vendor‑specific; there is no standard algorithm.

Synchronization issues may lead to over‑ or under‑reaction when multiple sources react to the same congestion signal.

Delay between detection and action can render ECN ineffective for bursty traffic.

Configuration complexity: thresholds for WRED, pause‑timeout, and PFC must be tuned per switch architecture.

PFC and ECN Together

Combining PFC (hop‑by‑hop flow control) with ECN (end‑to‑end congestion signaling) provides rapid local reaction (PFC) and traffic‑aware rate reduction (ECN). The article illustrates a six‑step process with diagrams (Figures 7‑18 to 7‑23) showing how PFC pauses traffic, ECN marks packets, the destination generates CNP, and the source throttles its flow.

VXLAN and Lossless Traffic

VXLAN extends Layer 2 over Layer 3 using MAC‑in‑UDP encapsulation, allowing lossless traffic to traverse routed networks. The VNI is a 24‑bit identifier supporting up to 16 million virtual networks. VXLAN fabrics typically use a spine‑leaf topology with IS‑IS or OSPF for routing. End‑device congestion handling follows the same principles as native Ethernet (PFC, ECN, RCM).
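
To make the 24‑bit VNI concrete, here is a minimal sketch that packs and parses the 8‑byte VXLAN header laid out in RFC 7348 (one flags byte, three reserved bytes, a three‑byte VNI, one reserved byte). The helper names are hypothetical.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08   # the "I" flag from RFC 7348
MAX_VNI = (1 << 24) - 1       # 24-bit identifier


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for the given VNI."""
    if not 0 <= vni <= MAX_VNI:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags byte followed by 24 reserved bits.
    # Word 2: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header, checking the valid flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return vni_word >> 8


header = pack_vxlan_header(5000)
print(header.hex())        # 0800000000138800
print(unpack_vni(header))  # 5000
print(MAX_VNI + 1)         # 16777216
```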

Physical Topology

Leaf switches (e.g., Cisco Nexus 9000) act as VTEPs with both Layer 2 interfaces to local hosts and Layer 3 uplinks to the spine. Traffic between VTEPs traverses the spine using ECMP, and the underlying switches are unaware of the VXLAN encapsulation, operating only on the outer IP header.
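
The sketch below illustrates why ECMP still spreads VXLAN traffic even though transit switches see only the outer header: the ingress VTEP can derive the outer UDP source port from a hash of the inner flow, as RFC 7348 recommends. The hash function, key format, and names are assumptions for illustration; real switches use their own hardware hash.

```python
import zlib

VXLAN_UDP_PORT = 4789  # IANA-assigned VXLAN destination port


def outer_source_port(inner_flow: bytes) -> int:
    """Derive the outer UDP source port from the inner flow (49152-65535)."""
    return 49152 + zlib.crc32(inner_flow) % 16384


def pick_spine(src_vtep: str, dst_vtep: str, src_port: int, spines: list) -> str:
    """Choose an uplink by hashing the outer 5-tuple only."""
    key = f"{src_vtep}|{dst_vtep}|17|{src_port}|{VXLAN_UDP_PORT}".encode()
    return spines[zlib.crc32(key) % len(spines)]


spines = ["spine1", "spine2", "spine3", "spine4"]
flow = b"10.1.1.10:49500->10.1.2.20:4791"  # hypothetical inner RoCEv2 flow
port = outer_source_port(flow)
print(port, pick_spine("192.0.2.1", "192.0.2.2", port, spines))
```

Every packet of a given inner flow maps to the same outer source port and therefore to the same spine, which keeps the flow in order while different flows can take different paths.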

Overall, the article provides a comprehensive, step‑by‑step guide to detecting, preventing, and recovering from congestion in lossless Ethernet storage networks, with concrete command examples, counter outputs, and practical considerations for real‑world deployments.
