How to Troubleshoot Congestion in Lossless Ethernet Storage Networks – Part 5
This article presents a step‑by‑step methodology for detecting, diagnosing, and resolving congestion in lossless Ethernet storage networks. It covers severity levels, spine‑leaf troubleshooting workflows, remote monitoring, comparative analysis of pause‑frame metrics, and real‑world case studies that illustrate how over‑utilization and mixed traffic affect network performance.
Goals
The primary goal is to identify the source (the culprit) and cause of congestion, such as slow drain detected via high TxWait or excessive pause‑frame counts, or over‑utilization indicated by high egress utilization. The secondary goal is to pinpoint the affected devices (victims), which may be direct, indirect, or same‑path victims.
Congestion Severities and Levels
Three severity levels are defined for lossless Ethernet:
Level 1 – Mild: Latency increases but no frame loss. Detect by monitoring pause‑frame counts, TxWait/RxWait (if available), and link utilization.
Level 2 – Moderate: Both latency and frame loss increase. Detect by observing loss in a lossless class.
Level 3 – Severe: Latency increase, frame loss, and sustained traffic pause. Detect with pause‑frame timeout or PFC watchdog.
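To make the triage order concrete, here is a minimal Python sketch that maps per‑port metrics to these three levels, checking the worst symptoms first. The metric names and structure are assumptions for illustration, not counters from any particular switch OS:

```python
# Hypothetical per-port metrics; real values would come from the switch NOS
# (pause counters, per-class drop counters, PFC watchdog events).
from dataclasses import dataclass

@dataclass
class PortMetrics:
    pause_frames_per_s: float   # pause rate in the lossless class (Level 1 signal)
    no_drop_losses: int         # frames lost in a no-drop class (Level 2 signal)
    pfc_watchdog_events: int    # sustained-pause detections (Level 3 signal)

def congestion_severity(m: PortMetrics) -> int:
    """Return 0 (none) through 3 (severe), checking the worst symptom first."""
    if m.pfc_watchdog_events > 0:
        return 3    # sustained traffic pause
    if m.no_drop_losses > 0:
        return 2    # loss in a lossless class
    if m.pause_frames_per_s > 0:
        return 1    # latency rising, no loss yet
    return 0
```

Checking Level 3 before Level 1 mirrors the highest‑severity‑first methodology described next.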
Methodology
We recommend troubleshooting from the highest severity downwards. If Level 3 metrics are unavailable, start with Level 2 (packet loss) and then Level 1 (pause‑frame counts). The workflow can be customized per environment.
Troubleshooting Congestion in a Spine‑Leaf Topology
Assume a host connected to Leaf‑1 reports performance degradation (an indirect victim). Follow these steps; a code sketch of the traversal appears after the list:
Check the host's directly connected switch port for egress congestion (Rx pause from the host, or egress packet loss toward it). If either is present, the host itself is the culprit.
If not, look for ingress congestion (Tx pause) on the other edge ports of the same switch. A port on which the switch transmits pause frames indicates the attached device is sending traffic toward the congestion point and is being back‑pressured.
Inspect Leaf‑1's uplink port for egress congestion (Rx pause or egress drops) to determine whether the bottleneck lies beyond the switch.
Move upstream to a spine device (e.g., Spine‑1) and verify that its Tx pause count matches the Rx pause count on Leaf‑1. Mismatched values suggest bit errors or firmware bugs.
Continue hop by hop, checking each device for egress congestion (Rx pause or packet loss) until the source is found.
Prioritize higher‑severity symptoms (packet loss before pause‑frame counts) when multiple ports show congestion.
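The following Python sketch walks this hop‑by‑hop search over a toy topology. The counters, port names, and the NEXT_HOP map are invented for illustration; in practice, each hop's statistics would be polled from the live switches:

```python
# (switch, port) -> counters observed on that port; toy data, not real output.
STATS = {
    ("Leaf-1", "Eth1/1"):  {"rx_pause": 0,     "egress_drops": 0},   # victim's port
    ("Leaf-1", "Eth1/49"): {"rx_pause": 12000, "egress_drops": 0},   # uplink to Spine-1
    ("Spine-1", "Eth1/2"): {"rx_pause": 12000, "egress_drops": 0},   # link toward Leaf-2
    ("Leaf-2", "Eth1/5"):  {"rx_pause": 9000,  "egress_drops": 40},  # culprit edge port
}

# Where egress congestion on a port points next; absence means an edge port.
NEXT_HOP = {
    ("Leaf-1", "Eth1/49"): ("Spine-1", "Eth1/2"),
    ("Spine-1", "Eth1/2"): ("Leaf-2", "Eth1/5"),
}

def find_culprit(hop):
    """Follow ports showing egress congestion (Rx pause or drops) upstream."""
    while True:
        s = STATS[hop]
        if s["rx_pause"] == 0 and s["egress_drops"] == 0:
            return None          # no egress congestion along this path
        nxt = NEXT_HOP.get(hop)
        if nxt is None:
            return hop           # edge port still congested: likely culprit
        hop = nxt

print(find_culprit(("Leaf-1", "Eth1/49")))   # -> ('Leaf-2', 'Eth1/5')
```

Comparing the Tx pause count on one side of each link with the Rx pause count on the other, as the steps above describe, is also where counter mismatches from bit errors or firmware bugs would surface.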
Reality Check
Manual CLI inspection is difficult because most Ethernet switches (e.g., Cisco Nexus 9000) do not retain timestamped congestion events, and they often lack TxWait / RxWait counters. Users must repeatedly poll cumulative pause counters and compute deltas, which is error‑prone at scale.
Remote Monitoring Platform
Using a remote monitoring system (e.g., UCS Traffic Monitoring) allows continuous polling of pause‑frame counts with timestamps, simplifying real‑time congestion detection.
Comparative Analysis
Periodically compare pause‑frame rates across host and switch ports. Poll every 60 seconds, compute the delta, and rank ports by descending pause‑frame count to identify top‑suspect devices.
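A minimal sketch of that loop follows, assuming a poll_pause_counters callable that returns cumulative pause counts per port (a hypothetical helper, e.g. backed by SNMP or a switch API):

```python
import time

def rank_pause_suspects(poll_pause_counters, interval_s=60, top_n=5):
    """Poll cumulative pause counters, diff them, and rank the top movers."""
    prev = poll_pause_counters()
    while True:
        time.sleep(interval_s)
        cur = poll_pause_counters()
        # cumulative counters only ever grow (ignoring wraps), so the delta
        # is the pause activity within this polling interval
        deltas = {port: cur[port] - prev.get(port, 0) for port in cur}
        ranked = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
        for port, d in ranked[:top_n]:
            print(f"{port}: {d} pause frames in the last {interval_s}s")
        prev = cur
```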
Trends and Seasonality
Analyze pause‑frame counts for long‑term trends, peaks, and daily/weekly patterns to differentiate transient spikes from persistent congestion.
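As one illustrative approach (an assumption, not a prescribed method), per‑minute pause‑frame deltas can be averaged by minute of day across several days; persistent excursions above that baseline point to chronic congestion rather than a transient spike:

```python
from statistics import mean

def minute_of_day_baseline(samples, minutes_per_day=1440):
    """Average each minute-of-day across whole days of per-minute deltas."""
    days = [samples[i:i + minutes_per_day]
            for i in range(0, len(samples), minutes_per_day)]
    days = [d for d in days if len(d) == minutes_per_day]  # drop partial days
    return [mean(minute) for minute in zip(*days)]
```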
Monitoring a Slow‑Drain Suspect
Establish a baseline for devices that send pause frames: a few hundred per second may be normal, but a sudden jump to thousands per second marks a likely culprit.
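A simple detection rule for that pattern might look as follows; the thresholds are illustrative assumptions to be tuned per environment:

```python
def slow_drain_suspect(baseline_pps, current_pps,
                       floor_pps=1000, jump_factor=5):
    """Flag a port whose pause rate jumps from hundreds/s into thousands/s."""
    return (current_pps >= floor_pps
            and current_pps >= jump_factor * max(baseline_pps, 1))
```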
Monitoring an Over‑Utilization Suspect
When congestion is caused by over‑utilization, pause‑frame counters alone can be misleading; investigate egress utilization and trace ports operating at or near 100% to locate the source.
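Egress utilization itself can be derived from two samples of a port's output byte counter; a minimal sketch, with the counter source (e.g. an interface octet counter) assumed:

```python
def egress_utilization(bytes_t0, bytes_t1, interval_s, link_bps=10_000_000_000):
    """Fraction of link capacity used over the interval (ignores counter wrap)."""
    return (bytes_t1 - bytes_t0) * 8 / (interval_s * link_bps)

# 75 GB transmitted in 60 s on a 10 GbE link -> 1.0, a saturated egress port
print(egress_utilization(0, 75_000_000_000, 60))
```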
FC and FCoE in the Same Network
FC and FCoE ports expose different congestion metrics: FC uses buffer‑to‑buffer (B2B) credit counters (Rx B2B for ingress, Tx B2B for egress), while FCoE uses PFC pause counters. The troubleshooting steps are analogous but require the appropriate commands for each protocol.
Multiple No‑Drop Classes on the Same Link
When several lossless classes (CoS) are enabled, troubleshoot one class at a time, following the same severity‑based workflow.
Bandwidth Allocation Between Lossless and Lossy Traffic
ETS guarantees a minimum bandwidth (e.g., 50 % of link capacity) for lossless classes but allows them to use up to 100 % when other classes are idle. Over‑utilization of lossless classes can cause congestion when lossy traffic competes for the same link.
Effect of Lossy Traffic on No‑Drop Class
Lossy traffic can reduce the effective bandwidth available to lossless classes, causing congestion that would not appear in a purely lossless environment.
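To make the interaction concrete, here is a toy two‑class model of ETS sharing on a single link, assuming the 50% guarantee from the example above. Real schedulers are more granular, but the arithmetic matches the case study that follows:

```python
# Toy two-class ETS model: each class keeps its guaranteed share and may
# borrow whatever the other class leaves idle. All figures in Gbps.
def ets_throughput(link, lossless_offered, lossy_offered, lossless_pct=50):
    g_lossless = link * lossless_pct / 100
    g_lossy = link - g_lossless
    # lossy may borrow any guarantee the lossless class is not using
    lossy_used = min(lossy_offered,
                     g_lossy + max(0.0, g_lossless - lossless_offered))
    lossless_used = min(lossless_offered, link - lossy_used)
    return lossless_used, lossy_used

print(ets_throughput(10, 6, 2))  # (6, 2): lossless borrows the idle lossy share
print(ets_throughput(10, 6, 5))  # (5, 5): PFC must throttle lossless to 5 Gbps
```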
Case Study 1 – Online Gaming Company
The company used a converged Ethernet fabric for I/O (lossless) and TCP/IP (lossy) traffic. During peak hours, a server with high CPU usage sent large numbers of pause frames, and the resulting congestion spread to other servers. After the workload was moved to a more powerful server, pause‑frame counts dropped, CPU usage normalized, and the performance issues disappeared, illustrating the importance of monitoring per‑class traffic and the impact of lossy traffic on lossless classes.
Case Study 2 – Converged vs. Dedicated Storage Network
In a similar environment, lossless traffic averaged 6 Gbps (60% of a 10 GbE link) while lossy traffic spiked from 2 Gbps to 5 Gbps. The combined offered load of 11 Gbps exceeded the 10 Gbps link capacity, forcing PFC to throttle the lossless traffic. Adding a second 10 GbE link resolved the contention, highlighting the trade‑off between converged and dedicated storage networks.
Overall, the article demonstrates a systematic, data‑driven approach to diagnosing congestion in lossless Ethernet storage networks, emphasizing the need for accurate metrics, proper severity classification, and awareness of how lossy traffic can affect lossless classes.