
Layer‑3 Priority Flow Control in Ethernet Storage Networks (Part 3)

This article explains how layer‑3 priority flow control uses the IP DSCP field, mapped to VLAN CoS classes, to steer traffic into lossless queues; details the configuration steps for PFC, ETS, and DCBX; and describes congestion scenarios and detection metrics in converged Ethernet data‑center networks.

Linux Code Review Hub

Layer‑3 priority flow control operates at OSI layer 3, where traffic is identified by IPv4 or IPv6 source and destination addresses. The IP header contains a six‑bit DSCP field that can represent up to 64 traffic classes, though not all are used.
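The field layout is easy to pin down in a few lines (a minimal illustration; per RFC 2474 the DSCP occupies the upper six bits of the former ToS byte, with the lower two bits used for ECN):

```python
def dscp_from_tos(tos: int) -> int:
    """Extract the 6-bit DSCP from the 8-bit ToS / Traffic Class byte.

    Per RFC 2474 the DSCP is the upper six bits; the lower two bits
    carry ECN (RFC 3168).
    """
    return (tos >> 2) & 0x3F

# CS3 (DSCP 24, binary 011000) corresponds to a ToS byte of 0x60
assert dscp_from_tos(0x60) == 24
# Six bits yield 2**6 = 64 possible traffic classes
assert len({dscp_from_tos(tos) for tos in range(256)}) == 64
```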

Figure 7‑5 illustrates the relationship between the IP DSCP field and the PFC pause‑frame class‑enable vector. Because a PFC pause frame carries only eight class‑enable bits, a mapping (Table 7‑1) is required to translate between CoS and DSCP values, known as the CoS‑to‑DSCP or DSCP‑to‑CoS mapping.

In the example, Host‑1, Switch‑1, and Target‑1 agree to use CS3 for lossless traffic. Target‑1 marks the IP header with DSCP value 24 (binary 011000). Switch‑1 maps this DSCP to a lossless queue and, when the queue exceeds the pause threshold, sends a PFC pause frame with class‑enable vector 00001000, causing Target‑1 to stop transmitting CS3 packets without affecting other classes. To make CS4 traffic lossless as well, the class‑enable vector would be 00011000 (bits 3 and 4 set).
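The mapping arithmetic described above can be sketched as follows (function names are illustrative; for class‑selector codepoints such as CS3, the three high‑order DSCP bits equal the CoS value):

```python
def cos_from_dscp(dscp: int) -> int:
    """Map a class-selector DSCP to its CoS value.

    For CS codepoints the three high-order DSCP bits equal the CoS,
    so CS3 (DSCP 24, binary 011000) maps to CoS 3.
    """
    return (dscp >> 3) & 0x7

def class_enable_vector(lossless_cos: set[int]) -> int:
    """Build the 8-bit PFC class-enable vector; bit n pauses CoS n."""
    vec = 0
    for cos in lossless_cos:
        vec |= 1 << cos
    return vec

assert cos_from_dscp(24) == 3                      # CS3 -> CoS 3
assert class_enable_vector({3}) == 0b00001000      # pause CS3 only
assert class_enable_vector({3, 4}) == 0b00011000   # pause CS3 and CS4
```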

Table 7‑1 – Ethernet VLAN CoS and IP DSCP mapping


To understand the default CoS‑to‑DSCP mapping on Cisco Nexus 9000 switches, use the NX‑OS command show system internal ipqos global-defaults.

Converged Ethernet networks combine lossless and lossy traffic. Besides PFC, they require bandwidth guarantees (implemented via Enhanced Transmission Selection, ETS, IEEE 802.1Qaz) and consistent configuration across devices, typically achieved with DCBX (also IEEE 802.1Qaz) and LLDP (IEEE 802.1AB‑2005) for automatic discovery and advertisement.

PFC, ETS, and DCBX belong to the Data Center Bridging (DCB) family, also known as Data Center Ethernet (DCE), Converged Ethernet (CE), or Converged Enhanced Ethernet (CEE).

Network types:

Dedicated lossless network – only lossless traffic.

Shared storage network – carries both lossless and lossy traffic without explicit lossless handling.

Converged (aggregated) network – uses PFC, ETS, and DCBX to support both traffic types.

Configuring lossless Ethernet involves three steps:

Classifying and marking the traffic: Use VLAN CoS (layer 2) for marked frames; otherwise use the IP DSCP field (layer 3). Edge ports can trust host‑marked packets or apply their own classification.

Flow‑control and bandwidth allocation: Identify lossless classes (controlled by PFC) and allocate bandwidth (e.g., 50 % of link capacity) while allowing lossy classes unrestricted bandwidth.

Consistent implementation: Ensure all devices apply the same QoS configuration; mismatched CoS settings cause lossless traffic to become lossy. Automation via DCBX or SDN can simplify this process.
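As a sketch of the third step, a small script can diff per‑device QoS summaries and flag outliers. Everything here is hypothetical: the device names, the `qos_by_device` structure, and the majority‑vote rule are assumptions, and a real tool would gather this state via DCBX TLVs or the devices' management APIs:

```python
from collections import Counter

# Hypothetical per-device QoS summaries (illustrative data only)
qos_by_device = {
    "Switch-1": {"lossless_cos": {3}, "bandwidth_pct": {3: 50}},
    "Switch-2": {"lossless_cos": {3}, "bandwidth_pct": {3: 50}},
    "Switch-3": {"lossless_cos": {4}, "bandwidth_pct": {4: 50}},  # mismatch
}

def find_mismatches(qos: dict) -> list[str]:
    """Flag devices whose lossless classes differ from the majority view."""
    views = Counter(frozenset(cfg["lossless_cos"]) for cfg in qos.values())
    majority = views.most_common(1)[0][0]
    return [dev for dev, cfg in qos.items()
            if frozenset(cfg["lossless_cos"]) != majority]

assert find_mismatches(qos_by_device) == ["Switch-3"]
```

On Switch‑3, CS4 traffic arriving from its neighbors would be treated as lossy, which is exactly the silent failure mode the article warns about.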

Congestion scenarios:

Single‑switch lossless network (Fig 7‑6): A slow‑draining host (Host‑1) sends PFC pause frames to the switch; as the switch's lossless buffers fill, it pauses all eight targets, affecting traffic to every target, not just Host‑1's flows.

Edge‑core lossless network (Fig 7‑7): A congested edge switch forwards pause frames to the core, which then propagates pause to up to 199 hosts downstream.

Leaf‑spine lossless network (Fig 7‑8): A slow‑draining host on a leaf triggers pause frames that cascade through the spine to other leaves, impacting many hosts regardless of their traffic class.

Congestion can be caused by slow drain devices, over‑utilized links, or bit errors. Detection distinguishes between these causes by examining pause‑frame counts, TxWait/RxWait durations, and buffer utilization.
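A hedged sketch of that triage logic, with illustrative thresholds (the 80 % load cut‑off and counter names are assumptions, not standard values):

```python
def classify_congestion(pause_frames: int, txload_pct: float,
                        crc_errors: int, fec_uncorrectable: int) -> str:
    """Heuristic triage of a congested port (thresholds are illustrative).

    - CRC or uncorrectable FEC errors point at a faulty link.
    - Many pause frames on a lightly loaded link suggest a slow drain.
    - Many pause frames on a saturated link suggest over-utilization.
    """
    if crc_errors > 0 or fec_uncorrectable > 0:
        return "bit errors"
    if pause_frames > 0 and txload_pct < 80:
        return "slow drain"
    if pause_frames > 0:
        return "over-utilization"
    return "no congestion"

assert classify_congestion(10_000, 30.0, 0, 0) == "slow drain"
assert classify_congestion(10_000, 98.0, 0, 0) == "over-utilization"
assert classify_congestion(500, 98.0, 12, 0) == "bit errors"
```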

Congestion‑detection workflow includes identifying what is detected, its impact, root cause, culprit device, propagation path, timing, detection method (reactive, proactive, predictive), and monitoring location (switch, host, server, storage array). Remote monitoring platforms (e.g., UCS traffic‑monitoring apps) are often used.

Detection metrics:

Pause‑frame monitoring (count, duration, TxWait, RxWait).

Instantaneous buffer occupancy (distance to pause threshold).

Frame drops/discards.

Bit‑error counters (CRC, stomped CRC, FEC).

Link utilization (speed, input/output bytes, Txload/Rxload percentages).

TxWait and RxWait represent the time a port cannot transmit or receive because it has received a pause frame. They can be expressed as raw microseconds or as a percentage of a time interval (e.g., 50 % TxWait over 20 s means the port was blocked for 10 s).
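The conversion is trivial but worth pinning down (a sketch; the microsecond granularity matches how TxWait is typically reported):

```python
def txwait_percent(blocked_us: int, interval_s: float) -> float:
    """Express TxWait (microseconds blocked) as a percentage of an interval."""
    return blocked_us / (interval_s * 1_000_000) * 100

# 10 s of pause within a 20 s interval -> 50 % TxWait
assert txwait_percent(10_000_000, 20) == 50.0
```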

Examples of metric collection: show interface priority-flow-control on Cisco Nexus 9000 and Cisco UCS displays total pause‑frame counts per interface.

Per‑class pause‑frame counts require show queuing interface.

TxWait/RxWait history is shown on Cisco MDS and Nexus 7000 switches (Fig 7‑5) and can be logged to the onboard fault log (OBFL) when thresholds are exceeded.

Pause‑frame counts on specific interfaces (e.g., Ethernet 1/8) reveal whether congestion is inbound or outbound.

CRC error counters increment only when bit errors occur within a frame; stomped CRC counters help locate errors in cut‑through switches.

FEC counters (correctable and uncorrectable) indicate whether forward error correction recovered errors; when FEC cannot correct, CRC may still increment.

Link‑utilization counters (interface speed, cumulative bytes, 30‑second and 5‑minute averages, Txload/Rxload) provide throughput percentages.
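Cisco platforms report txload/rxload as fractions of 255 (e.g., "txload 128/255"); converting them to throughput percentages is a one‑liner (a sketch):

```python
def load_percent(load_fraction: int, denominator: int = 255) -> float:
    """Convert a Cisco-style txload/rxload fraction (n/255) to a percentage."""
    return load_fraction / denominator * 100

# "txload 128/255" is roughly half the link's capacity
assert round(load_percent(128), 1) == 50.2
```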

Sample command output for FEC counters on a Cisco Nexus 9000:

switch# attach module 1
module-1# show hardware internal tah mac hwlib show mac_errors fp-port 15
... (output omitted) ...
MAC: FEC Err per ch: 0   1   2   3
--------------------------------------------
RS FEC Correctable .....
RS FEC UnCorrectable .....
FEC Not enabled for lane 1.
FEC Not enabled for lane 2.
FEC Not enabled for lane 3.

Understanding these metrics enables operators to pinpoint the source of congestion, differentiate between slow‑drain and over‑utilization conditions, and apply appropriate remediation such as adjusting pause‑frame thresholds, rebalancing bandwidth guarantees, or fixing bit‑error sources.

Key takeaways:

Accurate CoS‑to‑DSCP mapping is essential for lossless traffic classification.

Consistent QoS configuration across all devices prevents lossless traffic from becoming lossy.

Monitoring TxWait/RxWait, pause‑frame counts, and error counters provides a comprehensive view of congestion health.

Automation (DCBX, SDN) reduces manual errors and speeds up fault isolation.
