
Why Do Data Center Networks Congest? Unpacking Many‑to‑One and All‑to‑All Incast Scenarios

The article analyzes how CLOS spine‑leaf data‑center networks encounter congestion under many‑to‑one and all‑to‑all traffic patterns, explains the limitations of simply enlarging buffers, and details how ECN and PFC mechanisms can be tuned to achieve loss‑less, low‑latency operation.

Architects' Tech Alliance

Data Center Congestion Scenarios

Data‑center networks often suffer from congestion, primarily due to two traffic models: the many‑to‑one (many senders to a single receiver) and the all‑to‑all (multiple senders to multiple receivers) patterns.

CLOS Spine‑Leaf Architecture

The CLOS architecture prevalent in modern data centers uses a spine‑leaf topology. With equal‑cost multi‑path (ECMP) routing it delivers non‑blocking performance, scales horizontally, and is simple to reason about. Full mesh connectivity between the spine and leaf layers ensures that a single switch failure does not disrupt the entire network.

Incast in Many‑to‑One Traffic

Consider a CLOS network with four leaf switches (leaf1–leaf4) and three spines (spine1–spine3). A distributed storage service runs on four servers. When server2 reads data simultaneously from server1, server3, and server4, three flows converge on the single leaf2 port facing server2, creating a 3‑to‑1 incast. Although the topology as a whole is non‑blocking, the egress buffer on leaf2 becomes the bottleneck: as long as the many‑to‑one traffic persists, the buffer overflows and packets are dropped, degrading both throughput and latency.
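The arithmetic behind this incast can be made concrete with a toy model. The sketch below (all rates, buffer sizes, and the function name are illustrative, not taken from any real switch) shows why a fixed egress buffer must eventually overflow when three senders offer line rate toward one port that drains at line rate:

```python
# Minimal sketch of a 3-to-1 incast on one egress port (illustrative numbers).
# Three senders each offer full line rate toward a single receiver, so 3 units
# arrive per tick while the egress port drains only 1; the buffer absorbs the
# difference until it fills, after which every excess packet is dropped.

def simulate_incast(senders=3, buffer_size=100, ticks=200, drain_rate=1):
    queue, dropped = 0, 0
    for _ in range(ticks):
        arrivals = senders * drain_rate        # each sender offers line rate
        space = buffer_size - queue
        accepted = min(arrivals, space)        # buffer takes what it can hold
        dropped += arrivals - accepted         # the rest is tail-dropped
        queue = max(queue + accepted - drain_rate, 0)  # egress drains one unit
    return queue, dropped

queue, dropped = simulate_incast()
print(f"final queue depth: {queue}, packets dropped: {dropped}")
```

However large the buffer, it only delays the overflow: the queue grows by two units per tick until it saturates, and from then on drops persist for as long as the incast does.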

All‑to‑All Traffic and Load Balancing

In a different scenario, server1 writes to server2 while server4 writes to server3. These two independent one‑to‑one flows form a 2‑to‑2 pattern. The network remains non‑blocking except where the two flows are steered onto the same path: if both traverse spine2, its port toward leaf2 sees a 2‑to‑1 incast, again limited by buffer size. To achieve loss‑less operation under all‑to‑all traffic, load balancing must keep multiple one‑to‑one flows from intersecting at the same switch port.
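Why can two independent flows land on the same spine at all? Flow‑based ECMP picks a path by hashing each flow's 5‑tuple, and two hashes can collide. The sketch below illustrates that mechanism; the hash choice (MD5), spine names, and flow tuples are assumptions for illustration, not a vendor's actual ECMP algorithm:

```python
# Sketch of flow-based ECMP path selection: each flow's 5-tuple is hashed,
# and the hash modulo the number of equal-cost paths picks the spine.
# Two independent flows can collide on one spine even though the fabric
# as a whole is non-blocking. Flow tuples below are made up.

import hashlib

SPINES = ["spine1", "spine2", "spine3"]

def pick_spine(flow):
    # Deterministic hash keeps every packet of a flow on one path,
    # avoiding reordering -- at the cost of possible collisions.
    digest = hashlib.md5(str(flow).encode()).hexdigest()
    return SPINES[int(digest, 16) % len(SPINES)]

flow_a = ("server1", "server2", 49152, 4420, "tcp")  # write: server1 -> server2
flow_b = ("server4", "server3", 49153, 4420, "tcp")  # write: server4 -> server3

print(pick_spine(flow_a), pick_spine(flow_b))
```

When the two flows happen to hash to the same spine, that spine's downlink carries both and becomes a 2‑to‑1 bottleneck despite spare capacity elsewhere, which is exactly the collision that smarter load balancing tries to avoid.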

Why Buffer Scaling Is Not Sufficient

Increasing buffer size can temporarily alleviate incast, but it becomes ineffective as network scale and link bandwidth grow. Large buffers also raise chip cost dramatically, making the approach uneconomical. Therefore, merely enlarging buffers cannot guarantee loss‑less, low‑latency performance.

Congestion Control Mechanisms

To prevent buffer overflow, switches must signal senders early and retain enough buffer to hold packets until the sender throttles. This requires a congestion‑control mechanism that limits the aggregate traffic entering a congested port.

ECN (Explicit Congestion Notification) Fundamentals

ECN allows a receiver that detects congestion to notify the sender via a protocol packet, prompting the sender to reduce its transmission rate before packet loss occurs. This early‑warning mechanism brings several advantages:

All senders can sense congestion on the path early and voluntarily slow down, preventing congestion buildup.

Switches mark packets with the ECN flag when the average queue length exceeds a threshold, instead of dropping them, preserving throughput.

Reduced packet loss avoids costly retransmission timeouts, improving latency for delay‑sensitive applications.

Overall network utilization improves because the network no longer oscillates between overload and underload.
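The marking rule mentioned above is typically WRED‑style: between a low and a high queue threshold the marking probability ramps up linearly, and above the high threshold every packet is marked. A minimal sketch of that logic follows; the threshold values, probability cap, and function name are illustrative assumptions, not vendor defaults:

```python
# Sketch of WRED-style ECN marking. Below the low threshold, nothing is
# marked; between the thresholds, packets are marked with a probability
# that ramps up linearly; at or above the high threshold, every packet
# is marked. All numeric values here are illustrative.

import random

LOW_THRESHOLD = 20    # packets: below this, never mark
HIGH_THRESHOLD = 80   # packets: at or above this, always mark
MAX_MARK_PROB = 0.5   # marking probability reached at the high threshold

def should_mark_ecn(avg_queue_len, rng=random.random):
    if avg_queue_len < LOW_THRESHOLD:
        return False
    if avg_queue_len >= HIGH_THRESHOLD:
        return True
    # Linear ramp of marking probability between the two thresholds.
    ramp = (avg_queue_len - LOW_THRESHOLD) / (HIGH_THRESHOLD - LOW_THRESHOLD)
    return rng() < ramp * MAX_MARK_PROB

print(should_mark_ecn(10))   # below the low threshold: never marked
print(should_mark_ecn(100))  # above the high threshold: always marked
```

The probabilistic ramp is the point: it slows a few senders early and gently, rather than hitting every sender at once the moment a hard limit is crossed.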

Relationship Between ECN and PFC Thresholds

The process works as follows:

When a device’s lossless queue exceeds the ECN threshold, the device marks the ECN field (set to 11, Congestion Experienced) on outgoing packets.

The destination server receives the ECN‑marked packet and sends a Congestion Notification Packet (CNP) back to the source, which then reduces its sending rate.

If the queue continues to fill and surpasses the PFC (Priority Flow Control) pause‑frame threshold, the device sends a PFC pause frame to the source, halting traffic for the affected priority.

When the queue drains below the PFC release threshold, the device sends a PFC resume frame, allowing the source to resume transmission.

Because there is a time gap between ECN marking and the source’s rate reduction, traffic continues to arrive at the congested device during this interval. Properly setting the ECN and PFC thresholds ensures that the buffer space between the two thresholds can absorb the traffic generated in this window, minimizing the chance that PFC pause frames are triggered.
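The steps above can be sketched as a toy queue model. The ECN threshold sits below the PFC pause (XOFF) threshold, and the headroom between them absorbs the traffic that arrives during the sender's reaction delay; when the thresholds are set properly, ECN alone tames the queue and PFC never fires. All numeric values and names here are illustrative assumptions:

```python
# Toy model of the ECN / PFC threshold interplay on one lossless queue.
# ECN fires first; the headroom between ECN_THRESHOLD and PFC_XOFF absorbs
# traffic arriving during the sender's reaction delay. Values illustrative.

ECN_THRESHOLD = 50   # mark ECN above this queue depth
PFC_XOFF = 90        # send a PFC pause frame above this depth
PFC_XON = 40         # send a PFC resume frame once drained below this
REACTION_DELAY = 10  # ticks between ECN marking and the sender's rate cut

def simulate(ticks=100, fast_rate=3, slow_rate=1, drain=2):
    queue, rate = 0, fast_rate
    slow_at = None           # tick at which the CNP-triggered rate cut lands
    pfc_pauses = 0
    for t in range(ticks):
        if queue > ECN_THRESHOLD and slow_at is None:
            slow_at = t + REACTION_DELAY   # CNP round trip before the cut
        if slow_at is not None and t >= slow_at:
            rate = slow_rate               # sender has throttled
        if queue > PFC_XOFF:
            pfc_pauses += 1
            rate = 0                       # pause frame halts the priority
        elif queue < PFC_XON and rate == 0:
            rate = slow_rate               # resume frame lifts the pause
        queue = max(queue + rate - drain, 0)
    return queue, pfc_pauses

queue, pauses = simulate()
print(f"final queue: {queue}, PFC pauses triggered: {pauses}")
```

With these settings the queue peaks between the two thresholds and drains once the rate cut takes effect, so no pause frame is ever sent; shrinking the gap between the thresholds (or lengthening the reaction delay) is what pushes the queue into PFC territory.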

Tags: Congestion Control, Data Center Networking, CLOS, Spine‑Leaf, ECN, Incast, PFC
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
