Why Ethernet Struggles with AI Workloads and How Adaptive Routing Solves It
This article analyzes how AI-driven elephant flows overload traditional Ethernet networks, causing long-tail latency and victim-flow congestion, and explains how adaptive routing, RDMA/RoCE features, advanced congestion-control algorithms, and high-capacity switch chips can mitigate these challenges.
Traditional cloud computing traffic consists of small, stable flows that rarely cause congestion, so conventional Ethernet routing policies work well. In contrast, AI inference clusters generate massive "elephant flows" that dominate a few paths, leading to a long‑tail effect where most links finish early while a few overloaded links delay overall system utilization.
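The imbalance described above can be illustrated with a small simulation. Static hash-based load balancing (ECMP-style) spreads many small flows evenly, but a handful of elephant flows carrying the same total bytes routinely collide on one link. This is an illustrative sketch, not a model of any specific switch; the function names and parameters are invented for the example.

```python
import random

def max_link_utilization(flow_sizes, num_links=4, trials=1000):
    """Hash each flow onto one of `num_links` equal-cost paths (static
    ECMP placement, modeled as a random choice) and return the average
    share of total traffic carried by the busiest link."""
    worst = 0.0
    for _ in range(trials):
        load = [0] * num_links
        for size in flow_sizes:
            load[random.randrange(num_links)] += size  # static hash placement
        worst += max(load) / sum(load)
    return worst / trials

# Cloud-style traffic: many small, similar flows.
mice = [1] * 400
# AI-style traffic: a few elephant flows carrying the same total bytes.
elephants = [100] * 4

print(f"mice:      busiest link carries ~{max_link_utilization(mice):.0%}")
print(f"elephants: busiest link carries ~{max_link_utilization(elephants):.0%}")
```

With 400 small flows over 4 links the busiest link stays near the fair share of 25%; with 4 elephant flows it carries roughly half the traffic on average, which is exactly the long-tail effect the article describes.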
Victim Flow Phenomenon
When multiple AI workloads share common leaf or spine switches, a "many-to-one" (victim flow) situation arises. For example, load A (ports 1-3) and load B (port 4) both traverse shared switches a and b. Congestion on switch a caused by load A propagates to switch b, and even though load B uses a different set of ports, both flows end up with reduced steady-state bandwidth.
RDMA/RoCE Adaptive Routing Solution
RDMA‑based networks address this problem through adaptive routing that dynamically monitors port queue lengths and redirects new packets to the least‑loaded ports or paths, achieving load balancing across the fabric.
The switch evaluates each port’s output queue status, determines its load, and routes incoming packets to the port/path with the smallest load.
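The port-selection decision above can be sketched in a few lines. This is a hypothetical helper, not a real switch API: it assumes the chip exposes per-port output queue depths and simply picks the least-loaded candidate for each new packet.

```python
def select_egress_port(queue_depths, candidate_ports):
    """Core decision of per-packet adaptive routing: among the equal-cost
    candidate ports, pick the one whose output queue is shortest.
    (Illustrative sketch; `queue_depths` stands in for the hardware's
    sampled queue-occupancy counters.)"""
    return min(candidate_ports, key=lambda p: queue_depths[p])

# Sampled output-queue depth (in buffer cells) per port.
depths = {1: 120, 2: 5, 3: 64, 4: 5}
print(select_egress_port(depths, candidate_ports=[1, 2, 3]))  # → 2
```

Because the decision is re-evaluated per packet rather than per flow, an elephant flow is spread across paths instead of pinning one of them, at the cost of possible reordering, which is handled next.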
Rerouted packets may arrive out of order; the DDP (Direct Data Placement) protocol embeds a prefix indicating the original memory location, allowing the NIC to reorder data correctly before placement.
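The key idea of direct placement is that each packet carries its destination offset, so the NIC can write payloads straight into the target buffer regardless of arrival order. The sketch below is a toy model of that mechanism; real DDP headers carry additional fields beyond the offset used here.

```python
def place_packets(buffer: bytearray, packets):
    """Write each payload at the offset carried in its DDP-style prefix.
    Arrival order does not matter: every packet self-describes where it
    belongs in the receive buffer (illustrative sketch)."""
    for offset, payload in packets:
        buffer[offset:offset + len(payload)] = payload
    return buffer

buf = bytearray(12)
# Packets arrive out of order after being rerouted over different paths.
out_of_order = [(6, b"world!"), (0, b"hello ")]
print(place_packets(buf, out_of_order).decode())  # → "hello world!"
```

This is why adaptive routing and DDP pair naturally: the fabric is free to reorder packets for load balance, and the NIC reassembles data in place without buffering for in-order delivery.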
By continuously balancing traffic, adaptive routing mitigates the long‑tail impact of elephant flows in AI workloads.
Switch Congestion‑Control and Cache Pooling
Modern switches implement two key mechanisms:
Real‑time monitoring of transmission rates and congestion levels; the switch chip processes local and neighboring node metrics and adjusts transmission rates based on congestion‑control algorithms.
Physical cache pooling, where each port’s receive and transmit rates dictate cache allocation, providing performance isolation between ports.
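The cache-pooling idea in the second mechanism can be sketched as a rate-proportional split of a shared packet buffer. This is a simplified model under assumed inputs; real switch chips add per-port minimum guarantees and dynamic headroom on top of this.

```python
def allocate_cache(total_cells, port_rates):
    """Split a shared packet buffer (in cells) among ports in proportion
    to each port's observed receive/transmit rate, so a busy port gets
    more cache without starving idle ports (simplified model)."""
    total_rate = sum(port_rates.values())
    return {port: total_cells * rate // total_rate
            for port, rate in port_rates.items()}

# Observed port rates in Gb/s; the busiest port gets the largest share.
rates = {"eth1": 400, "eth2": 100, "eth3": 100}
print(allocate_cache(6000, rates))
# → {'eth1': 4000, 'eth2': 1000, 'eth3': 1000}
```

Tying allocation to measured rates is what provides the performance isolation the article mentions: one congested port cannot monopolize buffer cells that other ports are actively using.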
Hardware Capacity Enhancements
Chip manufacturers have increased RoCE-compatible capacity. For instance, Broadcom's Tomahawk 5 switch chip offers a total switching capacity of 51.2 Tb/s, supporting 64 ports at 800 Gb/s each, approximately double the previous generation. Such capacity upgrades, together with RoCE's adaptive routing, congestion control, and cache pooling, require coordinated support from both switch ASICs and NIC firmware.
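The headline figure follows directly from the port configuration, as a quick check shows:

```python
# 64 ports × 800 Gb/s each = total switching capacity in Tb/s.
ports, port_speed_gbps = 64, 800
total_tbps = ports * port_speed_gbps / 1000
print(total_tbps)  # → 51.2
```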
White‑Box Switch Opportunities
White‑box switches adopt an open‑architecture model, combining commercial hardware with open‑source network operating systems. This flexibility enables vendors and cloud operators to develop custom algorithms for adaptive routing, congestion control, and cache management, further extending the applicability of RoCE in AI‑centric data centers.
Architects' Tech Alliance
Sharing project experience and insights into cutting-edge architectures, with a focus on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, and industry practices and solutions.