Why Ethernet Struggles with AI Workloads and How Adaptive Routing Solves It
This article analyzes how AI-driven elephant flows overload traditional Ethernet networks, causing long-tail latency and victim-flow congestion, and explains how adaptive routing, RDMA/RoCE features, advanced congestion-control algorithms, and high-capacity switch chips can mitigate these challenges.
Traditional cloud computing traffic consists of small, stable flows that rarely cause congestion, so conventional Ethernet routing policies work well. In contrast, AI inference clusters generate massive "elephant flows" that dominate a few paths, leading to a long‑tail effect where most links finish early while a few overloaded links delay overall system utilization.
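The imbalance described above can be illustrated with a small simulation. Static hash-based load balancing (ECMP-style) spreads many small flows evenly, but a handful of elephant flows carrying the same total bytes routinely collide on one link. This is an illustrative sketch, not a model of any specific switch; the function names and parameters are invented for the example.

```python
import random

def max_link_utilization(flow_sizes, num_links=4, trials=1000):
    """Hash each flow onto one of `num_links` equal-cost paths (static
    ECMP placement, modeled as a random choice) and return the average
    share of total traffic carried by the busiest link."""
    worst = 0.0
    for _ in range(trials):
        load = [0] * num_links
        for size in flow_sizes:
            load[random.randrange(num_links)] += size  # static hash placement
        worst += max(load) / sum(load)
    return worst / trials

# Cloud-style traffic: many small, similar flows.
mice = [1] * 400
# AI-style traffic: a few elephant flows carrying the same total bytes.
elephants = [100] * 4

print(f"mice:      busiest link carries ~{max_link_utilization(mice):.0%}")
print(f"elephants: busiest link carries ~{max_link_utilization(elephants):.0%}")
```

With 400 small flows over 4 links the busiest link stays near the fair share of 25%; with 4 elephant flows it carries roughly half the traffic on average, which is exactly the long-tail effect the article describes.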
Victim Flow Phenomenon
When multiple AI workloads share common leaf or spine switches, a "many-to-one" (victim flow) situation arises. For example, load A (ports 1-3) and load B (port 4) both traverse shared switches a and b. Congestion on switch a caused by load A propagates to switch b, and even though load B uses a different set of ports, both flows end up with reduced steady-state bandwidth.
RDMA/RoCE Adaptive Routing Solution
RDMA‑based networks address this problem through adaptive routing that dynamically monitors port queue lengths and redirects new packets to the least‑loaded ports or paths, achieving load balancing across the fabric.
The switch evaluates each port’s output queue status, determines its load, and routes incoming packets to the port/path with the smallest load.
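The port-selection decision above can be sketched in a few lines. This is a hypothetical helper, not a real switch API: it assumes the chip exposes per-port output queue depths and simply picks the least-loaded candidate for each new packet.

```python
def select_egress_port(queue_depths, candidate_ports):
    """Core decision of per-packet adaptive routing: among the equal-cost
    candidate ports, pick the one whose output queue is shortest.
    (Illustrative sketch; `queue_depths` stands in for the hardware's
    sampled queue-occupancy counters.)"""
    return min(candidate_ports, key=lambda p: queue_depths[p])

# Sampled output-queue depth (in buffer cells) per port.
depths = {1: 120, 2: 5, 3: 64, 4: 5}
print(select_egress_port(depths, candidate_ports=[1, 2, 3]))  # → 2
```

Because the decision is re-evaluated per packet rather than per flow, an elephant flow is spread across paths instead of pinning one of them, at the cost of possible reordering, which is handled next.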
Rerouted packets may arrive out of order; the DDP (Direct Data Placement) protocol embeds a prefix indicating the original memory location, allowing the NIC to reorder data correctly before placement.
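The key idea of direct placement is that each packet carries its destination offset, so the NIC can write payloads straight into the target buffer regardless of arrival order. The sketch below is a toy model of that mechanism; real DDP headers carry additional fields beyond the offset used here.

```python
def place_packets(buffer: bytearray, packets):
    """Write each payload at the offset carried in its DDP-style prefix.
    Arrival order does not matter: every packet self-describes where it
    belongs in the receive buffer (illustrative sketch)."""
    for offset, payload in packets:
        buffer[offset:offset + len(payload)] = payload
    return buffer

buf = bytearray(12)
# Packets arrive out of order after being rerouted over different paths.
out_of_order = [(6, b"world!"), (0, b"hello ")]
print(place_packets(buf, out_of_order).decode())  # → "hello world!"
```

This is why adaptive routing and DDP pair naturally: the fabric is free to reorder packets for load balance, and the NIC reassembles data in place without buffering for in-order delivery.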
By continuously balancing traffic, adaptive routing mitigates the long‑tail impact of elephant flows in AI workloads.
Switch Congestion‑Control and Cache Pooling
Modern switches implement two key mechanisms:
Real‑time monitoring of transmission rates and congestion levels; the switch chip processes local and neighboring node metrics and adjusts transmission rates based on congestion‑control algorithms.
Physical cache pooling, where each port’s receive and transmit rates dictate cache allocation, providing performance isolation between ports.
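The cache-pooling idea in the second mechanism can be sketched as a rate-proportional split of a shared packet buffer. This is a simplified model under assumed inputs; real switch chips add per-port minimum guarantees and dynamic headroom on top of this.

```python
def allocate_cache(total_cells, port_rates):
    """Split a shared packet buffer (in cells) among ports in proportion
    to each port's observed receive/transmit rate, so a busy port gets
    more cache without starving idle ports (simplified model)."""
    total_rate = sum(port_rates.values())
    return {port: total_cells * rate // total_rate
            for port, rate in port_rates.items()}

# Observed port rates in Gb/s; the busiest port gets the largest share.
rates = {"eth1": 400, "eth2": 100, "eth3": 100}
print(allocate_cache(6000, rates))
# → {'eth1': 4000, 'eth2': 1000, 'eth3': 1000}
```

Tying allocation to measured rates is what provides the performance isolation the article mentions: one congested port cannot monopolize buffer cells that other ports are actively using.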
Hardware Capacity Enhancements
Chip manufacturers have increased RoCE-compatible capacity. For instance, Broadcom's Tomahawk 5 switch chip offers a total switching capacity of 51.2 Tb/s, supporting 64 ports at 800 Gb/s each, approximately double the previous generation. Such capacity upgrades, together with RoCE's adaptive routing, congestion control, and cache pooling, require coordinated support from both switch ASICs and NIC firmware.
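The headline figure follows directly from the port configuration, as a quick check shows:

```python
# 64 ports × 800 Gb/s each = total switching capacity in Tb/s.
ports, port_speed_gbps = 64, 800
total_tbps = ports * port_speed_gbps / 1000
print(total_tbps)  # → 51.2
```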
White‑Box Switch Opportunities
White‑box switches adopt an open‑architecture model, combining commercial hardware with open‑source network operating systems. This flexibility enables vendors and cloud operators to develop custom algorithms for adaptive routing, congestion control, and cache management, further extending the applicability of RoCE in AI‑centric data centers.
Architects' Tech Alliance
Sharing project experience and insights into cutting-edge architectures, with a focus on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, and industry practices and solutions.