Why Traditional ECMP Fails for AI Workloads and How Modern Load‑Balancing Solves It
The article examines the rapid growth of AI‑driven compute demand, explains why conventional ECMP load balancing struggles with uneven, high‑bandwidth flows in data‑center networks, and compares advanced strategies such as Fat‑Tree design, VoQ, flow‑based, packet‑based, flowlet, and cell‑based approaches, including vendor implementations.
With AI technology advancing rapidly, intelligent-computing capacity in China is projected to grow at a compound annual rate exceeding 50% over the next five years, driving demand for data-center networks with larger scale, higher bandwidth, lower latency, and greater reliability.
Data‑center topologies like Spine‑Leaf are relatively regular, but the presence of many equal‑cost parallel paths (e.g., dozens in a Fat‑Tree) makes load‑balancing routing a critical design challenge.
Traditional Equal‑Cost Multi‑Path (ECMP) selects next hops by hashing packet header fields, keeping packets of the same flow on the same path. While simple, ECMP cannot evenly distribute traffic when flows are highly skewed (e.g., elephant flows), leading to link congestion, especially in HPC and AI scenarios that use RDMA and demand massive bandwidth.
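The per-flow behavior described above can be sketched in a few lines. This is a minimal illustration, not a switch implementation: it assumes CRC32 as the hash, whereas real ASICs use vendor-specific hash functions over configurable header fields.

```python
import zlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Hash the 5-tuple so every packet of a flow picks the same path."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return zlib.crc32(key) % num_paths

# All packets of one flow hash to the same next hop, so a single
# elephant flow pins its full bandwidth onto one of the 8 paths.
path_a = ecmp_next_hop("10.0.0.1", "10.0.1.1", 49152, 4791, 17, 8)
path_b = ecmp_next_hop("10.0.0.1", "10.0.1.1", 49152, 4791, 17, 8)
assert path_a == path_b
```

Because the path depends only on header fields, the hash is blind to flow size and link load, which is exactly why skewed flows congest individual links.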
Two main congestion types appear: end‑side congestion (e.g., Incast) and matrix congestion caused by hash imbalance. The article focuses on solving matrix congestion.
Fat‑Tree architecture: provision aggregation links at a 1:1 convergence (oversubscription) ratio, so uplink bandwidth matches downlink bandwidth and the fabric is non‑blocking.
Virtual Output Queuing (VoQ): creates a separate output queue for each destination port, eliminating head‑of‑line blocking and improving throughput.
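The VoQ idea can be sketched as follows; the class name and scheduler interface are illustrative assumptions, since real VoQ lives in switch hardware with a fabric arbiter rather than Python objects.

```python
from collections import deque

class VoQIngress:
    """One virtual queue per egress port at the ingress: a packet stuck
    behind a congested port no longer blocks packets bound for other
    ports (no head-of-line blocking)."""

    def __init__(self, num_ports):
        self.voqs = [deque() for _ in range(num_ports)]

    def enqueue(self, packet, out_port):
        self.voqs[out_port].append(packet)

    def dequeue(self, ready_ports):
        """Serve any egress port the fabric scheduler reports as ready."""
        for port in ready_ports:
            if self.voqs[port]:
                return port, self.voqs[port].popleft()
        return None

sw = VoQIngress(num_ports=2)
sw.enqueue("to-port0", 0)
sw.enqueue("to-port1", 1)
# Port 0 is congested: the fabric only grants port 1, which still drains.
sw.dequeue(ready_ports=[1])  # → (1, "to-port1")
```

With a single shared FIFO, the packet for port 1 would have to wait behind the blocked packet for port 0; per-destination queues remove that coupling.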
Load‑balancing granularity: flow‑based, packet‑based, flowlet‑based, and cell‑based methods, each offering different trade‑offs.
Flow‑based load balancing routes entire flows to a single equal‑cost path. ECMP’s hash treats large and small flows alike, causing bandwidth under‑utilization and inability to react to congestion, especially harmful for AI/ML workloads where a few massive flows dominate.
Packet‑based load balancing (Random Packet Spraying, RPS) distributes individual packets across parallel paths, achieving higher link utilization (often >90%) but may introduce packet reordering, which NVIDIA mitigates with DDP (Direct Data Placement) on BlueField‑3 DPUs.
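A toy sprayer makes the trade-off concrete. Round-robin is used here for determinism; hardware RPS typically picks paths randomly, and NVIDIA's per-packet mechanism is proprietary.

```python
import itertools

class PacketSprayer:
    """Spray successive packets across all parallel paths, ignoring
    flow identity entirely: utilization evens out, but packets of one
    flow can arrive out of order and must be re-sequenced at the end."""

    def __init__(self, num_paths):
        self._rr = itertools.cycle(range(num_paths))

    def pick_path(self):
        return next(self._rr)

sprayer = PacketSprayer(4)
picks = [sprayer.pick_path() for _ in range(1000)]
# each of the 4 paths carries exactly 250 of the 1000 packets
```

The receiver-side reordering cost is what DDP on BlueField-3 absorbs: data is placed directly at its final memory offset, so arrival order stops mattering.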
Flowlet‑based load balancing splits a TCP flow into bursts (flowlets) separated by sufficiently large inter‑packet gaps, allowing path changes between bursts while preserving packet order. Because short‑lived flows and RDMA traffic rarely exhibit such gaps, the technique is largely ineffective for them, limiting its use in AI/ML environments.
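The gap-detection logic can be sketched as follows. The round-robin path choice and the timeout value are illustrative assumptions; production flowlet schemes (e.g., in switch ASICs) pick the least-loaded path and tune the timeout to exceed the path-latency skew.

```python
class FlowletBalancer:
    """Re-pick a path only when the inter-packet gap exceeds the flowlet
    timeout: by then, in-flight packets on the old path have landed, so
    switching paths cannot reorder the flow."""

    def __init__(self, num_paths, gap_threshold):
        self.num_paths = num_paths
        self.gap = gap_threshold
        self.last_seen = {}   # flow -> (last_timestamp, current_path)
        self._next = 0        # round-robin cursor (illustrative only)

    def pick_path(self, flow, now):
        last = self.last_seen.get(flow)
        if last is None or now - last[0] > self.gap:
            # Gap is large enough: this packet starts a new flowlet,
            # so it is safe to move the flow to another path.
            path = self._next
            self._next = (self._next + 1) % self.num_paths
        else:
            path = last[1]    # same flowlet: stay on the current path
        self.last_seen[flow] = (now, path)
        return path

lb = FlowletBalancer(num_paths=4, gap_threshold=0.5)
p1 = lb.pick_path("flow-a", now=0.0)   # new flowlet
p2 = lb.pick_path("flow-a", now=0.1)   # gap 0.1 < 0.5: same path as p1
p3 = lb.pick_path("flow-a", now=1.0)   # gap 0.9 > 0.5: may switch paths
```

This also shows why the method fails for RDMA: a line-rate RDMA flow never leaves a gap larger than the threshold, so the flow is never rebalanced.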
Cell‑based load balancing breaks packets into small cells and schedules them per‑link, offering the finest granularity and highest bandwidth utilization. However, it adds static latency, requires proprietary hardware, and is rarely deployed in AI compute clusters.
From a practical standpoint, the granularity hierarchy is Cell → Packet → Flowlet → Flow, corresponding to decreasing bandwidth efficiency. Because of hardware constraints, cell‑based solutions are seldom used in intelligent computing.
Modern AI/ML workloads rely on RoCEv2 to reduce CPU overhead, but RoCEv2 lacks robust loss protection. While PFC and ECN help, they cannot fully mitigate the impact of skewed elephant/mouse flows, making sophisticated load‑balancing essential.
Vendor examples:
NVIDIA’s Spectrum‑4 switch, combined with BlueField‑3 DPU and DDP, dynamically selects the least‑congested path per packet, preserving order and boosting performance in large‑scale, high‑load scenarios.
Huawei’s Intelligent Lossless Network uses Automatic ECN tuning (ACC) to adjust ECN marking thresholds in a distributed, per‑switch manner, prioritizing small‑flow latency while improving overall throughput.
Source: https://mp.weixin.qq.com/s/Eds7NKqBsejbiS2Tf0Cm_Q