Why ECMP Struggles in AI‑Driven Data Centers and Better Load‑Balancing Alternatives
As AI workloads drive intelligent compute capacity to grow at more than 50% CAGR, data‑center networks expose large numbers of equal‑cost parallel paths. Traditional flow‑based ECMP cannot spread traffic evenly across them and causes severe congestion, while finer‑grained schemes such as packet spraying, flowlet‑based, and cell‑based balancing deliver higher bandwidth utilization and fairness.
Background
Rapid AI development is driving an explosion of intelligent applications, and IDC predicts that China’s intelligent compute capacity will grow at a compound annual rate of over 50% in the next five years. This surge demands data‑center networks with larger scale, higher bandwidth, lower latency, and stronger reliability.
Topology and Load‑Balancing Challenge
Typical data‑center topologies such as Spine‑Leaf or Fat‑Tree provide many equal‑cost parallel paths. While routing is simple for regular topologies, the abundance of parallel links creates a critical load‑balancing problem: traffic must be evenly distributed across dozens of paths.
Traditional ECMP
Equal‑Cost Multi‑Path (ECMP) selects a next hop by hashing packet header fields and taking the result modulo the number of available paths. Packets of the same flow therefore always follow the same path, preserving order (flow‑based load balancing). However, ECMP cannot split traffic evenly when flow sizes are highly skewed (e.g., “elephant” vs. “mouse” flows). In AI/ML and HPC scenarios that use RDMA and generate massive per‑flow bandwidth, ECMP often causes hash‑induced congestion, link overload, and packet loss, which dramatically lengthens AI/ML job completion times.
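To make the mechanism concrete, here is a minimal sketch in Python; the next‑hop names and the CRC32 hash are illustrative stand‑ins for a real switch’s hash function, not any vendor’s implementation. The 5‑tuple is hashed and the result taken modulo the number of equal‑cost paths, so every packet of a flow maps to the same link.

```python
import zlib

# Hypothetical equal-cost next hops from a leaf switch toward the spines.
NEXT_HOPS = ["spine-1", "spine-2", "spine-3", "spine-4"]

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto):
    """Pick a next hop by hashing the 5-tuple and taking it modulo the path count.

    Every packet of the same flow yields the same hash, so the flow sticks to
    one path (no reordering), but two large flows can collide on one link.
    """
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    index = zlib.crc32(key) % len(NEXT_HOPS)
    return NEXT_HOPS[index]

# Two different flows may hash onto the same spine regardless of their size.
print(ecmp_next_hop("10.0.0.1", "10.0.1.1", 40000, 4791, "UDP"))
print(ecmp_next_hop("10.0.0.2", "10.0.1.2", 40001, 4791, "UDP"))
```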
Types of Congestion
End‑side congestion (e.g., Incast) – usually mitigated by congestion‑control algorithms.
Matrix (fabric) congestion – congestion inside the switching fabric caused by uneven hash distribution across parallel links; this is the type of congestion this article focuses on.
Mitigation Strategies
Fat‑Tree Architecture: increases aggregate bandwidth through a 1:1 (non‑blocking) input‑to‑output convergence design.
Virtual Output Queuing (VoQ): creates a separate queue for each destination port, eliminating head‑of‑line (HoL) blocking and improving throughput (a minimal sketch follows this list).
Load‑Balancing Granularity: the finer the scheme, the higher the achievable bandwidth utilization:
Cell‑based (finest granularity, highest utilization)
Packet‑based
Flowlet‑based
Flow‑based (coarsest granularity, lowest utilization)
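To make the VoQ idea concrete, here is a minimal sketch; the class and method names are illustrative, not any switch vendor’s implementation. An ingress keeps one queue per output port, so a packet waiting for a busy output never blocks packets destined for idle outputs.

```python
from collections import deque

class VoqIngress:
    """Toy ingress with one virtual output queue (VOQ) per destination port."""

    def __init__(self, num_outputs):
        self.voqs = [deque() for _ in range(num_outputs)]

    def enqueue(self, packet, out_port):
        self.voqs[out_port].append(packet)

    def schedule(self, free_outputs):
        """Dequeue at most one packet for each output that is currently free.

        With a single FIFO, a packet at the head waiting for a busy output
        would block everything behind it (head-of-line blocking); per-output
        queues let traffic for idle outputs proceed immediately.
        """
        sent = []
        for port in free_outputs:
            if self.voqs[port]:
                sent.append((port, self.voqs[port].popleft()))
        return sent

ingress = VoqIngress(num_outputs=4)
ingress.enqueue("pkt-A", out_port=2)   # destined for a busy port
ingress.enqueue("pkt-B", out_port=0)   # destined for an idle port
print(ingress.schedule(free_outputs=[0, 1, 3]))  # pkt-B goes out despite pkt-A waiting
```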
Flow‑Based Load Balancing
Routes entire flows to a single path using ECMP hashing. Problems include:
The hash ignores flow size, so elephant and mouse flows can land on the same link, causing severe imbalance in AI/ML workloads.
ECMP cannot detect congested links, potentially worsening congestion.
Fails in asymmetric topologies after failures, leading to traffic imbalance.
Although ECMP is simple and avoids packet reordering, its limitations motivate more sophisticated approaches such as Hedera, which uses a centralized controller to detect and reroute large flows, and BurstBalancer.
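The imbalance is easy to reproduce in a back‑of‑the‑envelope simulation; the flow sizes, flow counts, and path count below are made up purely for illustration. Even though the hash spreads flows evenly by count, whichever link the few elephant flows land on ends up far more loaded than the rest.

```python
import random
import zlib

NUM_PATHS = 4
random.seed(1)

# Hypothetical traffic mix: many small "mouse" flows plus a few "elephant" flows.
flows = [("mouse", random.randint(1, 10)) for _ in range(200)]         # MB each
flows += [("elephant", random.randint(2000, 4000)) for _ in range(4)]  # MB each

load = [0] * NUM_PATHS
for i, (_, size_mb) in enumerate(flows):
    # Stand-in for a 5-tuple hash: each flow is pinned to one path.
    path = zlib.crc32(f"flow-{i}".encode()) % NUM_PATHS
    load[path] += size_mb

print("Per-path load (MB):", load)
# The flow count per path is roughly even, but the links carrying the
# elephants are heavily overloaded -- hash-induced congestion.
```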
Packet‑Based Load Balancing (Random Packet Spraying, RPS)
RPS distributes the individual packets of a flow across all equal‑cost paths, achieving much finer granularity than ECMP. Benefits include higher link utilization (often above 90%) and improved end‑to‑end throughput. The main drawback is packet reordering, which must be handled at the receiver (e.g., NVIDIA’s BlueField‑3 DPU uses Direct Data Placement to put out‑of‑order data directly at its correct location in memory).
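A minimal sketch of packet spraying follows, with a simple receive‑side reorder buffer standing in for real mechanisms such as DDP; the path list and per‑path delays are hypothetical and chosen only to make reordering visible. Each packet independently picks a path, so packets can arrive out of order and are put back in sequence by the receiver.

```python
import heapq
import random

random.seed(7)
PATHS = ["spine-1", "spine-2", "spine-3", "spine-4"]
# Hypothetical one-way delay per path (ms) so that reordering actually shows up.
DELAY_MS = {"spine-1": 1.0, "spine-2": 3.5, "spine-3": 0.8, "spine-4": 2.2}

def spray(flow_packets):
    """Send each packet on a randomly chosen equal-cost path (per-packet spraying)."""
    arrivals = []
    for seq, payload in enumerate(flow_packets):
        path = random.choice(PATHS)
        arrivals.append((DELAY_MS[path], seq, payload, path))
    # Sort by arrival time: later sequence numbers can arrive first.
    arrivals.sort()
    return arrivals

def reorder(arrivals):
    """Receive-side reorder buffer: release packets strictly in sequence order."""
    pending, delivered, expected = [], [], 0
    for _, seq, payload, _ in arrivals:
        heapq.heappush(pending, (seq, payload))
        while pending and pending[0][0] == expected:
            delivered.append(heapq.heappop(pending)[1])
            expected += 1
    return delivered

packets = [f"pkt-{i}" for i in range(8)]
arrivals = spray(packets)
print("arrival order :", [seq for _, seq, _, _ in arrivals])
print("after reorder :", reorder(arrivals))
```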
Flowlet‑Based Load Balancing
Flowlet switching exploits the burstiness of TCP traffic: a flow naturally breaks into short bursts (flowlets) separated by idle gaps. When a gap is longer than the delay difference between paths, the previous burst has already drained from the network, so each flowlet can be routed independently with little risk of reordering. The limitation is this reliance on burst patterns: short‑lived connections and RDMA traffic often lack suitable gaps, so flowlet balancing degenerates toward flow‑based behavior.
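Here is a minimal sketch of flowlet switching, assuming a hypothetical 500 µs idle‑gap threshold and random path selection. When the gap since the flow’s last packet exceeds the threshold, the flow is allowed to move to a new path, because the burst already in flight has drained.

```python
import random

FLOWLET_GAP_S = 0.0005          # hypothetical idle-gap threshold (500 us)
PATHS = ["spine-1", "spine-2", "spine-3", "spine-4"]

class FlowletBalancer:
    """Per-flow state: last-seen timestamp and currently assigned path."""

    def __init__(self):
        self.state = {}  # flow_key -> (last_seen, path)

    def pick_path(self, flow_key, now):
        last_seen, path = self.state.get(flow_key, (None, None))
        if last_seen is None or now - last_seen > FLOWLET_GAP_S:
            # A long enough gap means in-flight packets have drained:
            # start a new flowlet and re-pick the path.
            path = random.choice(PATHS)
        self.state[flow_key] = (now, path)
        return path

balancer = FlowletBalancer()
# Two bursts of the same flow separated by a 2 ms idle gap.
times = [0.0, 0.0001, 0.0002, 0.0022, 0.0023]
for t in times:
    print(f"t={t * 1000:.1f} ms ->", balancer.pick_path(("10.0.0.1", "10.0.1.1", 4791), t))
```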
Cell‑Based Load Balancing
In cell‑based switching, packets are sliced into small, fixed‑size cells that are scheduled individually based on real‑time link availability. This yields the highest theoretical bandwidth utilization, but it requires specialized hardware and is today used mainly inside chassis fabrics or chassis‑like architectures such as AT&T’s DDC (Distributed Disaggregated Chassis) specification. In AI/ML environments, the added static latency (≈1.4×) and the hardware dependency limit its practical adoption.
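A minimal sketch of the cell idea follows; the cell size, link names, and queue model are made up for illustration and do not reflect any real fabric chip. Each packet is sliced into fixed‑size cells, every cell is sent on the currently least‑loaded link, and the egress side reassembles the packet from its cells.

```python
CELL_SIZE = 256          # hypothetical cell size in bytes
LINKS = ["fabric-0", "fabric-1", "fabric-2"]

def slice_into_cells(packet_id, payload):
    """Cut one packet into fixed-size cells tagged with (packet_id, cell_index, data)."""
    return [(packet_id, i, payload[off:off + CELL_SIZE])
            for i, off in enumerate(range(0, len(payload), CELL_SIZE))]

def schedule_cells(cells):
    """Send each cell on the link with the fewest queued bytes (per-cell balancing)."""
    queued = {link: 0 for link in LINKS}
    placement = []
    for pkt_id, idx, data in cells:
        link = min(queued, key=queued.get)
        queued[link] += len(data)
        placement.append((link, pkt_id, idx))
    return placement, queued

def reassemble(cells):
    """Egress reassembly: order the cells of a packet by index and concatenate them."""
    ordered = sorted(cells, key=lambda c: c[1])
    return b"".join(data for _, _, data in ordered)

cells = slice_into_cells("pkt-0", b"x" * 1000)
placement, queued = schedule_cells(cells)
print("bytes queued per link:", queued)
print("reassembled length:", len(reassemble(cells)))
```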
Industry Practices
Both NVIDIA and Huawei address the ECMP shortcomings:
NVIDIA’s Spectrum‑4 switch, combined with the BlueField‑3 DPU, monitors per‑link congestion and applies dynamic per‑packet load balancing, while the DPU’s Direct Data Placement handles the resulting out‑of‑order arrivals.
Huawei’s ACC (Automatic ECN Tuning) dynamically adjusts ECN marking thresholds across switches through online training, balancing the needs of small and large flows and reducing completion times for both mouse and elephant flows.
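As a loose illustration of why dynamic thresholds help, the toy sketch below adjusts an ECN marking threshold with a simple heuristic; it is not ACC’s actual algorithm (which relies on online training), and the bounds, target, and step factors are invented. The idea is to mark earlier when queues stay deep, signaling congestion sooner, and relax the threshold when queues stay shallow so throughput is not sacrificed.

```python
# Toy heuristic for dynamic ECN threshold tuning -- an illustration of the
# general idea only, NOT Huawei's ACC algorithm (which uses online training).
MIN_KMIN, MAX_KMIN = 20, 400     # hypothetical bounds on the marking threshold (KB)

def adjust_ecn_threshold(kmin, avg_queue_kb, target_queue_kb=100):
    """Nudge the ECN marking threshold toward keeping the average queue near a target."""
    if avg_queue_kb > target_queue_kb:
        kmin = max(MIN_KMIN, int(kmin * 0.8))   # queues too deep: mark earlier
    else:
        kmin = min(MAX_KMIN, int(kmin * 1.1))   # queues shallow: allow more buffering
    return kmin

kmin = 200
for avg_q in [350, 300, 180, 90, 60]:            # sampled average queue depths (KB)
    kmin = adjust_ecn_threshold(kmin, avg_q)
    print(f"avg queue {avg_q:3d} KB -> Kmin {kmin} KB")
```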
Conclusion
ECMP remains widely used due to its simplicity, but its coarse granularity and inability to react to congestion make it unsuitable for modern AI/ML and HPC workloads. Selecting a more granular load‑balancing scheme—packet‑based, flowlet‑based, or cell‑based—depends on the specific traffic patterns, hardware capabilities, and latency tolerance of the target data‑center environment.