
Can NSLB Double AI Training Speed? Inside the 113% Performance Gain Over ECMP

The article analyzes AI‑training traffic patterns, critiques existing flow‑based, flowlet‑based, and packet‑based ECMP load‑balancing, introduces the NSLB solution tailored for AI clusters, and presents experimental results showing up to 113% speed improvement and sub‑millisecond failover with DPFF, while also discussing direct‑topology and intelligent lossless networking techniques.

Architects' Tech Alliance

Sep 9, 2023


AI Training Traffic Characteristics and Challenges

During AI cluster training, parameters are synchronized across servers via high‑speed interconnects, producing traffic that is periodic, low in flow count, long‑lived, and requires strong real‑time synchronization. These traits cause uneven load distribution and reduced network throughput, limiting overall training performance.

Limitations of Existing Load‑Balancing Techniques

Three mainstream load‑balancing methods are used in data‑center networks: flow‑based ECMP, flowlet‑based ECMP, and packet‑based ECMP. Flow‑based ECMP hashes the five‑tuple of each flow. It works well when there are many flows, but suffers from hash collisions in low‑flow scenarios such as AI training, where a few large flows can land on the same link while other links sit idle. Flowlet‑based ECMP relies on a correctly configured flowlet gap (the idle time between bursts of the same flow). This gap is difficult to set without global path‑delay knowledge, and a poorly chosen value can cause packet reordering at the receiver. Packet‑based ECMP offers the best theoretical balance but introduces severe reordering in practice and is rarely deployed.
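The collision problem in low‑flow scenarios can be illustrated with a small simulation. This is only a sketch under stated assumptions: the hash is modeled as a uniform random choice of link, all flows are the same size, and the function names are hypothetical rather than taken from any switch software.

```python
import random
from collections import Counter


def collision_free_probability(num_flows: int, num_links: int) -> float:
    """Probability that every flow hashes to a distinct link (uniform hash assumed)."""
    if num_flows > num_links:
        return 0.0
    p = 1.0
    for i in range(num_flows):
        p *= (num_links - i) / num_links
    return p


def average_max_share(num_flows: int, num_links: int, trials: int = 2000, seed: int = 0) -> float:
    """Average fraction of flows carried by the busiest link over many random hash outcomes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        links = Counter(rng.randrange(num_links) for _ in range(num_flows))
        total += max(links.values()) / num_flows
    return total / trials


if __name__ == "__main__":
    # Hypothetical scenario: 4 equal-cost links, so the ideal share per link is 25%.
    print(f"P(no collision, 4 flows on 4 links) = {collision_free_probability(4, 4):.4f}")
    for flows in (4, 8, 4000):
        share = average_max_share(flows, num_links=4, trials=200)
        print(f"{flows} flows: busiest link carries {share:.0%} of flows on average")
```

With 4 flows on 4 links, only 4!/4^4 (about 9.4%) of hash outcomes place every flow on its own link. With thousands of flows the busiest link stays close to the ideal share, which is why flow‑based ECMP behaves well in conventional data‑center traffic but not in AI training.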

NSLB: A Load‑Balancing Innovation for AI Workloads

NSLB (Network Scale Load Balance) is designed specifically for AI training traffic. It collects flow information from the whole network and feeds it into a purpose‑built routing algorithm that computes forwarding paths, reportedly achieving 100% traffic balance and a noticeable gain in AI training performance.
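The idea of computing paths from a global view of flows, rather than hashing each flow independently, can be sketched as follows. This is not the NSLB algorithm, whose details the article does not give. It is a simple greedy placement (largest flow first onto the least‑loaded path), and the flow names, path names, and demand values are all hypothetical.

```python
import heapq


def assign_flows_globally(flow_demands: dict, paths: list) -> dict:
    """Greedy placement: take flows largest-first, put each on the least-loaded path so far."""
    heap = [(0.0, path) for path in paths]
    heapq.heapify(heap)
    placement = {}
    for flow, demand in sorted(flow_demands.items(), key=lambda kv: -kv[1]):
        load, path = heapq.heappop(heap)  # least-loaded path at this point
        placement[flow] = path
        heapq.heappush(heap, (load + demand, path))
    return placement


def path_loads(flow_demands: dict, placement: dict, paths: list) -> dict:
    """Sum the demand placed on each path."""
    loads = {path: 0.0 for path in paths}
    for flow, path in placement.items():
        loads[path] += flow_demands[flow]
    return loads


if __name__ == "__main__":
    # Hypothetical example: four equal long-lived flows, two spine paths.
    demands = {"f1": 100.0, "f2": 100.0, "f3": 100.0, "f4": 100.0}
    spines = ["spine1", "spine2"]
    placement = assign_flows_globally(demands, spines)
    print(path_loads(demands, placement, spines))
```

Because the placement sees all flows at once, equal flows are spread evenly across the paths, whereas independent hashing offers no such guarantee when the flow count is small.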

Experiment setup: 12 GPU servers (each with a Tesla V100S 32 GB GPU) connected via a 2‑tier CLOS network built from four Huawei switches (100 GE ports). The VGG‑16 model was run on TensorFlow. Results:

Single‑task Ring algorithm: NSLB outperformed typical ECMP by 113.41%.

Dual‑task Ring algorithm: NSLB achieved a 57.29% improvement over ECMP.
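To relate these percentages to the "double the speed" question in the title, the arithmetic can be written out. The sketch assumes the percentages are relative gains in training throughput over the ECMP baseline; the article does not state the exact metric, and the helper names are hypothetical.

```python
def speedup_factor(improvement_percent: float) -> float:
    """Convert 'X% better than baseline' into a multiplicative speedup."""
    return 1.0 + improvement_percent / 100.0


def time_reduction_percent(improvement_percent: float) -> float:
    """Reduction in time per unit of work implied by a throughput gain."""
    return (1.0 - 1.0 / speedup_factor(improvement_percent)) * 100.0


if __name__ == "__main__":
    for label, gain in (("single-task Ring", 113.41), ("dual-task Ring", 57.29)):
        print(f"{label}: {speedup_factor(gain):.4f}x throughput, "
              f"{time_reduction_percent(gain):.1f}% less time per step")
```

Under that assumption, a 113.41% gain corresponds to a 2.1341× speedup, slightly more than double, while the dual‑task figure corresponds to 1.5729×.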

DPFF: Data‑Plane Fast Failover for Rapid Fault Recovery

To address slow fault convergence, the DPFF (Data Plane Fast Failover) technique leverages programmable forwarding‑chip hardware to detect failures and switch paths in sub‑millisecond (<1 ms) time, dramatically reducing impact on high‑performance databases, storage, and supercomputing workloads.

Test methodology: Four Huawei switches formed a 2‑tier CLOS network; vdbench generated I/O to an SSD array (256 KB messages, 16 threads, write I/O). A fiber cut simulated a link failure. Under DPFF, IOPS remained virtually unchanged, whereas OSPF convergence caused IOPS to drop to zero for several seconds.
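The difference between data‑plane failover and control‑plane convergence can be illustrated with a toy forwarding table. This is a conceptual sketch only, not Huawei's DPFF implementation: it assumes a backup next hop is precomputed for each prefix so the forwarding path can switch as soon as a port is marked down, without waiting for a routing protocol such as OSPF to reconverge. The class names, prefix, and port names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FastFailoverEntry:
    primary: str  # preferred egress port
    backup: str   # precomputed alternative egress port


@dataclass
class DataPlane:
    fib: Dict[str, FastFailoverEntry]
    ports_up: Dict[str, bool] = field(default_factory=dict)

    def port_down(self, port: str) -> None:
        """Record a locally detected link failure (e.g. loss of signal)."""
        self.ports_up[port] = False

    def next_hop(self, prefix: str) -> Optional[str]:
        """Pick the primary port if it is up, else the precomputed backup, else None."""
        entry = self.fib[prefix]
        if self.ports_up.get(entry.primary, True):
            return entry.primary
        if self.ports_up.get(entry.backup, True):
            return entry.backup
        return None


if __name__ == "__main__":
    dp = DataPlane(fib={"10.0.0.0/24": FastFailoverEntry("port1", "port2")})
    print(dp.next_hop("10.0.0.0/24"))  # primary in use
    dp.port_down("port1")              # simulated fiber cut
    print(dp.next_hop("10.0.0.0/24"))  # traffic moves to the backup immediately
```

The lookup changes on the very next packet after the port is marked down, which is the property that keeps IOPS steady in the test above; a control‑plane protocol would first have to detect the failure, flood updates, and recompute routes.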

An additional benchmark using TPC‑C‑style online transaction processing showed that, compared to OSPF, DPFF reduced the drop in transactions per 100 ms interval by 60‑80%.

Direct‑Topology and Intelligent Lossless Networking

Traditional 3‑tier CLOS architectures struggle to scale to 10 EFLOPS‑class (10E‑level) clusters because of switch port limits and the growing hop count. Direct‑topology designs reduce network diameter and hop count, cutting switch count by ~40% and achieving three‑hop end‑to‑end communication for 100 k‑node clusters.
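The port‑limit constraint can be made concrete with the textbook k‑ary fat‑tree, one common way to build a 3‑tier CLOS network. This is an assumption for illustration: the article does not say which CLOS construction it has in mind, and the function names are hypothetical. In a k‑ary fat‑tree built from k‑port switches, the network supports k^3/4 hosts and needs 5k^2/4 switches.

```python
def fat_tree_hosts(k: int) -> int:
    """Maximum hosts in a 3-tier k-ary fat-tree built from k-port switches (k even)."""
    return k ** 3 // 4


def fat_tree_switches(k: int) -> int:
    """Total switches (edge + aggregation + core) in a k-ary fat-tree (k even)."""
    return 5 * k ** 2 // 4


def min_even_radix(hosts: int) -> int:
    """Smallest even switch radix k whose fat-tree can attach the requested host count."""
    k = 2
    while fat_tree_hosts(k) < hosts:
        k += 2
    return k


if __name__ == "__main__":
    print(fat_tree_hosts(64), fat_tree_switches(64))  # capacity with 64-port switches
    k = min_even_radix(100_000)
    print(k, fat_tree_switches(k))                     # radix and switches for 100k hosts
```

Under this model, 64‑port switches top out at 65,536 hosts, so a 100 k‑node cluster needs a radix of at least 74 and thousands of switches, with worst‑case paths crossing five switches. That is the scaling pressure that motivates lower‑diameter direct topologies.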

An intelligent lossless algorithm (iLossless) replaces expert‑tuned configurations, dynamically adjusting queue‑level resources based on real‑time traffic predictions to prevent cross‑traffic interference in hyper‑converged Ethernet environments. Experiments with a 3‑switch spine‑leaf network (each leaf serving 16 × 100 GE servers) demonstrated a reduction of over 20% in overall compute latency while maintaining storage performance.

Additional High‑Performance Network Benchmarks

Using OSU MPI benchmarks on the same 12‑GPU cluster, All‑Reduce communication time improved by up to 39.47% relative to a baseline fat‑tree (FT) topology, and All‑to‑All communication time improved by up to 56.53%.

These results collectively illustrate how tailored load‑balancing, fast data‑plane failover, and intelligent lossless traffic management can substantially boost AI training efficiency and data‑center network reliability.
