How ByteDance Scales Attn/MoE: Cost Models, Mesh Communication, and Network Hacks
The article analyzes ByteDance's MegaScale‑Infer paper, detailing micro‑batching, M:N Attn‑MoE ratios, cost‑driven constraint search, communication redesign with Mesh All‑2‑All, network latency challenges, and innovative NIC and routing solutions for large‑scale mixture‑of‑experts inference.
The author examined ByteDance's paper "MegaScale‑Infer: Serving Mixture‑of‑Experts at Scale with Disaggregated Expert Parallelism" and noted that both ByteDance and Nvidia still require technical assistance despite the paper's innovations.
The paper presents a simple narrative: use micro‑batching and an M:N Attn‑MoE ratio combined with heterogeneous compute resources to lower inference cost.
Increasing batch size reveals a core issue: on low‑power GPUs (e.g., H20) the attention computation becomes memory‑bound and slow, while high‑power GPUs (e.g., H800) suffer low utilization of GroupGEMM for small batches; an 80 GB memory limit further restricts batch size, forcing the use of large‑scale expert parallelism (e.g., 144/320 EP).
ByteDance supplied a cost‑analysis table and built a constraint‑search algorithm that matches the model's inference compute demand against SLO requirements. The search is straightforward: enumerate all possible configurations, store the results in a table, and query that table with pandas for fast lookup.
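The enumerate-then-query idea can be sketched as follows. This is a minimal illustration, not the paper's cost model: every constant, the throughput and latency formulas, and the function names are hypothetical, and plain Python lists stand in for the pandas table to keep the sketch dependency-free.

```python
from itertools import product

# Hypothetical per-node figures, chosen only to make the sketch runnable.
ATTN_TOKENS_PER_S = 1000.0    # assumed throughput of one attention node
EXPERT_TOKENS_PER_S = 2500.0  # assumed throughput of one expert node
ATTN_COST = 1.0               # assumed relative cost of one attention node
EXPERT_COST = 2.0             # assumed relative cost of one expert node
BASE_LATENCY_MS = 20.0        # assumed fixed latency per step
PER_MICRO_BATCH_MS = 5.0      # assumed extra latency per micro-batch


def build_table(max_attn=8, max_expert=8, micro_batches=(1, 2, 3, 4)):
    """Enumerate every (M, N, micro-batch) combination into a list of rows."""
    rows = []
    for m, n, mb in product(range(1, max_attn + 1),
                            range(1, max_expert + 1),
                            micro_batches):
        # Toy model: the slower of the two stages bounds throughput.
        throughput = min(m * ATTN_TOKENS_PER_S, n * EXPERT_TOKENS_PER_S)
        latency_ms = BASE_LATENCY_MS + PER_MICRO_BATCH_MS * mb
        cost = m * ATTN_COST + n * EXPERT_COST
        rows.append({
            "attn_nodes": m,
            "expert_nodes": n,
            "micro_batches": mb,
            "throughput": throughput,
            "latency_ms": latency_ms,
            "cost_per_token": cost / throughput,
        })
    return rows


def cheapest(rows, min_throughput, max_latency_ms):
    """Return the lowest cost-per-token row that meets the SLO, or None."""
    feasible = [r for r in rows
                if r["throughput"] >= min_throughput
                and r["latency_ms"] <= max_latency_ms]
    return min(feasible, key=lambda r: r["cost_per_token"], default=None)


table = build_table()
best = cheapest(table, min_throughput=5000.0, max_latency_ms=30.0)
```

Because the table is built once, each SLO query is just a filter and a minimum; the same rows could be loaded into a pandas DataFrame and filtered with boolean masks instead.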
The main difficulty lies in communication. ByteDance replaces homogeneous All‑2‑All with an M:N Mesh communication pattern. They observed that NCCL tests outperform perftest, especially at the P99 latency percentile, and attribute the root cause to queueing delay, which can be modeled with the Kingman formula. Based on this insight, they introduced a custom communication mechanism, adjusted congestion‑control algorithms, and raised ACK priority.
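For readers unfamiliar with the Kingman formula, it approximates the mean waiting time in a single-server (G/G/1) queue from the utilization and the variability of arrivals and service. The sketch below only evaluates that standard approximation; the function name and the sample numbers are illustrative and are not taken from the paper.

```python
def kingman_wait(utilization, ca, cs, mean_service_time):
    """Kingman's approximation of mean queueing delay for a G/G/1 queue.

    utilization: server load rho, must satisfy 0 <= rho < 1
    ca, cs: coefficients of variation of inter-arrival and service times
    mean_service_time: average service time, in any time unit
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    load_term = utilization / (1 - utilization)
    variability_term = (ca ** 2 + cs ** 2) / 2
    return load_term * variability_term * mean_service_time


# Illustrative numbers only: 2 us mean service time, exponential-like variability.
wait_at_half_load = kingman_wait(0.5, 1.0, 1.0, 2.0)
wait_at_high_load = kingman_wait(0.9, 1.0, 1.0, 2.0)
```

The load term grows without bound as utilization approaches 1, and the variability term scales it, which is why bursty traffic at high load shows up mainly in tail latency such as P99.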
On the network side, the author points out a design flaw in Mellanox (Nvidia) NICs: even with Adaptive Routing (AR) enabled, receive‑side ReOrder adds several microseconds of latency, and DDP implementations are incomplete. Two years ago the author’s team designed an eRDMA congestion‑control algorithm that explicitly considered AE separation, achieving near‑zero variance while fully utilizing bandwidth.
To avoid RoCE protocol shortcomings, the receive‑side ReOrder was replaced with iWARP DDP, which the author demonstrates with a benchmark where eRDMA still outperforms Nvidia's solution.
When deploying M:N, the expert side generates a large number of QPs, but the DPA has only 16 cores with 16 threads each, creating a compute bottleneck. The solution involves introducing DCT to handle QP scaling; eRDMA’s stateless subflow supports up to 128 K QPs with all paths open. However, congestion still forces a trade‑off between speed‑down and path switching, and implementing a free ReOrder buffer without cross‑core communication remains a challenge.
Additional networking considerations include reducing ECMP hash conflicts by using a multi‑rail topology, opting for high‑cost InfiniBand or cheaper SP4+BF3 with RoCE AR for lossless operation, and a hack using BlueField‑3L NICs to create a lossy multipath scheme where each DPA core probes RTT and dynamically changes UDP source ports.
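The RTT-probing hack can be pictured with a small host-side sketch. This is not DPA code and not the author's implementation: the class name, the EWMA smoothing factor, and the switch threshold are all assumptions, and the only idea carried over from the text is that a different UDP source port hashes onto a different ECMP path.

```python
class SourcePortSelector:
    """Track a smoothed RTT per UDP source port and leave slow paths."""

    def __init__(self, ports, alpha=0.125, switch_factor=1.5):
        if not ports:
            raise ValueError("need at least one candidate source port")
        self.alpha = alpha                  # assumed EWMA weight for new samples
        self.switch_factor = switch_factor  # assumed tolerance before switching
        self.srtt_us = {port: None for port in ports}
        self.current = ports[0]

    def record_rtt(self, port, rtt_us):
        """Fold one RTT probe result (microseconds) into the port's average."""
        old = self.srtt_us[port]
        if old is None:
            self.srtt_us[port] = rtt_us
        else:
            self.srtt_us[port] = (1 - self.alpha) * old + self.alpha * rtt_us

    def choose(self):
        """Return the source port to use for the next packet."""
        measured = {p: r for p, r in self.srtt_us.items() if r is not None}
        current_rtt = measured.get(self.current)
        if current_rtt is None:
            return self.current  # no data yet, stay on the current path
        best_port = min(measured, key=measured.get)
        # Switch only when the current path is clearly worse than the best one.
        if current_rtt > self.switch_factor * measured[best_port]:
            self.current = best_port
        return self.current


selector = SourcePortSelector([49152, 49153])
selector.record_rtt(49152, 10.0)
selector.record_rtt(49153, 9.0)
before = selector.choose()   # paths are comparable, no switch
selector.record_rtt(49152, 100.0)
after = selector.choose()    # smoothed RTT on 49152 is now 21.25 us
```

The threshold gives some hysteresis so a single noisy probe does not cause path flapping, which matters because every switch can reorder packets in a lossy multipath scheme.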
Reducing ECMP hashing conflicts. The conflict probability is reduced because each uplink has twice the bandwidth of a downlink. Second, the eight 200G NICs on the server are connected to eight different switches in a multi‑rail way.
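A toy calculation shows why doubling uplink bandwidth lowers the conflict probability. It assumes equal-rate flows that ECMP hashes uniformly and independently onto uplinks, which is a simplification, and the function name and flow counts are hypothetical. An uplink is "overloaded" when it is assigned more flows than it can carry at full rate.

```python
from collections import Counter
from itertools import product


def overload_probability(num_flows, num_uplinks, flows_per_uplink):
    """Exact probability that some uplink gets more flows than it can carry.

    Enumerates every assignment of flows to uplinks, so keep the inputs small.
    """
    total = 0
    overloaded = 0
    for assignment in product(range(num_uplinks), repeat=num_flows):
        total += 1
        if max(Counter(assignment).values()) > flows_per_uplink:
            overloaded += 1
    return overloaded / total


# Four flows over four uplinks: uplink equal to one flow vs. uplink equal to two flows.
same_speed = overload_probability(4, 4, flows_per_uplink=1)
double_speed = overload_probability(4, 4, flows_per_uplink=2)
```

In this toy case the overload probability drops from 0.90625 to 0.203125, because two flows can now share one uplink without exceeding its capacity.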
