Tagged articles
2 articles
Bilibili Tech
Dec 31, 2024 · Cloud Computing

Design and Implementation of Bilibili AI Compute Network: Topology, Hardware Selection, Load Balancing, and Monitoring

Bilibili designed and deployed an AI compute network for large language model training. The team chose a Fat-Tree topology, selected high-speed switches, optical modules, and fibers, implemented fixed-path load balancing, and built a monitoring platform with sub-second telemetry. It plans to scale the network to ten thousand GPUs.

AI compute network · Fat-Tree topology · hardware selection
17 min read
Architects' Tech Alliance
Sep 9, 2023 · Industry Insights

Can NSLB Double AI Training Speed? Inside the 113% Performance Gain Over ECMP

The article analyzes AI-training traffic patterns and critiques existing flow-based, flowlet-based, and packet-based ECMP load balancing. It introduces NSLB, a load-balancing solution tailored for AI clusters, and presents experimental results showing up to a 113% speed improvement and sub-millisecond failover with DPFF. It also discusses direct-topology and intelligent lossless networking techniques.

AI training · DPFF · NSLB
11 min read