Tencent Cloud Developer
Mar 22, 2023 · Artificial Intelligence
Tencent Star Network: High‑Performance GPU Cluster Architecture for Large‑Scale AI Model Training
Tencent’s Star Network is a 1.6 Tbps Ethernet RDMA fabric built on a fat-tree topology that supports up to 4K GPUs. It combines multi-track traffic aggregation, adaptive heterogeneous links, and a custom TCCL library, cutting AllReduce overhead from 35% to 3.7% and speeding up AI training iterations by 32%. Deployment is automated, and the network self-heals in under a second.
AI training · GPU clusters · RDMA
