Tencent Star Network: High‑Performance GPU Cluster Architecture for Large‑Scale AI Model Training
Tencent’s Star Network delivers a 1.6 Tbps Ethernet‑RDMA fabric built on a fat‑tree topology that supports up to 4K GPUs per cluster. Combined with multi‑track traffic aggregation, adaptive heterogeneous links, and the custom TCCL collective library, it cuts AllReduce communication overhead from 35% to 3.7% and speeds AI training iterations by 32%, while automating deployment and providing sub‑second self‑healing.
Recent breakthroughs in AIGC (ChatGPT, code generation, novel writing, etc.) rely on massive large‑model training that requires long‑running, large‑scale GPU clusters. The performance, reliability and cost of the underlying network become critical bottlenecks.
This article introduces the three mainstream GPU‑cluster network routes in the industry and presents Tencent’s own solution – the Star Network – which is designed to meet the extreme demands of AI training workloads.
Key technical features of the Star Network:
1.6 Tbps ultra‑high‑bandwidth Ethernet RDMA fabric, providing more than 10× communication speedup for AllReduce and All‑to‑All patterns.
Fat‑tree topology with support for up to 4K GPUs per cluster (scalable to 64K GPUs).
Multi‑track traffic aggregation that groups NICs belonging to the same rack into a single ToR switch, achieving >80 % traffic aggregation efficiency.
Heterogeneous network adaptive communication that jointly exploits inter‑node (NIC + switch) and intra‑node (NVLink/NVSwitch) links, delivering ~30 % performance gain for All‑to‑All at typical message sizes.
Custom collective communication library (TCCL) built on NCCL, tuned for the Star hardware, delivering ~40 % acceleration for AllReduce, AllGather and ReduceScatter.
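A simple cost model helps explain why raw link bandwidth dominates AllReduce time for the large gradient buffers typical of model training. The sketch below uses the standard alpha‑beta model for ring AllReduce; it is illustrative only, and the bandwidth, latency, and rank-count values are assumptions rather than Star Network measurements.

```python
# Illustrative ring-AllReduce cost model (alpha-beta model), not Tencent's
# actual performance model. For N ranks and a buffer of S bytes, ring
# AllReduce sends 2*(N-1)/N * S bytes per rank over 2*(N-1) steps.

def ring_allreduce_time(size_bytes: float, n_ranks: int,
                        bw_bytes_per_s: float, latency_s: float = 5e-6) -> float:
    """Estimated ring-AllReduce completion time: latency term + bandwidth term."""
    steps = 2 * (n_ranks - 1)                                # reduce-scatter + all-gather
    bytes_per_rank = 2 * (n_ranks - 1) / n_ranks * size_bytes
    return steps * latency_s + bytes_per_rank / bw_bytes_per_s

# Compare an assumed 100 Gbps per-node bandwidth with a 1.6 Tbps fabric
# for a 1 GiB gradient buffer across 64 ranks.
size = 1 << 30
t_100g = ring_allreduce_time(size, 64, 100e9 / 8)
t_1600g = ring_allreduce_time(size, 64, 1.6e12 / 8)
print(f"100 Gbps: {t_100g*1e3:.1f} ms, 1.6 Tbps: {t_1600g*1e3:.1f} ms, "
      f"speedup ~{t_100g/t_1600g:.1f}x")
```

Under these assumed parameters the bandwidth term dominates, and the 16× bandwidth increase translates into a better-than-10× AllReduce speedup, consistent with the figure quoted above.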
Performance measurements on GPT‑3 and T5‑MoE models show that the 1.6 Tbps fabric reduces communication overhead from 35 % to 3.7 % (AllReduce) and cuts iteration time by 32 %, effectively increasing cluster compute power by 48 %.
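These headline numbers are mutually consistent: if communication was 35% of the old iteration time and shrinks until it is only 3.7% of the new iteration (compute time unchanged), the iteration-time cut and effective compute gain follow by arithmetic. A quick check:

```python
# Sanity-check the quoted figures: communication drops from 35% of the old
# iteration to 3.7% of the new one, while compute time stays constant.
compute_frac = 1.0 - 0.35                  # compute share of the old iteration
new_total = compute_frac / (1.0 - 0.037)   # new iteration time (old iteration = 1.0)

iteration_cut = 1.0 - new_total            # fraction of iteration time saved
effective_compute_gain = 1.0 / new_total - 1.0  # throughput increase

print(f"iteration time cut by {iteration_cut:.1%}")                 # ~32%
print(f"effective compute power up {effective_compute_gain:.1%}")   # ~48%
```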
Beyond raw bandwidth, the solution includes a fully automated deployment pipeline that integrates NUMA, PCIe, NVSwitch, NIC and switch configuration, provides one‑click fault localization, and supports automatic health monitoring via Service Telemetry.
Operational features:
End‑to‑end network deployment integration reduces cluster rollout time from 19 days to 4.5 days with 100 % configuration accuracy.
One‑click fault diagnosis distinguishes between network‑side and application‑side issues, automatically isolates problematic NICs or switches, and triggers deterministic path switching using a hash‑offset algorithm to achieve sub‑second self‑healing.
Comprehensive validation steps (hardware checks, RDMA tests, collective library benchmarks, model‑level reliability tests) ensure that only fully verified clusters are delivered.
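The article does not detail the hash‑offset algorithm, but the general idea behind such schemes can be sketched: each flow hashes deterministically onto one of several equal‑cost paths, and when a path fails, a per‑flow offset is bumped so that only the affected flows move to a healthy path, with no global rehashing. Everything below (function names, path counts, flow IDs) is a hypothetical illustration, not Tencent's implementation.

```python
# Illustrative hash-offset path switching, NOT Tencent's actual algorithm:
# flows hash to one of n equal-cost paths; after a failure, incrementing a
# per-flow offset deterministically re-steers only the flows on the bad path.
import hashlib

def path_index(flow_id: str, n_paths: int, offset: int = 0) -> int:
    """Deterministic ECMP-style path choice with an adjustable offset."""
    digest = hashlib.md5(flow_id.encode()).digest()
    base = int.from_bytes(digest[:4], "big")
    return (base + offset) % n_paths

def reroute(flow_id: str, n_paths: int, failed: set) -> int:
    """Bump the offset until the flow lands on a healthy path."""
    offset = 0
    while path_index(flow_id, n_paths, offset) in failed:
        offset += 1
    return path_index(flow_id, n_paths, offset)

flows = [f"gpu{i}->gpu{(i + 1) % 8}" for i in range(8)]
failed_paths = {2}
for f in flows:
    before = path_index(f, 4)
    after = reroute(f, 4, failed_paths)
    # Flows that hashed onto the failed path move; all others keep their path.
    print(f, before, "->", after)
```

Because the mapping is a pure function of the flow ID and offset, every switch that knows the failed-path set computes the same new path, which is what makes sub‑second, coordination-free failover possible.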
Looking forward, the Star Network will be offered as a public‑cloud service on Tencent Cloud, paired with the A800 HCC 1.6 T instance, and will continue to evolve in bandwidth, heterogeneous communication, custom libraries and intelligent monitoring to support ever‑larger AI models.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.