How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling
This article details the design and implementation of 360’s AI Computing Center, covering server selection, network topology, Kubernetes scheduling, training and inference acceleration, and the AI platform’s core, visualization, and fault‑tolerance capabilities for large‑scale AI workloads.
Infrastructure Construction
360 AI Computing Center integrates AI, heterogeneous computing, big data, high‑performance networking and an AI development platform to provide efficient, intelligent compute power for complex AI tasks.
Server Selection
Example topology for A100/A800 includes 2 CPUs, 2 storage NICs, 4 PCIe Gen4 Switch chips, 6 NVSwitch chips, 8 GPUs, and 4 InfiniBand NICs.
Two 25 Gb/s storage NICs are bonded (bond mode 4, LACP) for 50 Gb/s of aggregate bandwidth. On top of this, software optimizations such as distributed checkpoint storage and multi‑stage asynchronous saving cut checkpoint time from 383 s to 5 s, a more than 70‑fold improvement.
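The idea behind multi‑stage asynchronous saving can be sketched as follows: take a fast in‑memory snapshot of the model state (with real GPU tensors this would be a device‑to‑host copy), then write it to storage in a background thread so the slow I/O overlaps with subsequent training steps. This is a minimal illustration, not 360's implementation; the function name and byte‑string "tensors" are stand‑ins.

```python
import pickle
import threading

def async_checkpoint(state_dict, path):
    """Two-stage checkpoint: snapshot in memory (fast, blocking),
    then persist to disk in a background thread (slow, overlapped
    with the next training steps)."""
    # Stage 1: a cheap in-memory copy, so training may mutate the
    # live state immediately after this function returns.
    snapshot = {k: bytes(v) for k, v in state_dict.items()}

    # Stage 2: serialize and write the snapshot off the critical path.
    def _persist():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    t = threading.Thread(target=_persist, daemon=True)
    t.start()
    return t  # caller may join() before starting the next checkpoint

# Training resumes as soon as async_checkpoint() returns.
writer = async_checkpoint({"layer0.weight": b"\x00" * 16}, "/tmp/ckpt.bin")
writer.join()
```

In a real trainer, stage 1 would copy into pinned host memory and stage 2 would shard the write across the bonded storage NICs; only stage 1 blocks the training loop.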
Eight GPUs are fully interconnected via six NVSwitch chips, providing up to 600 GB/s (A100) or 400 GB/s (A800) NVLink bandwidth; tests show NVLink is not a bottleneck for thousand‑GPU training.
Four 200 Gb/s Mellanox ConnectX‑6 NICs are used; each sits on a PCIe Gen4 x16 link (~32 GB/s), which is enough to drive a 200 Gb/s (25 GB/s) NIC at full rate. Experiments showed four NICs deliver performance similar to eight while cutting cost, and enabling GPUDirect RDMA improved training speed by up to 50 %.
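With NCCL‑based training, GPUDirect RDMA and NIC selection are typically controlled through environment variables set before the process group is initialized. The variable names below are NCCL's own; the device names and values are illustrative for a four‑NIC node like the one described, not 360's actual configuration.

```python
import os

# GPUDirect RDMA lets the IB NICs read/write GPU memory directly,
# skipping a bounce copy through host RAM.
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1,mlx5_2,mlx5_3"  # the four data-plane NICs
os.environ["NCCL_NET_GDR_LEVEL"] = "PHB"  # allow GDR when GPU and NIC share a PCIe host bridge
os.environ["NCCL_IB_GID_INDEX"] = "3"     # RoCE v2 GID index (RoCE deployments only)
```

These must be exported in every rank's environment (e.g. via the launcher) before the first collective runs.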
Network Construction
The center’s traffic pattern emphasizes east‑west (intra‑data‑center) flow, with north‑south as secondary.
The architecture follows the NVIDIA DGX SuperPOD (A100) design: each Scalable Unit (SU) contains 200 A800 GPUs and 4 leaf switches; the leaf and spine layers are fully meshed, so the cluster survives a single spine‑switch failure.
GPU communication paths: intra‑node NVLink, intra‑SU same‑leaf, intra‑SU cross‑spine, inter‑SU cross‑spine. Scaling beyond 200 GPUs requires adding a third‑level core compute switch.
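A topology‑aware scheduler can rank candidate placements by how expensive the paths between communicating GPUs are, preferring the cheapest tier a job fits in. The sketch below uses the four path tiers listed above; the numeric costs are illustrative, not measured values.

```python
# Communication "distance" between two GPUs, cheapest to most
# expensive, following the path tiers above.
PATH_COST = {
    "intra_node_nvlink": 0,
    "intra_su_same_leaf": 1,
    "intra_su_cross_spine": 2,
    "inter_su_cross_spine": 3,
}

def placement_cost(pair_tiers):
    """Total cost of a candidate placement, given the path tier
    used by each communicating GPU pair."""
    return sum(PATH_COST[tier] for tier in pair_tiers)

# A job kept inside one node beats one spread across SUs.
assert placement_cost(["intra_node_nvlink"] * 4) < placement_cost(["inter_su_cross_spine"] * 4)
```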
Kubernetes Cluster Construction
Scheduling Capabilities
Gang scheduling: "all‑or‑nothing" placement of a job's workers to avoid resource deadlock.
BinPack scheduling: packs fragmented tasks onto the same node to reduce fragmentation.
Priority & preemption: six priority levels (P0–P5), with high‑priority tasks preempting low‑priority ones.
Network‑topology‑aware scheduling: places communication‑intensive tasks in high‑bandwidth, low‑latency zones, yielding >20 % gains in some cases.
Delayed scheduling: prevents a stream of low‑priority tasks from starving large‑resource jobs.
Heterogeneous‑compute scheduling: supports NVIDIA GPUs, Ascend chips, and both x86 and ARM architectures.
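The gang‑scheduling idea above can be sketched as an all‑or‑nothing admission check: either every worker in the gang gets its GPUs, or nothing is allocated. This is a minimal first‑fit illustration, not the actual scheduler plugin; the node and worker names are made up.

```python
def gang_schedule(job_gpus_needed, free_gpus_per_node):
    """All-or-nothing admission: either every worker in the gang gets
    its GPUs, or nothing is allocated. This avoids the deadlock where
    two large jobs each hold half the cluster and neither can start."""
    allocation = {}
    remaining = dict(free_gpus_per_node)
    for worker, need in job_gpus_needed.items():
        # First-fit: find any node with enough free GPUs.
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # the gang cannot be placed; allocate nothing
        remaining[node] -= need
        allocation[worker] = node
    return allocation

# Two 8-GPU workers fit on two 8-GPU nodes; a third worker would not.
print(gang_schedule({"w0": 8, "w1": 8}, {"nodeA": 8, "nodeB": 8}))
```

In Kubernetes this check runs before any pod in the group is bound, so a half‑placed job never holds GPUs while waiting for peers.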
Network Solutions
Both RoCE v2 and InfiniBand are deployed. Two storage NICs (lan0, lan1) are bonded for management traffic; four IB NICs (lan2‑lan5) handle data‑plane traffic.
Network‑operator components include mofed (the Mellanox OFED driver), rdma‑shared‑device‑plugin, and a secondary CNI stack (multus‑cni, container‑networking‑plugins, whereabouts) to expose mlx devices to pods.
Training and Inference Acceleration
QLM Training Acceleration
Qihoo Large Language Model (QLM) is a Megatron‑LM‑based training framework optimized for the thousand‑GPU cluster, achieving >47 % MFU for MoE models and >56 % for dense models, with dense training reaching 175 TFLOPS per GPU (an 8× improvement).
Supports >1,000‑GPU training and long‑text (360 K tokens) workloads.
Seamless conversion from Hugging Face formats.
Real‑time visual monitoring of loss, learning‑rate, performance.
Comprehensive profiling and model evaluation tools.
Flexible fine‑tuning capabilities.
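For context on the MFU figures above: MFU (Model FLOPs Utilization) is the model's achieved FLOPs throughput divided by the hardware's peak. The sketch below uses the common ~6 × parameters FLOPs‑per‑token estimate for dense transformer training and the A100/A800 BF16 peak of 312 TFLOPS; the 7B / 4170 tokens‑per‑second inputs are illustrative, not QLM's published configuration.

```python
def mfu(params_b, tokens_per_sec_per_gpu, peak_tflops=312.0):
    """Model FLOPs Utilization: achieved model FLOPs over hardware peak.
    Uses the ~6 * params FLOPs-per-token estimate for dense transformer
    training (forward + backward). Default peak is A100/A800 BF16."""
    achieved_tflops = 6 * params_b * 1e9 * tokens_per_sec_per_gpu / 1e12
    return achieved_tflops / peak_tflops

# An illustrative dense run: ~175 achieved TFLOPS -> ~56% MFU.
print(round(mfu(7, 4170), 3))
```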
GLLM Inference Acceleration
Gaia Large Language Model (GLLM) is a multi‑platform inference engine (NVIDIA, Ascend, etc.) that outperforms vLLM by more than 10 %.
Continuous batching for higher GPU utilization.
PagedAttention to improve memory efficiency for long contexts.
PrefixCache to reduce latency for repeated prefixes.
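The continuous‑batching technique above can be sketched as a decode loop where finished sequences free their batch slot immediately and waiting requests join at the very next step, instead of the whole batch draining before new work is admitted. This is a toy model of the scheduling policy only; request ids and token counts are made up.

```python
def continuous_batching(requests, max_batch=4, steps=100):
    """Continuous (in-flight) batching sketch. `requests` maps a
    request id to the number of tokens it still needs to generate."""
    waiting = dict(requests)
    running, finished = {}, []
    for _ in range(steps):
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popitem()
            running[rid] = n
        if not running:
            break
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                finished.append(rid)  # slot is freed this same step
                del running[rid]
    return finished

print(continuous_batching({"a": 2, "b": 5, "c": 1}))
```

Because short requests exit early, average GPU occupancy per step stays near `max_batch` even when sequence lengths vary widely, which is where the utilization gain comes from.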
AI Platform Construction
Core Capabilities
Interactive modeling : Jupyter, VSCode, multi‑tenant collaboration.
Distributed training : supports thousand‑GPU jobs, 3D parallelism, auto‑healing.
Online deployment : auto‑scaling APIs, real‑time monitoring.
Resource pool management : fair multi‑tenant scheduling, load balancing, elastic scaling.
Optimization mechanisms : low‑efficiency task killing, pre‑emptive scheduling, improving overall utilization by >25 %.
Visualization
Cluster resource dashboards (node, network, disk, IO).
Training resource monitors (GPU/CPU usage, memory, temperature, SM utilization).
Task‑level metrics (duration, success rate, TFLOPS, token throughput).
Training process charts (loss, metrics, hyper‑parameter comparison, gradient visualizations).
Fault Tolerance
QihooSMI detects and self‑heals runtime, hardware, network, and slow‑node failures: it can pinpoint a faulty NIC, GPU ECC errors, or a slow node within minutes, then automatically isolate the fault and recover the cluster, keeping training jobs running without manual intervention.
Conclusion and Outlook
The 360 AI Computing Center’s thousand‑GPU cluster combines AI, big data, heterogeneous computing and high‑performance networking to meet large‑scale AI workloads. Future work will expand cluster size, enhance heterogeneous support, improve platform visualization and monitoring, and continue energy‑efficient innovations.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.