How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling
This article details the design and implementation of 360’s AI Computing Center, covering server selection, network topology, Kubernetes scheduling, training and inference acceleration, and the AI platform’s core, visualization, and fault‑tolerance capabilities for large‑scale AI workloads.
Infrastructure Construction
360 AI Computing Center integrates AI, heterogeneous computing, big data, high‑performance networking and an AI development platform to provide efficient, intelligent compute power for complex AI tasks.
Server Selection
Example topology for A100/A800 includes 2 CPUs, 2 storage NICs, 4 PCIe Gen4 Switch chips, 6 NVSwitch chips, 8 GPUs, and 4 InfiniBand NICs.
Two 25 Gb/s storage NICs are bonded (bond mode 4, LACP) for 50 Gb/s of aggregate bandwidth. On top of this, software optimizations such as distributed checkpoint storage and multi‑stage asynchronous saving cut checkpoint time from 383 s to 5 s, a more than 70‑fold improvement.
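The idea behind multi‑stage asynchronous saving can be sketched as follows: take a fast in‑memory snapshot of the model state (with real GPU tensors this would be a device‑to‑host copy), then write it to storage in a background thread so the slow I/O overlaps with subsequent training steps. This is a minimal illustration, not 360's implementation; the function name and byte‑string "tensors" are stand‑ins.

```python
import pickle
import threading

def async_checkpoint(state_dict, path):
    """Two-stage checkpoint: snapshot in memory (fast, blocking),
    then persist to disk in a background thread (slow, overlapped
    with the next training steps)."""
    # Stage 1: a cheap in-memory copy, so training may mutate the
    # live state immediately after this function returns.
    snapshot = {k: bytes(v) for k, v in state_dict.items()}

    # Stage 2: serialize and write the snapshot off the critical path.
    def _persist():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    t = threading.Thread(target=_persist, daemon=True)
    t.start()
    return t  # caller may join() before starting the next checkpoint

# Training resumes as soon as async_checkpoint() returns.
writer = async_checkpoint({"layer0.weight": b"\x00" * 16}, "/tmp/ckpt.bin")
writer.join()
```

In a real trainer, stage 1 would copy into pinned host memory and stage 2 would shard the write across the bonded storage NICs; only stage 1 blocks the training loop.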
Eight GPUs are fully interconnected via six NVSwitch chips, providing up to 600 GB/s (A100) or 400 GB/s (A800) NVLink bandwidth; tests show NVLink is not a bottleneck for thousand‑GPU training.
Four 200 Gb/s Mellanox ConnectX‑6 NICs are used; each sits on a PCIe Gen4 x16 link (~32 GB/s), which is enough to drive a 200 Gb/s (25 GB/s) NIC at full rate. Experiments showed four NICs deliver performance similar to eight while cutting cost, and enabling GPUDirect RDMA improved training speed by up to 50 %.
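With NCCL‑based training, GPUDirect RDMA and NIC selection are typically controlled through environment variables set before the process group is initialized. The variable names below are NCCL's own; the device names and values are illustrative for a four‑NIC node like the one described, not 360's actual configuration.

```python
import os

# GPUDirect RDMA lets the IB NICs read/write GPU memory directly,
# skipping a bounce copy through host RAM.
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1,mlx5_2,mlx5_3"  # the four data-plane NICs
os.environ["NCCL_NET_GDR_LEVEL"] = "PHB"  # allow GDR when GPU and NIC share a PCIe host bridge
os.environ["NCCL_IB_GID_INDEX"] = "3"     # RoCE v2 GID index (RoCE deployments only)
```

These must be exported in every rank's environment (e.g. via the launcher) before the first collective runs.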
Network Construction
The center’s traffic pattern emphasizes east‑west (intra‑data‑center) flow, with north‑south as secondary.
The architecture follows the NVIDIA DGX SuperPOD (A100) design: each Scalable Unit (SU) contains 200 A800 GPUs and 4 leaf switches; the leaf and spine layers are fully meshed, so the cluster survives a single spine‑switch failure.
GPU communication paths: intra‑node NVLink, intra‑SU same‑leaf, intra‑SU cross‑spine, inter‑SU cross‑spine. Scaling beyond 200 GPUs requires adding a third‑level core compute switch.
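A topology‑aware scheduler can rank candidate placements by how expensive the paths between communicating GPUs are, preferring the cheapest tier a job fits in. The sketch below uses the four path tiers listed above; the numeric costs are illustrative, not measured values.

```python
# Communication "distance" between two GPUs, cheapest to most
# expensive, following the path tiers above.
PATH_COST = {
    "intra_node_nvlink": 0,
    "intra_su_same_leaf": 1,
    "intra_su_cross_spine": 2,
    "inter_su_cross_spine": 3,
}

def placement_cost(pair_tiers):
    """Total cost of a candidate placement, given the path tier
    used by each communicating GPU pair."""
    return sum(PATH_COST[tier] for tier in pair_tiers)

# A job kept inside one node beats one spread across SUs.
assert placement_cost(["intra_node_nvlink"] * 4) < placement_cost(["inter_su_cross_spine"] * 4)
```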
Kubernetes Cluster Construction
Scheduling Capabilities
Gang scheduling: "all‑or‑nothing" placement of a job's workers to avoid resource deadlock.
BinPack scheduling: packs fragmented tasks onto the same node to reduce fragmentation.
Priority & preemption: six priority levels (P0–P5), with high‑priority tasks preempting low‑priority ones.
Network‑topology‑aware scheduling: places communication‑intensive tasks in high‑bandwidth, low‑latency zones, yielding >20 % gains in some cases.
Delayed scheduling: prevents a stream of low‑priority tasks from starving large‑resource jobs.
Heterogeneous‑compute scheduling: supports NVIDIA GPUs, Ascend chips, and both x86 and ARM architectures.
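The gang‑scheduling idea above can be sketched as an all‑or‑nothing admission check: either every worker in the gang gets its GPUs, or nothing is allocated. This is a minimal first‑fit illustration, not the actual scheduler plugin; the node and worker names are made up.

```python
def gang_schedule(job_gpus_needed, free_gpus_per_node):
    """All-or-nothing admission: either every worker in the gang gets
    its GPUs, or nothing is allocated. This avoids the deadlock where
    two large jobs each hold half the cluster and neither can start."""
    allocation = {}
    remaining = dict(free_gpus_per_node)
    for worker, need in job_gpus_needed.items():
        # First-fit: find any node with enough free GPUs.
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # the gang cannot be placed; allocate nothing
        remaining[node] -= need
        allocation[worker] = node
    return allocation

# Two 8-GPU workers fit on two 8-GPU nodes; a third worker would not.
print(gang_schedule({"w0": 8, "w1": 8}, {"nodeA": 8, "nodeB": 8}))
```

In Kubernetes this check runs before any pod in the group is bound, so a half‑placed job never holds GPUs while waiting for peers.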
Network Solutions
Both RoCE v2 and InfiniBand are deployed. Two storage NICs (lan0, lan1) are bonded for management traffic; four IB NICs (lan2‑lan5) handle data‑plane traffic.
Network‑operator components include mofed (the Mellanox OFED driver), rdma‑shared‑device‑plugin, and a secondary CNI stack (multus‑cni, container‑networking‑plugins, whereabouts) to expose mlx devices to pods.
Training and Inference Acceleration
QLM Training Acceleration
Qihoo Large Language Model (QLM) is a Megatron‑LM‑based training framework optimized for the thousand‑GPU cluster, achieving >47 % MFU for MoE models and >56 % for dense models, with dense training reaching 175 TFLOPS per GPU (an 8× improvement).
Supports >1,000‑GPU training and long‑text (360 K tokens) workloads.
Seamless conversion from Hugging Face formats.
Real‑time visual monitoring of loss, learning‑rate, performance.
Comprehensive profiling and model evaluation tools.
Flexible fine‑tuning capabilities.
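For context on the MFU figures above: MFU (Model FLOPs Utilization) is the model's achieved FLOPs throughput divided by the hardware's peak. The sketch below uses the common ~6 × parameters FLOPs‑per‑token estimate for dense transformer training and the A100/A800 BF16 peak of 312 TFLOPS; the 7B / 4170 tokens‑per‑second inputs are illustrative, not QLM's published configuration.

```python
def mfu(params_b, tokens_per_sec_per_gpu, peak_tflops=312.0):
    """Model FLOPs Utilization: achieved model FLOPs over hardware peak.
    Uses the ~6 * params FLOPs-per-token estimate for dense transformer
    training (forward + backward). Default peak is A100/A800 BF16."""
    achieved_tflops = 6 * params_b * 1e9 * tokens_per_sec_per_gpu / 1e12
    return achieved_tflops / peak_tflops

# An illustrative dense run: ~175 achieved TFLOPS -> ~56% MFU.
print(round(mfu(7, 4170), 3))
```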
GLLM Inference Acceleration
Gaia Large Language Model (GLLM) is a multi‑platform inference engine (NVIDIA, Ascend, etc.) that outperforms vLLM by more than 10 %.
Continuous batching for higher GPU utilization.
PagedAttention to improve memory efficiency for long contexts.
PrefixCache to reduce latency for repeated prefixes.
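The continuous‑batching technique above can be sketched as a decode loop where finished sequences free their batch slot immediately and waiting requests join at the very next step, instead of the whole batch draining before new work is admitted. This is a toy model of the scheduling policy only; request ids and token counts are made up.

```python
def continuous_batching(requests, max_batch=4, steps=100):
    """Continuous (in-flight) batching sketch. `requests` maps a
    request id to the number of tokens it still needs to generate."""
    waiting = dict(requests)
    running, finished = {}, []
    for _ in range(steps):
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popitem()
            running[rid] = n
        if not running:
            break
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                finished.append(rid)  # slot is freed this same step
                del running[rid]
    return finished

print(continuous_batching({"a": 2, "b": 5, "c": 1}))
```

Because short requests exit early, average GPU occupancy per step stays near `max_batch` even when sequence lengths vary widely, which is where the utilization gain comes from.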
AI Platform Construction
Core Capabilities
Interactive modeling : Jupyter, VSCode, multi‑tenant collaboration.
Distributed training : supports thousand‑GPU jobs, 3D parallelism, auto‑healing.
Online deployment : auto‑scaling APIs, real‑time monitoring.
Resource pool management : fair multi‑tenant scheduling, load balancing, elastic scaling.
Optimization mechanisms : low‑efficiency task killing, pre‑emptive scheduling, improving overall utilization by >25 %.
Visualization
Cluster resource dashboards (node, network, disk, IO).
Training resource monitors (GPU/CPU usage, memory, temperature, SM utilization).
Task‑level metrics (duration, success rate, TFLOPS, token throughput).
Training process charts (loss, metrics, hyper‑parameter comparison, gradient visualizations).
Fault Tolerance
QihooSMI detects and self‑heals runtime, hardware, network, and slow‑node failures: it can pinpoint a faulty NIC, GPU ECC errors, or a slow node within minutes, then automatically isolate the fault and recover the cluster, keeping training jobs running without manual intervention.
Conclusion and Outlook
The 360 AI Computing Center’s thousand‑GPU cluster combines AI, big data, heterogeneous computing and high‑performance networking to meet large‑scale AI workloads. Future work will expand cluster size, enhance heterogeneous support, improve platform visualization and monitoring, and continue energy‑efficient innovations.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.