Tagged articles

GPU Cluster

17 articles · Page 1 of 1

Jun 29, 2026 · Artificial Intelligence

Open‑Source AI‑Infra Ops Agent Benchmark Powered by Hundreds of Billions of Real Data

The article introduces AISHPerf, the first open‑source benchmark for AI‑infra operations agents, built on nearly a hundred billion real‑world ops records. It details the data pipeline, multi‑layer coverage, and evaluation metrics, reports experimental results showing that current models lag behind human experts, and outlines plans to expand and refine the benchmark.

AI Ops · Evaluation Metrics · Fault Injection

0 likes · 16 min read

Machine Heart

May 21, 2026 · Industry Insights

How ZCube Redefines 20‑Year‑Old Networking Logic to Boost GPU Throughput by 15%

ZCube, a new flat networking architecture deployed by Zhipu in its GLM‑5.1 inference cluster, eliminates structural congestion, delivering a 15% throughput gain, 40.6% latency reduction, and one‑third lower hardware cost without adding GPUs, signaling a shift from raw compute to system efficiency in AI infrastructure.

AI networking · GPU Cluster · MRC protocol

0 likes · 15 min read

21CTO

May 7, 2026 · Industry Insights

Why Musk Is Merging xAI into SpaceXAI to Weaponize Compute Against OpenAI

Elon Musk dissolved xAI, rebranded it as SpaceXAI, and granted Anthropic access to the 220,000‑GPU Colossus 1 supercomputer, a move framed as a strategic strike to undermine OpenAI by leveraging massive orbital‑grade compute power.

AI supercomputing · Anthropic · Colossus

0 likes · 12 min read

Raymond Ops

Nov 4, 2025 · Artificial Intelligence

How to Deploy GPUStack with Docker for Scalable AI Model Serving

This guide walks you through installing NVIDIA drivers and Docker, configuring the NVIDIA Container Toolkit, and deploying GPUStack in Docker to manage heterogeneous GPU resources, run large language, multimodal, diffusion, and embedding models, and scale from a single node to a multi‑node GPU cluster.

AI model deployment · Docker · GPU Cluster

0 likes · 15 min read

MaGe Linux Operations

Jun 3, 2025 · Artificial Intelligence

How to Deploy GPUStack with Docker for Scalable AI Model Serving

This guide walks you through installing NVIDIA drivers, Docker, and the NVIDIA Container Toolkit, then shows step‑by‑step how to run GPUStack in Docker, expand a GPU cluster, and serve large language, multimodal, diffusion, and embedding models with OpenAI‑compatible APIs.

AI model deployment · Docker · GPU Cluster

0 likes · 15 min read

Baidu Intelligent Cloud Tech Hub

May 23, 2025 · Artificial Intelligence

How Baidu’s Kunlun Supernode Redefines AI Compute Density and Performance

This article explains how Baidu’s Kunlun supernode, built on high‑density liquid‑cooled cabinets and a modular 1U 4‑card design, breaks traditional 8‑card limits, boosts compute density four‑fold, improves power and cooling efficiency, and provides a scalable foundation for large‑model AI training and inference.

AI Infrastructure · GPU Cluster · High-performance computing

0 likes · 13 min read

Architects' Tech Alliance

Apr 26, 2025 · Industry Insights

Can Huawei’s CloudMatrix 384 Outpace Nvidia’s GB200? A Deep Dive into China’s AI Supernode

The article provides a detailed technical analysis of Huawei's CloudMatrix 384 AI supernode: its 384 Ascend 910C chips, 300 PFLOPS of BF16 performance, massive memory and bandwidth, power consumption, and scale‑up and scale‑out optical networking. It also compares the system with Nvidia's GB200 NVL72 in architecture, cost, and energy efficiency.

AI hardware · CloudMatrix · GPU Cluster

0 likes · 12 min read

ByteDance Cloud Native

Mar 20, 2025 · Artificial Intelligence

How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to use the AIBrix distributed inference platform to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes, covering cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

AIBrix · DeepSeek-R1 · Distributed Inference

0 likes · 14 min read

DevOps

Nov 27, 2024 · Artificial Intelligence

Elon Musk’s Colossus Supercomputer: Building 100,000 GPUs in 122 Days and Its Impact on AI Infrastructure

The article analyzes Elon Musk’s Colossus AI supercomputer—its 100,000 NVIDIA H100 GPUs, record‑fast 122‑day construction, vertical‑integration strategy, and the broader implications for U.S. AI infrastructure dominance and China’s competing challenges in funding and chip supply.

AI Infrastructure · AI Strategy · Elon Musk

0 likes · 13 min read

360 Tech Engineering

Oct 15, 2024 · Artificial Intelligence

Implementation and Optimization of 360 AI Compute Center: Infrastructure, Network, Kubernetes, and Training/Inference Acceleration

The article details the design and deployment of 360's AI Compute Center, covering GPU server selection, high‑performance networking, Kubernetes‑based cluster management, advanced scheduling, training and inference acceleration techniques, and a comprehensive AI development platform with visualization and fault‑tolerance features.

AI Infrastructure · Distributed Computing · GPU Cluster

0 likes · 21 min read

360 Zhihui Cloud Developer

Oct 11, 2024 · Artificial Intelligence

How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling

This article details the design and implementation of 360’s AI Computing Center, covering server selection, network topology, Kubernetes scheduling, training and inference acceleration, and the AI platform’s core, visualization, and fault‑tolerance capabilities for large‑scale AI workloads.

AI Infrastructure · GPU Cluster · distributed training

0 likes · 22 min read

Architects' Tech Alliance

May 19, 2024 · Industry Insights

How to Build a 10,000‑GPU Supercluster: Core Design Principles and Architecture

This article analyzes the challenges and solutions for constructing a super‑large GPU training cluster, outlining five fundamental design principles, a four‑layer plus one‑domain architecture, and practical considerations for hardware, networking, and operational reliability in AI workloads.

AI training · GPU Cluster · High-performance computing

0 likes · 8 min read

Architects' Tech Alliance

May 16, 2024 · Industry Insights

How to Build a Multi‑Petabyte AI Super‑Cluster: Scaling Beyond Ten‑Thousand GPUs

This article analyzes the architectural upgrades required for ultra‑large AI clusters, covering single‑GPU performance, super‑node scaling, DPU‑based heterogeneous computing, power‑efficiency, high‑throughput storage, and robust high‑speed networking to support trillion‑parameter model training and inference.

AI · DPU · GPU Cluster

0 likes · 17 min read

Baidu Tech Salon

May 11, 2023 · Artificial Intelligence

Inside Baidu’s High‑Performance GPU Cluster: Powering the Next‑Gen AI Models

The article details Baidu's development of a massive high‑performance GPU/IB cluster, its architectural design, the challenges of training trillion‑parameter models, and how the integrated AI stack—spanning hardware, framework, and resource management—overcomes compute, memory, and communication bottlenecks to accelerate large‑model training.

AI Infrastructure · Baidu AI Base · GPU Cluster

0 likes · 17 min read

Baidu Geek Talk

May 10, 2023 · Artificial Intelligence

Baidu's AI Infrastructure for Large-Scale LLM Training: Architecture, Challenges, and Optimization

Baidu’s AI infrastructure combines a massive InfiniBand‑linked GPU cluster, Kunlun chips, the PaddlePaddle framework, and the Wenxin model suite with 4D hybrid parallelism, elastic fault tolerance, and a two‑stage training pipeline to overcome computation, memory, and communication walls, delivering world‑leading MLPerf performance for large‑scale LLMs.

GPU Cluster · InfiniBand · Large Language Model

0 likes · 15 min read

Tencent Cloud Developer

Apr 14, 2023 · Artificial Intelligence

Tencent Cloud's Next-Generation HCC High-Performance Computing Cluster for Large Model Training

Tencent Cloud's new HCC high‑performance computing cluster triples the performance of the previous generation. It offers 3.2 Tbps of server bandwidth, with Xingsha servers and NVIDIA H800 GPUs delivering up to 1979 TFLOPS, while its Xingmai 3.2 T ETH RDMA network, TB‑level storage via COS + GooseFS, and multi‑form access (bare metal, cloud servers, containers, functions) enable efficient large‑model training.

AI computing · GPU Cluster · High-performance computing

0 likes · 9 min read

DataFunSummit

Mar 9, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's multimodal content understanding platform, covering its massive data challenges, heterogeneous model support, standardized pipelines, platformization, workflow architecture, GPU heterogeneous cluster management, resource scheduling, performance optimization, and full‑stack monitoring to achieve stable, low‑latency AI services at scale.

GPU Cluster · Multimodal AI · Real-time inference

0 likes · 18 min read
