Baidu Intelligent Cloud Tech Hub
Author

We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.

130 articles · 0 likes · 99 views · 0 comments
Recent Articles

Latest from Baidu Intelligent Cloud Tech Hub

Shows up to 100 of the most recent articles.
Baidu Intelligent Cloud Tech Hub
Jul 25, 2025 · Operations

How Baidu’s Lingxi Agent Uses LLMs to Automate Network Fault Diagnosis

This article details Baidu's evolution from manual network fault analysis to a multi‑agent AI platform, describing how the Lingxi intelligent agent leverages large language models, MCP tools, and design patterns to automate latency queries, generate analysis reports, and integrate with existing monitoring services.

AI agents · MCP protocol · network operations
0 likes · 19 min read
Baidu Intelligent Cloud Tech Hub
May 23, 2025 · Artificial Intelligence

How Baidu’s Kunlun Supernode Redefines AI Compute Density and Performance

This article explains how Baidu’s Kunlun supernode, built on high‑density liquid‑cooled cabinets and a modular 1U 4‑card design, breaks traditional 8‑card limits, boosts compute density four‑fold, improves power and cooling efficiency, and provides a scalable foundation for large‑model AI training and inference.

AI infrastructure · GPU Cluster · Liquid cooling
0 likes · 13 min read
Baidu Intelligent Cloud Tech Hub
May 16, 2025 · Artificial Intelligence

How Baidu Cloud Achieved 4µs End-to-End Latency for Large-Scale PD Inference

Baidu Intelligent Cloud built an HPN cluster with 4µs end-to-end latency, optimized traffic management and communication operators, and introduced dynamic expert balancing to improve the performance of large‑scale prefill/decode (PD)‑separated inference services, showing how closely network infrastructure can be integrated with AI workloads.

AI inference · All-to-All · HPN
0 likes · 14 min read
Baidu Intelligent Cloud Tech Hub
Apr 25, 2025 · Operations

How RapidFS Accelerates AI Model Training with 10 TiB/s Storage Performance

The article explains how RapidFS, a near‑compute storage acceleration solution built on BOS object storage, delivers up to 10 TiB/s throughput for massive AI model training, detailing its architecture, deployment on a 30,000‑card Kunlun cluster, and performance test results that show linear scaling from 20 to 70 nodes.

AI training · Performance Testing · RapidFS
0 likes · 6 min read
Baidu Intelligent Cloud Tech Hub
Apr 18, 2025 · Operations

How Baidu’s AI‑Powered Digital Immune System Reinvents SRE Risk Management

This article explains why modern SRE teams need a digital immune system, describes Baidu’s data‑driven approach to improve system resilience, outlines the three‑phase evolution from digital transformation to AI‑enhanced risk mining, and shares concrete results and future directions for sustainable operations.

AI · Digital Immune System · SRE
0 likes · 15 min read
Baidu Intelligent Cloud Tech Hub
Mar 10, 2025 · Artificial Intelligence

How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training

The article examines how Baidu Baige evolved AI training stability from manual operations to precise engineering, detailing metrics, fault‑perception techniques, eBPF‑based diagnostics, multi‑level restart strategies, and trigger‑based checkpointing that together achieve sub‑minute recovery and 99.5% effective training time on massive GPU clusters.

AI training · Large-Scale Clusters · checkpointing
0 likes · 25 min read
Baidu Intelligent Cloud Tech Hub
Mar 3, 2025 · Cloud Computing

How Baidu Cloud Optimizes GPU Servers for AI Workloads

This article explains the design and implementation of GPU cloud servers, covering data processing pipelines, hardware selection, topology, interconnect technologies, virtualization, multi‑GPU communication methods, and Baidu's practical solutions for both virtualized and bare‑metal instances to boost AI inference and training performance.

AI · GPU · NVLink
0 likes · 29 min read