Tagged articles
6 articles
Page 1 of 1
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Mar 10, 2025 · Artificial Intelligence

How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training

The article examines how Baidu Baige evolved AI training stability from manual operations to precise engineering, detailing metrics, fault‑perception techniques, eBPF‑based diagnostics, multi‑level restart strategies, and trigger‑based checkpointing that together achieve sub‑minute recovery and 99.5% effective training time on massive GPU clusters.

AI trainingLarge-Scale Clusterscheckpointing
0 likes · 25 min read
How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training
Alibaba Cloud Observability
Alibaba Cloud Observability
Jan 13, 2025 · Cloud Native

Alibaba Cloud’s Guide to Stable Large‑Scale Kubernetes After OpenAI Crash

After the OpenAI outage caused massive Kubernetes API overload, Alibaba Cloud’s Container Service and Observability teams detail how they reinforce large‑scale K8s clusters with high‑availability control‑plane design, optimized Prometheus probing, out‑of‑band monitoring, and best‑practice guidelines for capacity planning, safe releases, and rapid incident response.

Alibaba CloudKubernetesLarge-Scale Clusters
0 likes · 21 min read
Alibaba Cloud’s Guide to Stable Large‑Scale Kubernetes After OpenAI Crash
Alibaba Cloud Developer
Alibaba Cloud Developer
Jan 8, 2025 · Cloud Native

Ensuring Massive Kubernetes Cluster Stability: Lessons from the OpenAI Outage

Using the recent OpenAI service disruption as a case study, this article examines the stability challenges of large‑scale Kubernetes deployments and details how Alibaba Cloud Container Service and its Prometheus‑based observability solutions enhance reliability through high‑availability architecture, optimized exporters, out‑of‑band data links, and best‑practice guidelines.

Alibaba CloudLarge-Scale ClustersPrometheus
0 likes · 22 min read
Ensuring Massive Kubernetes Cluster Stability: Lessons from the OpenAI Outage
Alibaba Cloud Infrastructure
Alibaba Cloud Infrastructure
Dec 25, 2024 · Cloud Native

Ensuring Stability of Large‑Scale Kubernetes Clusters: Lessons from the OpenAI Incident and Alibaba Cloud Practices

This article analyses the OpenAI large‑scale Kubernetes outage, explains the inherent risks of massive K8s clusters, and presents Alibaba Cloud's architectural enhancements, observability improvements, and best‑practice guidelines to achieve high‑availability and reliable operation of thousands‑node Kubernetes environments.

Cloud NativeKubernetesLarge-Scale Clusters
0 likes · 21 min read
Ensuring Stability of Large‑Scale Kubernetes Clusters: Lessons from the OpenAI Incident and Alibaba Cloud Practices
Architect's Guide
Architect's Guide
Oct 22, 2022 · Big Data

Meituan’s Kafka Optimizations: Reducing Read/Write Latency and Managing Large‑Scale Clusters

This article describes how Meituan’s data platform tackles the growing challenges of a 15,000‑plus‑node Kafka deployment by detailing current bottlenecks, latency‑reduction techniques across application and system layers, large‑scale cluster management strategies, and future directions for robustness and cloud‑native migration.

Big DataKafkaLarge-Scale Clusters
0 likes · 21 min read
Meituan’s Kafka Optimizations: Reducing Read/Write Latency and Managing Large‑Scale Clusters
JD Retail Technology
JD Retail Technology
Jul 20, 2018 · Cloud Native

How JD Built the World’s Largest Kubernetes Cluster to Support Trillion‑Scale E‑commerce Transactions

The article describes JD’s experience of redesigning Kubernetes at massive scale, detailing the JDOS2.0 platform, custom DNS and load‑balancing, the Archimedes scheduler, API and controller optimizations, and operational lessons learned from running tens of thousands of nodes in production.

JDOSKubernetesLarge-Scale Clusters
0 likes · 16 min read
How JD Built the World’s Largest Kubernetes Cluster to Support Trillion‑Scale E‑commerce Transactions