Tagged articles

Large-Scale Clusters

6 articles

Mar 10, 2025 · Artificial Intelligence

How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training

The article examines how Baidu Baige evolved AI training stability from manual operations to precise engineering, detailing metrics, fault‑perception techniques, eBPF‑based diagnostics, multi‑level restart strategies, and trigger‑based checkpointing that together achieve sub‑minute recovery and 99.5% effective training time on massive GPU clusters.

AI training · Large-Scale Clusters · checkpointing

0 likes · 25 min read

Alibaba Cloud Observability

Jan 13, 2025 · Cloud Native

Alibaba Cloud’s Guide to Stable Large‑Scale Kubernetes After OpenAI Crash

After the OpenAI outage caused massive Kubernetes API overload, Alibaba Cloud’s Container Service and Observability teams detail how they reinforce large‑scale K8s clusters with high‑availability control‑plane design, optimized Prometheus probing, out‑of‑band monitoring, and best‑practice guidelines for capacity planning, safe releases, and rapid incident response.

Alibaba Cloud · Cluster stability · Kubernetes

0 likes · 21 min read

Alibaba Cloud Developer

Jan 8, 2025 · Cloud Native

Ensuring Massive Kubernetes Cluster Stability: Lessons from the OpenAI Outage

Using the recent OpenAI service disruption as a case study, this article examines the stability challenges of large‑scale Kubernetes deployments and details how Alibaba Cloud Container Service and its Prometheus‑based observability solutions enhance reliability through high‑availability architecture, optimized exporters, out‑of‑band data links, and best‑practice guidelines.

Alibaba Cloud · Large-Scale Clusters · Observability

0 likes · 22 min read

Alibaba Cloud Infrastructure

Dec 25, 2024 · Cloud Native

Ensuring Stability of Large‑Scale Kubernetes Clusters: Lessons from the OpenAI Incident and Alibaba Cloud Practices

This article analyses the OpenAI large‑scale Kubernetes outage, explains the inherent risks of massive K8s clusters, and presents Alibaba Cloud’s architectural enhancements, observability improvements, and best‑practice guidelines for achieving high availability and reliable operation in Kubernetes environments with thousands of nodes.

Cloud Native · High Availability · Kubernetes

0 likes · 21 min read

Architect's Guide

Oct 22, 2022 · Big Data

Meituan’s Kafka Optimizations: Reducing Read/Write Latency and Managing Large‑Scale Clusters

This article describes how Meituan’s data platform tackles the growing challenges of a Kafka deployment of more than 15,000 nodes, covering current bottlenecks, latency‑reduction techniques across the application and system layers, large‑scale cluster management strategies, and future directions for robustness and cloud‑native migration.

Big Data · Kafka · Large-Scale Clusters

0 likes · 21 min read

JD Retail Technology

Jul 20, 2018 · Cloud Native

How JD Built the World’s Largest Kubernetes Cluster to Support Trillion‑Scale E‑commerce Transactions

The article describes JD’s experience of redesigning Kubernetes at massive scale, detailing the JDOS2.0 platform, custom DNS and load‑balancing, the Archimedes scheduler, API and controller optimizations, and operational lessons learned from running tens of thousands of nodes in production.

Container Orchestration · JDOS · Kubernetes

0 likes · 16 min read
