Alibaba Cloud’s Guide to Stable Large‑Scale Kubernetes After the OpenAI Outage
After the OpenAI outage caused massive Kubernetes API overload, Alibaba Cloud’s Container Service and Observability teams detail how they reinforce large‑scale K8s clusters with high‑availability control‑plane design, optimized Prometheus probing, out‑of‑band monitoring, and best‑practice guidelines for capacity planning, safe releases, and rapid incident response.
01 Preface
Kubernetes (K8s) has become the mainstream infrastructure for running containerized workloads; CNCF surveys show it is the de‑facto standard for container orchestration. The Kubernetes project officially recommends limiting a single cluster to 5,000 nodes, yet OpenAI has pushed clusters to 7,500 nodes, a scale that exposes the bottlenecks and fragility of large clusters.
02 OpenAI Incident Analysis
On December 11, 2024, OpenAI’s ChatGPT, Sora, and API suffered a severe outage. The root cause was a newly deployed observability feature that ran on every node and queried the Kubernetes Resource API, generating enough aggregate load to overwhelm the control plane; once the API servers failed, DNS‑based service discovery broke as well, taking down the data plane.
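A rough load model shows why a per‑node query pattern is dangerous at this scale. The probe interval and response size below are illustrative assumptions, not figures from the incident report; only the ~7,500‑node cluster size comes from public accounts.

```python
# Hypothetical back-of-envelope model: cumulative apiserver load when every
# node in the cluster polls the Kubernetes Resource API independently.

def apiserver_load(nodes: int, probe_interval_s: float, resp_kib: float):
    """Return (requests/sec, egress MiB/sec) generated cluster-wide."""
    rps = nodes / probe_interval_s      # each node issues one poll per interval
    mib_per_s = rps * resp_kib / 1024   # aggregate response bandwidth
    return rps, mib_per_s

# At ~7,500 nodes, even a modest 10 s poll returning a 64 KiB listing
# yields 750 req/s and ~47 MiB/s of sustained API traffic.
rps, bw = apiserver_load(nodes=7500, probe_interval_s=10, resp_kib=64)
print(f"{rps:.0f} req/s, {bw:.1f} MiB/s")
```

The key property is that load grows linearly with node count, so a feature that is harmless in a test cluster can saturate the control plane only at production scale.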
Alibaba Cloud’s Container Service (ACK) and Observability (Prometheus, Telemetry) teams use this case to illustrate their large‑scale K8s stability construction and share mitigation strategies for similar failures.
03 Risks of Large‑Scale K8s Clusters
K8s clusters are distributed systems; any component bottleneck can affect overall stability. The control plane (apiserver, etcd, scheduler, controller‑manager, cloud‑controller‑manager) handles API, scheduling, and resource management, while the data plane (kubelet, kube‑proxy, logging, monitoring, security) runs workloads.
Identifying and optimizing bottlenecks is essential for stable, efficient utilization of cloud resources.
04 Stability Enhancements for Large‑Scale ACK Clusters
4.1 High‑Availability Control Plane
ACK deploys control‑plane components across multiple AZs and nodes with multi‑replica HA. In a 3‑AZ region, SLA is 99.95%; in single‑AZ regions, SLA is 99.5%.
Control‑plane components scale automatically through combined VPA and HPA; etcd is vertically scaled against recommended resource profiles; and the apiserver applies dynamic rate limiting to absorb request surges.
4.2 Optimized Prometheus for Large Clusters
Prometheus adopts a two‑role architecture: Master handles service discovery, scheduling, and config distribution; Worker performs efficient metric collection and relabeling.
Standard resources are fetched via binary Protobuf for performance.
Only one Pod establishes List & Watch with the API server, reducing load.
Optional active‑standby replicas provide HA.
Worker instances scale out based on target count, and because they do not talk directly to the API server, scaling does not increase API load.
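The two‑role idea can be sketched as deterministic target sharding: the master discovers targets once and partitions them across workers, which never contact the API server. The hash‑based scheme below is an illustrative assumption, not ACK's actual implementation.

```python
import hashlib

def assign_targets(targets, workers):
    """Master-side sharding: deterministically assign each scrape target
    to exactly one worker replica, so workers collect without any
    apiserver interaction of their own."""
    shards = {w: [] for w in workers}
    for t in targets:
        idx = int(hashlib.md5(t.encode()).hexdigest(), 16) % len(workers)
        shards[workers[idx]].append(t)
    return shards

targets = [f"10.0.0.{i}:9100" for i in range(8)]
shards = assign_targets(targets, ["worker-0", "worker-1"])
# Every target lands on exactly one worker; adding workers only
# repartitions the same target set, leaving apiserver load unchanged.
```

Because sharding is a pure function of the target list, scaling workers out redistributes collection work without adding a single List & Watch connection.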
4.3 Out‑of‑Band Monitoring
Control‑plane metrics are exported via a dedicated out‑of‑band link, ensuring monitoring data remains available even when the cluster is unhealthy.
Node‑level events (ECS, OS, hardware) are also sent through out‑of‑band channels to SLS event centers.
05 Best Practices for Operating Large‑Scale K8s
5.1 Capacity Planning & Safe Release
Control cluster size through micro‑service decomposition and capacity planning. Adopt gray‑release, rollback, and continuous monitoring practices; abort deployments immediately upon detecting anomalies.
5.2 Pre‑emptive Observability & Alerting
Use ACK control‑plane dashboards and log monitoring to gain real‑time visibility. Enable Alibaba Cloud’s built‑in alert templates for critical conditions such as API‑server overload and SLB bandwidth saturation.
5.3 Development Best Practices
When performing a full LIST, set resourceVersion=0 to serve the response from the apiserver watch cache, or paginate with limit and continue tokens (note that limit is ignored when reading from the cache), so that no single request forces an expensive quorum read against etcd.
Prefer Protobuf over JSON for non‑CRD resources.
Use Informer (LIST+WATCH) instead of frequent LIST calls.
Throttle client LIST/WATCH frequency.
Introduce relay components to aggregate API requests from DaemonSets or ECI pods.
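The paginated‑LIST guidance above can be sketched with a simulated client. `FakeAPIServer` stands in for a real apiserver's limit/continue pagination; the names here are illustrative, not a real client library.

```python
class FakeAPIServer:
    """Stand-in for an apiserver that supports limit/continue pagination."""
    def __init__(self, items):
        self.items = items

    def list(self, limit, cont=None):
        # 'cont' plays the role of the opaque continue token returned
        # by the real apiserver.
        start = int(cont) if cont else 0
        page = self.items[start:start + limit]
        next_cont = str(start + limit) if start + limit < len(self.items) else None
        return page, next_cont

def list_all(api, limit=500):
    """Fetch a full LIST in bounded pages so no single response
    strains etcd or the apiserver."""
    items, cont = [], None
    while True:
        page, cont = api.list(limit=limit, cont=cont)
        items.extend(page)
        if cont is None:
            return items

pods = list_all(FakeAPIServer([f"pod-{i}" for i in range(1200)]))
# 1,200 items retrieved in three pages of at most 500 each.
```

In a real client, an Informer would replace repeated calls to `list_all` entirely: one initial LIST plus an incremental WATCH keeps the local cache current at a fraction of the API cost.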
5.4 Post‑Incident Recovery
Conduct regular disaster‑recovery drills; enforce rate limiting on high‑frequency requests; avoid indefinite CoreDNS caching, which can delay and then amplify the impact of a control‑plane failure as records expire; and use fine‑grained API‑server metrics to identify offending workloads and downgrade them quickly.
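One common way to enforce client‑side rate limiting on high‑frequency requests is a token bucket. The standalone sketch below uses a manually supplied clock so it is deterministic; production clients would typically use a library limiter (e.g. client‑go's built‑in rate limiter) or rely on apiserver flow control rather than hand‑rolling this.

```python
class TokenBucket:
    """Allow 'rate' requests/sec sustained, with short bursts up to 'burst'."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), 0.0

    def allow(self, now):
        # Refill tokens for the elapsed time, then try to spend one.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

tb = TokenBucket(rate=5, burst=2)            # 5 req/s sustained, burst of 2
allowed = sum(tb.allow(now=0.0) for _ in range(10))
# At t=0 only the burst of 2 gets through; the other 8 are throttled.
```

Placing such a limiter in every agent bounds worst‑case API load to rate × agent‑count, which makes capacity planning for the control plane tractable.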
06 Summary
K8s is the industry‑standard infrastructure, and Prometheus is the de‑facto monitoring solution. Large‑scale clusters inevitably face risks; Alibaba Cloud continuously learns from incidents like OpenAI’s, refines its high‑availability designs and out‑of‑band observability, and provides best‑practice guidance to deliver a more stable and reliable foundation for users.
Alibaba Cloud Observability
Driving continuous progress in observability technology!