Alibaba Cloud’s Guide to Stable Large‑Scale Kubernetes After the OpenAI Outage
After the OpenAI outage caused massive Kubernetes API overload, Alibaba Cloud’s Container Service and Observability teams detail how they reinforce large‑scale K8s clusters with high‑availability control‑plane design, optimized Prometheus probing, out‑of‑band monitoring, and best‑practice guidelines for capacity planning, safe releases, and rapid incident response.
01 Preface
Kubernetes (K8s) has become the mainstream infrastructure for running containerized workloads; CNCF surveys show it is the de‑facto standard for container orchestration. The Kubernetes project officially recommends limiting a single cluster to 5,000 nodes, yet OpenAI has pushed clusters to 7,500 nodes, a scale that exposes the bottlenecks and fragility of large clusters.
02 OpenAI Incident Analysis
On December 11, 2024, OpenAI’s ChatGPT, Sora, and API suffered a severe outage. The root cause was a newly deployed observability feature that ran on every node and queried the Kubernetes Resource API, generating enough aggregate load to overwhelm the control plane; once the API servers failed, DNS‑based service discovery broke as well, taking down the data plane.
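A rough load model shows why a per‑node query pattern is dangerous at this scale. The probe interval and response size below are illustrative assumptions, not figures from the incident report; only the ~7,500‑node cluster size comes from public accounts.

```python
# Hypothetical back-of-envelope model: cumulative apiserver load when every
# node in the cluster polls the Kubernetes Resource API independently.

def apiserver_load(nodes: int, probe_interval_s: float, resp_kib: float):
    """Return (requests/sec, egress MiB/sec) generated cluster-wide."""
    rps = nodes / probe_interval_s      # each node issues one poll per interval
    mib_per_s = rps * resp_kib / 1024   # aggregate response bandwidth
    return rps, mib_per_s

# At ~7,500 nodes, even a modest 10 s poll returning a 64 KiB listing
# yields 750 req/s and ~47 MiB/s of sustained API traffic.
rps, bw = apiserver_load(nodes=7500, probe_interval_s=10, resp_kib=64)
print(f"{rps:.0f} req/s, {bw:.1f} MiB/s")
```

The key property is that load grows linearly with node count, so a feature that is harmless in a test cluster can saturate the control plane only at production scale.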
Alibaba Cloud’s Container Service (ACK) and Observability (Prometheus, Telemetry) teams use this case to illustrate their large‑scale K8s stability construction and share mitigation strategies for similar failures.
03 Risks of Large‑Scale K8s Clusters
K8s clusters are distributed systems; any component bottleneck can affect overall stability. The control plane (apiserver, etcd, scheduler, controller‑manager, cloud‑controller‑manager) handles API, scheduling, and resource management, while the data plane (kubelet, kube‑proxy, logging, monitoring, security) runs workloads.
Identifying and optimizing bottlenecks is essential for stable, efficient utilization of cloud resources.
04 Stability Enhancements for Large‑Scale ACK Clusters
4.1 High‑Availability Control Plane
ACK deploys control‑plane components across multiple AZs and nodes with multi‑replica HA. In a 3‑AZ region, SLA is 99.95%; in single‑AZ regions, SLA is 99.5%.
Control‑plane components scale automatically through combined VPA and HPA; etcd is vertically scaled against recommended resource profiles; and the apiserver applies dynamic rate limiting to absorb request surges.
4.2 Optimized Prometheus for Large Clusters
Prometheus adopts a two‑role architecture: Master handles service discovery, scheduling, and config distribution; Worker performs efficient metric collection and relabeling.
Standard resources are fetched via binary Protobuf for performance.
Only one Pod establishes List & Watch with the API server, reducing load.
Optional active‑standby replicas provide HA.
Worker instances scale out based on target count, and because they do not talk directly to the API server, scaling does not increase API load.
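The two‑role idea can be sketched as deterministic target sharding: the master discovers targets once and partitions them across workers, which never contact the API server. The hash‑based scheme below is an illustrative assumption, not ACK's actual implementation.

```python
import hashlib

def assign_targets(targets, workers):
    """Master-side sharding: deterministically assign each scrape target
    to exactly one worker replica, so workers collect without any
    apiserver interaction of their own."""
    shards = {w: [] for w in workers}
    for t in targets:
        idx = int(hashlib.md5(t.encode()).hexdigest(), 16) % len(workers)
        shards[workers[idx]].append(t)
    return shards

targets = [f"10.0.0.{i}:9100" for i in range(8)]
shards = assign_targets(targets, ["worker-0", "worker-1"])
# Every target lands on exactly one worker; adding workers only
# repartitions the same target set, leaving apiserver load unchanged.
```

Because sharding is a pure function of the target list, scaling workers out redistributes collection work without adding a single List & Watch connection.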
4.3 Out‑of‑Band Monitoring
Control‑plane metrics are exported via a dedicated out‑of‑band link, ensuring monitoring data remains available even when the cluster is unhealthy.
Node‑level events (ECS, OS, hardware) are also sent through out‑of‑band channels to SLS event centers.
05 Best Practices for Operating Large‑Scale K8s
5.1 Capacity Planning & Safe Release
Control cluster size through micro‑service decomposition and capacity planning. Adopt gray‑release, rollback, and continuous monitoring practices; abort deployments immediately upon detecting anomalies.
5.2 Pre‑emptive Observability & Alerting
Use ACK control‑plane dashboards and log monitoring to gain real‑time visibility. Enable Alibaba Cloud’s built‑in alert templates for critical conditions such as API‑server overload and SLB bandwidth saturation.
5.3 Development Best Practices
When performing a full LIST, set resourceVersion=0 to serve the response from the apiserver watch cache, or paginate with limit and continue tokens (note that limit is ignored when reading from the cache), so that no single request forces an expensive quorum read against etcd.
Prefer Protobuf over JSON for non‑CRD resources.
Use Informer (LIST+WATCH) instead of frequent LIST calls.
Throttle client LIST/WATCH frequency.
Introduce relay components to aggregate API requests from DaemonSets or ECI pods.
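The paginated‑LIST guidance above can be sketched with a simulated client. `FakeAPIServer` stands in for a real apiserver's limit/continue pagination; the names here are illustrative, not a real client library.

```python
class FakeAPIServer:
    """Stand-in for an apiserver that supports limit/continue pagination."""
    def __init__(self, items):
        self.items = items

    def list(self, limit, cont=None):
        # 'cont' plays the role of the opaque continue token returned
        # by the real apiserver.
        start = int(cont) if cont else 0
        page = self.items[start:start + limit]
        next_cont = str(start + limit) if start + limit < len(self.items) else None
        return page, next_cont

def list_all(api, limit=500):
    """Fetch a full LIST in bounded pages so no single response
    strains etcd or the apiserver."""
    items, cont = [], None
    while True:
        page, cont = api.list(limit=limit, cont=cont)
        items.extend(page)
        if cont is None:
            return items

pods = list_all(FakeAPIServer([f"pod-{i}" for i in range(1200)]))
# 1,200 items retrieved in three pages of at most 500 each.
```

In a real client, an Informer would replace repeated calls to `list_all` entirely: one initial LIST plus an incremental WATCH keeps the local cache current at a fraction of the API cost.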
5.4 Post‑Incident Recovery
Conduct regular disaster‑recovery drills; enforce rate limiting on high‑frequency requests; avoid indefinite CoreDNS caching, which can delay and then amplify the impact of a control‑plane failure as records expire; and use fine‑grained API‑server metrics to identify offending workloads and downgrade them quickly.
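One common way to enforce client‑side rate limiting on high‑frequency requests is a token bucket. The standalone sketch below uses a manually supplied clock so it is deterministic; production clients would typically use a library limiter (e.g. client‑go's built‑in rate limiter) or rely on apiserver flow control rather than hand‑rolling this.

```python
class TokenBucket:
    """Allow 'rate' requests/sec sustained, with short bursts up to 'burst'."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), 0.0

    def allow(self, now):
        # Refill tokens for the elapsed time, then try to spend one.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

tb = TokenBucket(rate=5, burst=2)            # 5 req/s sustained, burst of 2
allowed = sum(tb.allow(now=0.0) for _ in range(10))
# At t=0 only the burst of 2 gets through; the other 8 are throttled.
```

Placing such a limiter in every agent bounds worst‑case API load to rate × agent‑count, which makes capacity planning for the control plane tractable.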
06 Summary
K8s is the industry‑standard infrastructure, and Prometheus is the de‑facto monitoring solution. Large‑scale clusters inevitably face risks; Alibaba Cloud continuously learns from incidents like OpenAI’s, refines its high‑availability designs and out‑of‑band observability, and provides best‑practice guidance to deliver a more stable and reliable foundation for users.
Alibaba Cloud Observability
Driving continuous progress in observability technology!