
Ensuring Stability of Large‑Scale Kubernetes Clusters: Lessons from the OpenAI Incident and Alibaba Cloud Practices

This article analyses the large‑scale Kubernetes outage at OpenAI, explains the inherent risks of massive K8s clusters, and presents Alibaba Cloud's architectural enhancements, observability improvements, and best‑practice guidelines for achieving highly available, reliable operation of Kubernetes environments with thousands of nodes.

Alibaba Cloud Infrastructure

1. Introduction

Kubernetes has become the de‑facto standard for modern IT infrastructure. The project officially recommends limiting a single cluster to 5,000 nodes, yet OpenAI has run a 7,500‑node cluster; operating beyond the recommended scale exposes scalability bottlenecks and the inherent fragility of large distributed systems.

The recent OpenAI outage was triggered when a newly deployed observability component generated massive API traffic from every node simultaneously, overwhelming the Kubernetes control plane; with the API server unresponsive, DNS‑based service discovery failed, and the failure cascaded into widespread service disruption.

2. Risks and Challenges of Large‑Scale Clusters

When a cluster grows to thousands of nodes, any component—control‑plane API server, scheduler, etcd, or custom controllers—can become a performance choke point, leading to instability and reduced resilience.

3. Stability Enhancements from Alibaba Cloud

Alibaba Cloud’s Container Service for Kubernetes (ACK) adopts high‑availability designs such as multi‑AZ, multi‑replica control planes, VPA/HPA‑style elastic scaling of API servers, and resource‑isolated etcd configurations, achieving a 99.95 % SLA in three‑AZ regions.

Prometheus monitoring is reinforced with a two‑tier architecture: a Master role handling service discovery, scheduling, and configuration distribution, and a Worker role performing metric collection and reporting. This decouples load from the API server, uses protobuf for efficient data transfer, and scales out without increasing API pressure.
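The Master/Worker split can be illustrated with a toy target‑sharding function. This is a minimal sketch under assumed names (the actual ACK implementation is not public): the Master alone performs service discovery against the API server, then deterministically partitions scrape targets across Workers, so adding Workers scales collection without adding any List‑Watch load.

```python
import hashlib

def shard_targets(targets, workers):
    """Deterministically assign each scrape target to one worker by
    hashing its address. Only the Master talks to the API server;
    workers just receive their slice of the configuration."""
    assignment = {w: [] for w in workers}
    for t in targets:
        idx = int(hashlib.sha256(t.encode()).hexdigest(), 16) % len(workers)
        assignment[workers[idx]].append(t)
    return assignment

# Example: eight node exporters split across three workers.
targets = [f"10.0.0.{i}:9100" for i in range(8)]
print(shard_targets(targets, ["worker-a", "worker-b", "worker-c"]))
```

Because the assignment is a pure function of target and worker set, the Master can recompute and redistribute shards after discovery changes without coordinating state with the Workers.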

Exporters (e.g., kube‑state‑metrics) are rewritten to run as a single instance backed by columnar in‑memory storage, eliminating the multiple List‑Watch replicas that previously stressed the API server.
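The columnar idea can be sketched in a few lines. This is an illustrative toy, not the actual kube‑state‑metrics fork: each label becomes its own column (a plain list), so a query filtering on one label scans a single compact array instead of deserializing whole row objects.

```python
class ColumnarMetricStore:
    """Toy column-oriented layout: one Python list per label column
    plus a parallel list of sample values."""

    def __init__(self, label_columns):
        self.columns = {name: [] for name in label_columns}
        self.values = []

    def add(self, labels, value):
        # Every column gets an entry per row; missing labels become "".
        for name, column in self.columns.items():
            column.append(labels.get(name, ""))
        self.values.append(value)

    def values_where(self, column, wanted):
        # Scan only one column; touch values only for matching rows.
        col = self.columns[column]
        return [self.values[i] for i, v in enumerate(col) if v == wanted]
```

A single instance holding all series in this layout can answer per‑namespace or per‑pod queries cheaply, which is what removes the need for several replicated exporters each maintaining its own watch.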

Out‑of‑band data links are introduced so monitoring components run outside the customer’s cluster, guaranteeing observability even when the in‑cluster control plane is unavailable.

4. Best Practices for Users

• Capacity planning: split workloads across multiple clusters to keep node counts manageable.

• Safe release processes: employ canary, rollback, and continuous monitoring; pause between canary batches to verify behavior.

• Observability: enable ACK control‑plane dashboards, log monitoring, and pre‑configured alert rules for API‑server load, SLB bandwidth, and component health.

• API usage guidelines: use resourceVersion=0 for full LISTs so they are served from the API server cache, paginate with limit and continue, prefer protobuf over JSON for non‑CRD resources, and adopt the Informer mechanism in place of frequent LIST‑Watch cycles.

• Rate limiting: throttle high‑frequency List/Watch calls, especially from DaemonSets or custom controllers.

• Relay components: introduce horizontally scalable proxies (e.g., Poseidon) to aggregate API requests, reducing direct load on the control plane.

• Emergency response: conduct regular disaster‑recovery drills, configure APF rate‑limit rules, avoid permanent CoreDNS caching, and quickly isolate “culprit” workloads via fine‑grained control‑plane metrics.
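The pagination guideline above can be sketched as follows. This is a hedged illustration: `fetch_page` stands in for a real client call such as the official Python client's `list_namespaced_pod(limit=..., _continue=...)`, and the fake API server below exists only so the loop is runnable without a cluster.

```python
def paged_list(fetch_page, limit=500):
    """Drain a LIST in bounded chunks using limit/continue instead of
    one unbounded LIST that forces the API server to materialize
    everything at once."""
    items, cont = [], None
    while True:
        page = fetch_page(limit=limit, cont=cont)
        items.extend(page["items"])
        cont = page.get("continue")
        if not cont:  # empty continue token means the LIST is complete
            return items

def make_fake_api(objects):
    """Stand-in for an API server: serves slices of a list and hands
    back an opaque continue token until the list is exhausted."""
    def fetch_page(limit, cont):
        start = int(cont or 0)
        nxt = start + limit
        return {
            "items": objects[start:nxt],
            "continue": str(nxt) if nxt < len(objects) else "",
        }
    return fetch_page

pods = [f"pod-{i}" for i in range(1200)]
print(len(paged_list(make_fake_api(pods), limit=500)))  # 1200
```

Each request now carries a fixed worst‑case cost for the control plane, which is the point of the guideline; for workloads that need a continuously fresh view, an Informer (local cache plus watch) is still preferable to repeating even paginated LISTs.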
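The client‑side rate‑limiting guideline can likewise be sketched with a token bucket. client-go ships a real limiter (its QPS/Burst flow control); the class below is an illustrative stand‑in with an injected clock so its behavior is deterministic.

```python
class TokenBucket:
    """Toy client-side throttle for high-frequency List/Watch calls:
    allows bursts up to `burst`, then refills at `rate` tokens/sec."""

    def __init__(self, rate, burst, now):
        self.rate, self.burst = rate, burst
        self.tokens = burst  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5.0, burst=10, now=0.0)
# A DaemonSet agent wraps every List call: if allow() returns False,
# it backs off instead of hammering the API server.
```

Applying this in every node agent turns a thundering herd (the failure mode in the OpenAI incident) into a bounded, smooth request rate per node.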

5. Conclusion

Large‑scale Kubernetes deployments inevitably face performance limits, but with cloud‑native high‑availability architectures, robust observability, and disciplined operational practices—as demonstrated by Alibaba Cloud—organizations can achieve stable, reliable service even at thousands of nodes.

Tags: cloud native, operations, observability, high availability, Kubernetes, Prometheus, large-scale clusters
Written by

Alibaba Cloud Infrastructure

For uninterrupted computing services
