What Caused OpenAI’s Global Outage? Lessons for Cloud‑Native Observability
The article analyzes the December 11 OpenAI outage, revealing that a newly deployed telemetry service overloaded Kubernetes API servers, breaking DNS resolution and slowing recovery, and compares OpenAI’s approach with LoongCollector/iLogtail’s design to offer stability insights for cloud‑native environments.
Background
On December 11, 2024, OpenAI experienced a global outage affecting ChatGPT, the API, Sora, Playground, and Labs from 3:16 PM to 7:38 PM PST, a window of more than four hours.
Root Cause Analysis
Root Cause
OpenAI operates hundreds of Kubernetes clusters globally. Kubernetes has a control plane responsible for cluster administration and a data plane that actually serves workloads such as model inference.
As part of a push to improve reliability across the organization, OpenAI had been strengthening its cluster-wide observability tooling to gain better visibility into system state. At 3:12 PM PST, it deployed a new telemetry service to collect detailed Kubernetes control plane metrics.
Telemetry services have a very wide footprint, and this new service's configuration unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster. With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the control plane in most of the large clusters. Because the issue was most pronounced in the largest clusters, pre-deployment testing didn't catch it, and DNS caching made it far less visible until the rollout had gone fleet-wide.
The Kubernetes data plane can operate largely independently of the control plane, but DNS depends on the control plane: without it, services cannot discover how to contact one another.
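To see why the load grew so fast, consider a back-of-the-envelope model (the function and numbers below are hypothetical illustrations, not figures from OpenAI's report): if each node's agent performs a full List whose cost is proportional to the cluster-wide object count, and the object count itself grows with the node count, aggregate API-server work grows quadratically with cluster size.

```python
def aggregate_list_cost(nodes: int, objects_per_node: int = 10) -> int:
    """Hypothetical model: each node's agent issues a List whose cost is
    proportional to the cluster-wide object count, which itself scales
    with node count, so total API-server work grows as O(nodes^2)."""
    objects_in_cluster = nodes * objects_per_node
    return nodes * objects_in_cluster  # every node pays the full-cluster cost

# A 10x larger cluster produces 100x the API-server work:
print(aggregate_list_cost(100))   # 100000 cost units
print(aggregate_list_cost(1000))  # 10000000 cost units
```

Under this model, a test cluster an order of magnitude smaller than production would experience two orders of magnitude less load, which is consistent with the rollout looking healthy in testing.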
In short: the newly deployed telemetry service's configuration generated massive request load against the Kubernetes API servers of large clusters, overwhelming the control plane, breaking DNS-based service discovery, and ultimately impairing the data plane.
Key Questions
Why did the observability service generate such a large number of API Server requests?
How does the API Server affect DNS resolution?
Why was fault recovery so slow?
K8s Scenario: Why Observability Services Need API Server Access
Collecting Pod information and metrics and associating metadata often require querying the API Server. For example, Prometheus Operator uses PodMonitor/ServiceMonitor resources, whose label selectors trigger API Server requests to discover target Pods/Services.
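These discovery calls are ordinary REST requests against the API server. The helper below (a hypothetical sketch, not Prometheus Operator code) builds the standard Kubernetes paths such calls use; each List, and each long-lived watch, lands on the API server:

```python
from urllib.parse import urlencode

def pod_discovery_url(namespace: str, label_selector: str = "",
                      watch: bool = False) -> str:
    """Build the Kubernetes REST path a discovery client would request
    (illustrative helper; parameter names follow the real API's query keys)."""
    params = {}
    if label_selector:
        params["labelSelector"] = label_selector
    if watch:
        params["watch"] = "true"
    query = "?" + urlencode(params) if params else ""
    return f"/api/v1/namespaces/{namespace}/pods{query}"

print(pod_discovery_url("monitoring", label_selector="app=web"))
# /api/v1/namespaces/monitoring/pods?labelSelector=app%3Dweb
```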
OpenAI’s Approach
OpenAI’s telemetry service was deployed cluster-wide (similar to a DaemonSet). The agent pod on each node performed List-Watch operations on many resource types, creating a massive number of watch connections. In large clusters (hundreds or thousands of nodes), this put significant pressure on the API Server, leading to overload.
LoongCollector/iLogtail’s Approach
LoongCollector follows an All‑in‑One design, using a single agent for logs, metrics, traces, events, and profiles. To reduce API Server impact, it:
Obtains container metadata directly from the container runtime on each node.
Uses a single-replica deployment to List-Watch the API Server for metric collection, avoiding per-node API requests.
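The difference between the two designs can be made concrete with a toy count (the numbers are hypothetical): a DaemonSet-style agent multiplies watch streams by node count, while a single-replica collector keeps them constant regardless of cluster size.

```python
def watch_streams(agents: int, resource_types: int) -> int:
    """Concurrent watch streams the API server must serve, assuming each
    agent opens one watch per resource type (an illustrative model)."""
    return agents * resource_types

NODES, RESOURCE_TYPES = 1000, 8
print(watch_streams(NODES, RESOURCE_TYPES))  # per-node agents: 8000 streams
print(watch_streams(1, RESOURCE_TYPES))      # single replica:  8 streams
```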
Why the API Server Affects DNS Resolution
Kubernetes DNS (CoreDNS) watches Service and Endpoint objects via the API Server. If the API Server is overloaded or unavailable, DNS records cannot be updated, causing service discovery failures.
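The masking effect of DNS caching can be sketched with a minimal TTL cache (an illustrative model, not CoreDNS internals): cached answers keep resolving even after the upstream control plane becomes unreachable, so the failure only surfaces as entries expire.

```python
class TtlDnsCache:
    """Minimal TTL cache: serves answers from memory until the TTL elapses,
    only then consulting the upstream resolver (illustrative sketch)."""
    def __init__(self, ttl: float):
        self.ttl = ttl
        self.entries = {}  # name -> (address, cached_at)

    def resolve(self, name: str, now: float, upstream) -> str:
        entry = self.entries.get(name)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]           # cache hit: upstream never consulted
        address = upstream(name)      # raises if the control plane is down
        self.entries[name] = (address, now)
        return address

def upstream_down(name):
    raise RuntimeError("API server unreachable")

cache = TtlDnsCache(ttl=30.0)
cache.resolve("backend", now=0.0, upstream=lambda name: "10.0.0.7")
cache.resolve("backend", now=10.0, upstream=upstream_down)  # still answers: cached
# At now=40.0 the entry has expired, and the same lookup would raise.
```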
Why Fault Recovery Was Slow
Even after identifying the problematic telemetry service, removing it required kubectl delete, but the API Server itself was unresponsive to management commands. OpenAI's recovery steps included:
Scaling down the cluster to reduce overall API load.
Blocking network access to the Kubernetes admin APIs to stop new high-cost requests.
Expanding API Server resources to handle pending requests.
After deleting the faulty telemetry service, the cluster returned to normal.
Lessons for Stability
Design observability solutions to minimize impact on core components such as the API Server and DNS.
Monitor the agent's own resource consumption (CPU, memory, file handles) and avoid circular dependencies between services.
Conduct stress testing at large cluster scale to ensure the architecture remains performant.
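One concrete pattern for the first lesson is client-side rate limiting inside the agent, so a misconfiguration degrades into throttled collection rather than an API-server overload. A minimal token-bucket sketch (illustrative only, not any particular agent's implementation):

```python
class TokenBucket:
    """Client-side rate limiter: an agent that throttles its own API calls
    fails soft instead of flooding the control plane (illustrative sketch)."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate          # tokens replenished per second
        self.capacity = burst     # maximum burst size
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Replenish tokens for the time elapsed, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False              # caller should back off, not retry hot

limiter = TokenBucket(rate=1.0, burst=2)
print([limiter.allow(0.0) for _ in range(3)])  # [True, True, False]
print(limiter.allow(1.0))                      # True: one token replenished
```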
References
[1] https://status.openai.com/incidents/ctrsv3lwd797
[2] https://github.com/kubernetes/kubernetes/blob/v1.25.2/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/get.go#L263
Alibaba Cloud Observability