
What Caused OpenAI’s Massive Outage? Inside the Kubernetes Failure and Recovery

On December 11, OpenAI suffered a severe outage across ChatGPT, its API, and Sora due to a misconfigured telemetry service that overloaded Kubernetes control planes worldwide, prompting a cascade of failures and a coordinated recovery effort.


On December 11, 2024, all OpenAI services—including ChatGPT, the API, and Sora—suffered severe performance degradation and, at the peak, a complete outage between 3:16 PM PST and 7:38 PM PST.

The incident was triggered by a misconfiguration in a newly deployed telemetry service, which caused every node in hundreds of global Kubernetes (K8s) clusters to execute resource‑intensive API operations simultaneously, overwhelming the control planes.

Impact

ChatGPT began significant recovery at 5:45 PM PST and was fully restored by 7:01 PM PST.

API services started recovering at 5:36 PM PST and were completely back online by 7:38 PM PST.

Sora was fully restored at 7:01 PM PST.

Root Cause

OpenAI runs hundreds of Kubernetes clusters worldwide. The new telemetry service’s configuration caused all nodes to perform high‑load Kubernetes API calls at once, especially in large clusters.

These calls overloaded the Kubernetes API servers, crippling the control plane. Because the data‑plane DNS service depends on the control plane, data‑plane services failed as well, turning a control‑plane overload into a cascading, platform‑wide outage.
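A standard mitigation for this kind of thundering‑herd load is to stagger expensive API calls with randomized ("full jitter") exponential backoff, so that thousands of nodes retrying at once do not hit the API servers in lockstep. A minimal sketch, not OpenAI's actual code (function names and parameters are illustrative):

```python
import random
import time

def backoff_delays(base=1.0, cap=60.0, attempts=5, rng=random.random):
    """Full-jitter exponential backoff: each retry waits a random
    duration in [0, min(cap, base * 2**attempt)), so a fleet of
    nodes spreads its retries out instead of synchronizing."""
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]

def call_with_backoff(call, attempts=5):
    """Run `call`, sleeping a jittered delay after each failure."""
    for delay in backoff_delays(attempts=attempts):
        try:
            return call()
        except ConnectionError:
            time.sleep(delay)
    return call()  # final attempt lets the error propagate
```

Capping the delay (`cap`) matters as much as the jitter: without it, late retries would back off so far that recovery itself slows down.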

Several factors compounded the failure:

Insufficient service decoupling: data‑plane services relied on control‑plane availability.

Configuration error: the telemetry service triggered simultaneous high‑load API operations.

Incomplete deployment process: lack of staged rollout mechanisms.

Missing fault‑injection testing: no error‑injection or disaster‑recovery drills.

Restricted emergency access: with the API servers overloaded, engineers could not reach the control plane to roll the change back quickly.
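The decoupling and fault‑injection points can be illustrated together: a data‑plane resolver that serves stale cached records when the control plane is unreachable ("stale‑if‑error") survives an injected control‑plane failure instead of amplifying it. A hypothetical sketch (the class and its methods are illustrative, not OpenAI's internals):

```python
import time

class CachingResolver:
    """Data-plane DNS cache that falls back to stale records
    when the control-plane lookup fails (stale-if-error)."""

    def __init__(self, lookup, ttl=30.0, clock=time.monotonic):
        self._lookup = lookup   # control-plane-dependent resolver
        self._ttl = ttl
        self._clock = clock
        self._cache = {}        # name -> (address, fetched_at)

    def resolve(self, name):
        entry = self._cache.get(name)
        if entry and self._clock() - entry[1] < self._ttl:
            return entry[0]     # fresh cache hit
        try:
            addr = self._lookup(name)
        except ConnectionError:
            if entry:           # control plane down: serve stale
                return entry[0]
            raise               # nothing cached to fall back on
        self._cache[name] = (addr, self._clock())
        return addr
```

A fault‑injection test then simply swaps `lookup` for a function that raises and asserts that previously resolved names still resolve—exactly the drill the postmortem says was missing.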

Emergency Measures

Reduced cluster size to lower Kubernetes API load.

Blocked network access to the Kubernetes management API to stop new high‑load requests.

Scaled up the Kubernetes API servers to handle pending requests.

Removed the problematic telemetry service.

Shifted traffic away from affected clusters to gradually restore service.
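Shifting traffic away from an affected cluster can be modeled as weighted routing, where draining a cluster is just setting its weight to zero. A simplified sketch under that assumption (the routing table and function are hypothetical):

```python
import random

def pick_cluster(weights, rng=random.random):
    """Weighted routing: choose a cluster with probability
    proportional to its weight. Draining an unhealthy cluster
    is done by setting its weight to 0."""
    total = sum(weights.values())
    if total <= 0:
        raise RuntimeError("no healthy clusters available")
    r = rng() * total
    for cluster, weight in sorted(weights.items()):
        r -= weight
        if r < 0:
            return cluster
    return cluster  # guard against float rounding at the boundary
```

The appeal of this scheme during an incident is that recovery is gradual: a recovered cluster can be reintroduced at a small weight and ramped up while its API servers drain the request backlog.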

Timeline

Dec 10: New telemetry service deployed to staging clusters and passed verification.

Dec 11, 2:23 PM PST: Change introduced into the deployment pipeline.

2:51–3:20 PM PST: Change applied to all clusters.

3:13 PM PST: Alerts fired; engineers notified.

3:16 PM PST: Minor customer impact begins.

3:16 PM PST: Root cause identified.

3:27 PM PST: Engineers begin moving traffic out of affected clusters.

3:40 PM PST: Customer impact peaks.

4:36 PM PST: First cluster successfully recovers.

7:38 PM PST: All clusters fully recovered.

Preventive Measures

Implement robust staged release mechanisms and continuous monitoring of infrastructure changes.

Introduce fault‑injection testing, including control‑plane failures and malicious changes, to ensure rapid detection and rollback.

Establish emergency access to the Kubernetes control plane so engineers can reach the API server even under heavy load.

Decouple the Kubernetes data plane from the control plane, reducing DNS‑related dependencies.

Accelerate recovery by improving caching, dynamic rate limiting, and conducting regular disaster‑recovery drills.
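The "dynamic rate limiting" item could, for example, take the form of a token bucket in front of expensive control‑plane calls: clients get a sustained request rate plus a bounded burst, and excess load is shed instead of queued against an already‑struggling API server. A minimal sketch (parameters are illustrative, not OpenAI's configuration):

```python
class TokenBucket:
    """Token-bucket limiter: permits `rate` requests per second
    with bursts up to `capacity`; callers are rejected (not
    queued) when the bucket is empty, shedding load from the
    API server instead of building a backlog."""

    def __init__(self, rate, capacity, clock):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)   # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Making the `rate` adjustable at runtime is what turns this into *dynamic* rate limiting: operators can tighten it during an incident and relax it as the control plane recovers.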

Summary

OpenAI apologized for the widespread impact on ChatGPT users, developers, and enterprises, acknowledged the importance of high‑reliability services, and committed to the preventive actions above to improve future reliability.

Tags: Kubernetes, incident management, OpenAI, cloud operations, outage

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together.
