Why Did OpenAI’s New Telemetry Crash Their Kubernetes Cluster?
On December 11, 2024, OpenAI’s Kubernetes cluster suffered a roughly four‑hour outage after a newly deployed telemetry service generated massive API traffic from every node, overwhelming the kube‑apiserver, breaking DNS‑based service discovery, and exposing gaps in control‑plane monitoring and break‑glass mechanisms. The incident raises critical questions about component behavior and configuration.
Incident Overview
On December 11, 2024, OpenAI’s Kubernetes cluster experienced a failure that took down services such as the API, ChatGPT, and Sora for 4 hours and 22 minutes. Although the official post‑mortem and many media reports have covered the incident, the author remains unsatisfied and raises additional questions.
Root Cause Analysis
At 3:12 PM PST, we deployed a new telemetry service to collect detailed Kubernetes control plane metrics.
Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused every node in each cluster to execute resource‑intensive Kubernetes API operations whose cost scaled with the size of the cluster.
The Kubernetes data plane can operate largely independently of the control plane, but DNS relies on the control plane – services don’t know how to contact one another without the Kubernetes control plane.
In short, according to the official post‑mortem, the root cause was the deployment of a new telemetry service intended to collect detailed control‑plane metrics. Its configuration unexpectedly caused every node in the cluster to perform resource‑intensive Kubernetes API operations, so the more nodes a cluster had, the heavier the load on the kube-apiserver, which was ultimately overwhelmed and took DNS‑based service discovery down with it.
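The post‑mortem does not spell out exactly which calls the telemetry agents made, but the classic way a per‑node agent becomes this expensive is by listing or watching cluster‑wide resources instead of scoping requests to its own node. The Go sketch below illustrates that class of mistake with client-go; it is an assumption about the failure mode, not OpenAI’s actual configuration, and every name in it is illustrative.

```go
// Hypothetical per-node telemetry agent (runs as a DaemonSet pod on every node).
// This is a sketch of the *class* of mistake described in the post-mortem,
// not OpenAI's actual configuration.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	nodeName := os.Getenv("NODE_NAME") // injected via the downward API

	// Expensive pattern: every node lists EVERY pod in the cluster.
	// With N nodes and P pods, each scrape costs O(P) on the apiserver and
	// the cluster-wide total is O(N*P): the load grows with cluster size twice over.
	allPods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("pods in cluster:", len(allPods.Items))

	// Cheaper pattern: scope the request to this node with a field selector,
	// so each agent only pays for its own pods.
	myPods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("pods on this node:", len(myPods.Items))
}
```

A node‑scoped watch (shared informer) would be cheaper still than a periodic list, since it avoids re‑transferring the full object set on every scrape.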
Open Questions
What exactly did the telemetry service do that caused nodes to call the kube-apiserver more frequently?
If the telemetry service’s purpose was to collect control‑plane metrics, why did it involve the nodes?
Which specific component(s) performed the increased operations against the kube-apiserver?
DNS itself should not be blamed; its caching and resolution behavior are inherent, and using and configuring it properly is the engineers’ responsibility.
Discovering the problem only after DNS fails is too late; since the new service interacts with the control plane, why wasn’t the control‑plane monitoring more robust?
Services inside the cluster rely on DNS‑based service discovery, and cluster DNS in turn builds its records from the kube-apiserver. Once cached DNS records expired, resolution depended on a control plane that was already down, so internal domain names stopped resolving; those communication failures were easy to spot and quickly pinpointed the outage.
DNS caching mitigated the impact temporarily by providing stale but functional DNS records. However, as cached records expired over the following 20 minutes, services began failing due to their reliance on real‑time DNS resolution.
This timing was critical because it delayed the visibility of the issue, allowing the rollout to continue before the full scope of the problem was understood.
Once the DNS caches were empty, the query load on the DNS servers multiplied, adding further load to the control plane and further complicating immediate mitigation.
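To make the dependency chain concrete: cluster DNS is not a static zone file; it keeps its records in sync with the Kubernetes API and answers queries from that synchronized view. The sketch below is a conceptual stand‑in for that behavior, built on a client-go informer; it is not CoreDNS’s actual implementation, and the names are illustrative.

```go
// Conceptual sketch of why cluster DNS depends on the control plane.
// Real CoreDNS is more involved; this only mimics the dependency:
// records are kept in sync with the Kubernetes API via a watch, and
// lookups are answered from that in-memory view.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch Services through the apiserver; this is the control-plane dependency.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	svcLister := factory.Core().V1().Services().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, factory.Core().V1().Services().Informer().HasSynced)

	// "Resolve" a service name from the locally cached view. If the apiserver
	// is down, this view can no longer be refreshed: answers go stale and,
	// once downstream caches expire, resolution fails.
	resolve := func(namespace, name string) (*corev1.Service, error) {
		return svcLister.Services(namespace).Get(name)
	}

	if svc, err := resolve("default", "kubernetes"); err == nil {
		fmt.Println("resolved:", svc.Name, "->", svc.Spec.ClusterIP)
	}
}
```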
To remove the offending service, engineers needed access to the kube-apiserver, but the apiserver itself was overwhelmed, so immediate remediation was effectively blocked.
The kube-apiserver provides the --max-requests-inflight and --max-mutating-requests-inflight flags to cap concurrent requests. Were the engineers unaware of these settings?
The kube-apiserver also offers the API Priority and Fairness (APF) mechanism to prioritize requests, ensuring administrators can still control the server under overload.
OpenAI’s engineers appear to have lacked deep knowledge of these features; were they using an outdated Kubernetes version or heavily customizing it?
What is the “break‑glass” mechanism they mention, and isn’t it essentially what APF does?
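To make the APF question concrete, here is a minimal sketch of what such a guardrail could look like: requests from a hypothetical telemetry ServiceAccount (`telemetry-agent` in the `monitoring` namespace; both names are assumptions) are routed into their own low‑share priority level that rejects excess traffic instead of letting it pile up. It uses the flowcontrol.apiserver.k8s.io/v1 API (Kubernetes 1.29+); with APF enabled, the sum of the two in‑flight flags above becomes the total concurrency budget that priority levels like this one divide up.

```go
// Sketch: confine a (hypothetical) telemetry agent to its own low-priority
// APF bucket so it cannot starve the rest of the apiserver's capacity.
// Names and numbers are illustrative, not OpenAI's configuration.
package main

import (
	"context"

	flowcontrolv1 "k8s.io/api/flowcontrol/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/utils/ptr"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// A small, isolated slice of apiserver concurrency for telemetry traffic.
	// Excess requests are rejected (HTTP 429) instead of piling up in queues.
	plc := &flowcontrolv1.PriorityLevelConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "telemetry-low"},
		Spec: flowcontrolv1.PriorityLevelConfigurationSpec{
			Type: flowcontrolv1.PriorityLevelEnablementLimited,
			Limited: &flowcontrolv1.LimitedPriorityLevelConfiguration{
				NominalConcurrencyShares: ptr.To[int32](5),
				LimitResponse: flowcontrolv1.LimitResponse{
					Type: flowcontrolv1.LimitResponseTypeReject,
				},
			},
		},
	}

	// Route every request made by the telemetry ServiceAccount into that level.
	fs := &flowcontrolv1.FlowSchema{
		ObjectMeta: metav1.ObjectMeta{Name: "telemetry-agents"},
		Spec: flowcontrolv1.FlowSchemaSpec{
			PriorityLevelConfiguration: flowcontrolv1.PriorityLevelConfigurationReference{
				Name: "telemetry-low",
			},
			MatchingPrecedence: 500, // evaluated before the broad default schemas
			Rules: []flowcontrolv1.PolicyRulesWithSubjects{{
				Subjects: []flowcontrolv1.Subject{{
					Kind: flowcontrolv1.SubjectKindServiceAccount,
					ServiceAccount: &flowcontrolv1.ServiceAccountSubject{
						Namespace: "monitoring",      // hypothetical
						Name:      "telemetry-agent", // hypothetical
					},
				}},
				ResourceRules: []flowcontrolv1.ResourcePolicyRule{{
					Verbs:        []string{flowcontrolv1.VerbAll},
					APIGroups:    []string{flowcontrolv1.APIGroupAll},
					Resources:    []string{flowcontrolv1.ResourceAll},
					ClusterScope: true,
					Namespaces:   []string{flowcontrolv1.NamespaceEvery},
				}},
			}},
		},
	}

	if _, err := client.FlowcontrolV1().PriorityLevelConfigurations().Create(ctx, plc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	if _, err := client.FlowcontrolV1().FlowSchemas().Create(ctx, fs, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

Whether a schema like this existed for the new telemetry service is exactly the open question; the suggested defaults group all workload service accounts together and would not, on their own, isolate a single misbehaving agent.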
Implications and Recommendations
The kube-apiserver is a critical dependency far beyond DNS, so cluster monitoring must pay special attention to it. Additionally, many components can store their data directly in etcd rather than going through the Kubernetes API; for high availability, consider pointing such components at a separate etcd instance to reduce their reliance on the control plane.
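As an example of that last point, a component that only needs a small amount of shared state does not have to route it through the Kubernetes API at all. The sketch below writes directly to a separate etcd cluster using the official etcd v3 client; the endpoint is a hypothetical placeholder and must point at an etcd that is not the one backing the kube-apiserver.

```go
// Sketch: a component keeping its own state in a *separate* etcd cluster,
// so it keeps working even while the Kubernetes control plane is overloaded.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-apps.internal:2379"}, // hypothetical, NOT the apiserver's etcd
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Write and read a piece of component state without touching the
	// kube-apiserver or its etcd.
	if _, err := cli.Put(ctx, "/myapp/config/endpoint", "10.0.0.42:8080"); err != nil {
		panic(err)
	}
	resp, err := cli.Get(ctx, "/myapp/config/endpoint")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```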
References
https://status.openai.com/incidents/ctrsv3lwd797
https://kubernetes.io/docs/concepts/cluster-administration/flow-control/