How a New Telemetry Service Overwhelmed OpenAI’s Kubernetes API Server
An in‑depth post‑mortem reveals how OpenAI’s newly deployed telemetry service generated massive Kubernetes API requests, overloading the API server, breaking DNS resolution, and slowing recovery, while contrasting OpenAI’s approach with LoongCollector/iLogtail’s design to minimize API load and improve cluster stability.
On December 11, OpenAI suffered a global outage affecting ChatGPT, its API, Sora, Playground, and Labs, lasting over four hours. The official post‑mortem identifies the root cause as a newly deployed telemetry service whose configuration caused every node in each Kubernetes cluster to execute resource‑intensive API operations, overwhelming the API servers and breaking DNS‑based service discovery.
Root Cause
OpenAI operates hundreds of Kubernetes clusters globally. Kubernetes has a control plane responsible for cluster administration and a data plane from where we actually serve workloads like model inference.
As part of a push to improve reliability across the organization, we’ve been working to improve our cluster‑wide observability tooling to strengthen visibility into the state of our systems. At 3:12 PM PST, we deployed a new telemetry service to collect detailed Kubernetes control plane metrics.
Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused every node in each cluster to execute resource‑intensive Kubernetes API operations whose cost scaled with the size of the cluster. With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters. This issue was most pronounced in our largest clusters, so our testing didn’t catch it – and DNS caching made the issue far less visible until the rollouts had begun fleet‑wide.
The Kubernetes data plane can operate largely independently of the control plane, but DNS relies on the control plane – services don’t know how to contact one another without the Kubernetes control plane.
In short, a newly deployed observability service flooded the Kubernetes API Server with requests. In large clusters the load overwhelmed the control plane, which broke DNS‑based service discovery and in turn disrupted the data plane.
Key Questions
Why does an observability service generate so many API Server requests?
How does the API Server impact DNS resolution?
Why is fault recovery so slow?
Kubernetes Context: Why Observability Services Need the API Server
Observability data collection often requires fetching pod metadata, which is obtained via API Server calls (e.g., List‑Watch of Pods, Services, Endpoints). This metadata is essential for correlating logs, metrics, and traces with the originating containers.
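The List‑Watch pattern mentioned above can be sketched with an in‑memory stand‑in for the API Server. This is a minimal illustration, not a real Kubernetes client: `FakeApiServer`, `MetadataCache`, and their methods are hypothetical names, and the integer `version` only loosely mirrors how a resource version lets a client resume a watch from the point where its list snapshot ended.

```python
from dataclasses import dataclass, field


@dataclass
class FakeApiServer:
    """Hypothetical in-memory stand-in for an API server's Pod store."""
    version: int = 0
    pods: dict = field(default_factory=dict)  # pod name -> node name
    log: list = field(default_factory=list)   # (version, event type, pod, node)

    def apply(self, event_type, name, node=None):
        """Record a change and bump the version counter."""
        self.version += 1
        if event_type == "DELETED":
            node = self.pods.pop(name)
        else:
            self.pods[name] = node
        self.log.append((self.version, event_type, name, node))

    def list_pods(self):
        """LIST: return a full snapshot plus the version it reflects."""
        return dict(self.pods), self.version

    def watch_pods(self, since_version):
        """WATCH: return only the changes made after since_version."""
        return [event for event in self.log if event[0] > since_version]


class MetadataCache:
    """Client-side cache built from one LIST, then kept current by WATCH events."""

    def __init__(self, server):
        self.server = server
        self.pods, self.version = server.list_pods()

    def sync(self):
        for version, event_type, name, node in self.server.watch_pods(self.version):
            if event_type == "DELETED":
                self.pods.pop(name, None)
            else:
                self.pods[name] = node
            self.version = version


server = FakeApiServer()
server.apply("ADDED", "web-1", "node-a")
cache = MetadataCache(server)          # initial LIST sees web-1
server.apply("ADDED", "web-2", "node-b")
server.apply("DELETED", "web-1")
cache.sync()                           # WATCH replays only the two later events
print(cache.pods)                      # {'web-2': 'node-b'}
```

The expensive step is the initial LIST, which returns every object, while each later WATCH event is small. The server still has to deliver every event to every client that holds a watch.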
OpenAI’s Approach
The telemetry service was deployed as a DaemonSet‑like component, running on every node and performing extensive List‑Watch operations on many resource types. For each watch, the API Server spawns two goroutines, processes every object change, and serializes updates to the watching client. In large clusters (hundreds to thousands of nodes), that per‑watch cost is multiplied by the number of nodes, creating massive pressure on the API Server.
Testing missed the issue because the staging cluster was too small to reproduce the load, and DNS caching masked the failure until the rollout was fleet‑wide.
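A rough back‑of‑the‑envelope model shows why a small staging cluster would not reveal the problem. It assumes, purely for illustration, that every node's agent performs a cluster‑wide LIST of Pods and holds a fixed number of watches. The figures of 30 pods per node and 5 watches per agent are invented for the example; only the two goroutines per watch comes from the text above.

```python
def initial_list_objects(nodes, pods_per_node):
    """Objects the API server serializes if every node's agent lists all Pods."""
    total_pods = nodes * pods_per_node
    return nodes * total_pods  # one full copy of the Pod list per agent


def watch_goroutines(nodes, watches_per_agent, goroutines_per_watch=2):
    """Server-side goroutines held open by per-node watches."""
    return nodes * watches_per_agent * goroutines_per_watch


# Hypothetical cluster sizes: a small staging cluster and a large production one.
for label, nodes in (("staging", 50), ("production", 5000)):
    objects = initial_list_objects(nodes, pods_per_node=30)
    goroutines = watch_goroutines(nodes, watches_per_agent=5)
    print(f"{label}: {objects:,} listed objects, {goroutines:,} goroutines")
# staging: 75,000 listed objects, 500 goroutines
# production: 750,000,000 listed objects, 50,000 goroutines
```

Under these assumptions, 100 times more nodes produces 10,000 times more listed objects, because both the number of callers and the size of each response grow with the cluster.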
LoongCollector/iLogtail’s Approach
LoongCollector follows an “All‑in‑One” design, using a single agent to collect logs, metrics, traces, events, and profiles. It reduces API Server impact by:
Interacting directly with the container runtime on each node to obtain basic metadata, avoiding per‑node API calls.
Using a single‑replica List‑Watch for cluster‑wide metric collection, eliminating the massive request burst.
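The first point can be illustrated with a small sketch of node‑local metadata lookup. It is not LoongCollector's actual code: the record shape (`id`, `labels`) is a hypothetical simplification of what a container runtime returns, and the sketch assumes the runtime exposes the `io.kubernetes.*` labels that the kubelet conventionally attaches to the containers it creates.

```python
# Label keys the kubelet conventionally attaches to containers it creates.
POD_NAME = "io.kubernetes.pod.name"
POD_NAMESPACE = "io.kubernetes.pod.namespace"
CONTAINER_NAME = "io.kubernetes.container.name"


def metadata_from_runtime(containers):
    """Build a container ID -> metadata map from node-local runtime records only."""
    index = {}
    for record in containers:
        labels = record.get("labels", {})
        if POD_NAME not in labels:
            continue  # not a Kubernetes-managed container
        index[record["id"]] = {
            "namespace": labels.get(POD_NAMESPACE, ""),
            "pod": labels[POD_NAME],
            "container": labels.get(CONTAINER_NAME, ""),
        }
    return index


# Hypothetical records, shaped like a simplified runtime listing.
local_containers = [
    {"id": "c1", "labels": {POD_NAME: "web-1", POD_NAMESPACE: "shop",
                            CONTAINER_NAME: "nginx"}},
    {"id": "c2", "labels": {}},  # started outside Kubernetes
]
print(metadata_from_runtime(local_containers))
# {'c1': {'namespace': 'shop', 'pod': 'web-1', 'container': 'nginx'}}
```

Because every lookup is answered from data already on the node, the number of API Server requests for this basic metadata does not grow with the number of nodes.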
Why the API Server Affects DNS Resolution
Kubernetes DNS (CoreDNS) watches Service and Endpoint objects via the API Server. When the API Server is down, CoreDNS cannot receive updates, causing stale DNS records and service discovery failures.
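The way caching can hide this failure for a while can be sketched with a generic TTL cache in front of an unreliable upstream. This is a simplified model, not CoreDNS behavior: `CachingResolver`, `FlakyUpstream`, the 30‑second TTL, and the address are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class FlakyUpstream:
    """Hypothetical upstream resolver that can be switched off."""

    def __init__(self, records):
        self.records = records
        self.healthy = True

    def __call__(self, name):
        if not self.healthy:
            raise LookupError(f"upstream unavailable for {name}")
        return self.records[name]


class CachingResolver:
    """Serves cached answers until their TTL expires, then asks upstream."""

    def __init__(self, upstream, ttl_seconds):
        self.upstream = upstream
        self.ttl = ttl_seconds
        self.cache = {}  # name -> (address, expires_at)

    def resolve(self, name, now):
        entry = self.cache.get(name)
        if entry and now < entry[1]:
            return entry[0]            # still fresh: upstream is not contacted
        address = self.upstream(name)  # raises if upstream is down
        self.cache[name] = (address, now + self.ttl)
        return address


upstream = FlakyUpstream({"inference.svc": "10.0.0.7"})
resolver = CachingResolver(upstream, ttl_seconds=30)

print(resolver.resolve("inference.svc", now=0))   # 10.0.0.7 (fetched)
upstream.healthy = False                          # upstream goes down at t=10
print(resolver.resolve("inference.svc", now=20))  # 10.0.0.7 (cache hides the outage)
try:
    resolver.resolve("inference.svc", now=31)     # cache expired, failure surfaces
except LookupError as err:
    print("lookup failed:", err)
```

The failure becomes visible only once cached entries expire, which is consistent with the post‑mortem's observation that DNS caching delayed the symptoms until the rollout was already fleet‑wide.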
Why Recovery Was Slow
Even after identifying the problematic telemetry service, removing it required kubectl delete, but the API Server itself was unresponsive, preventing the command from executing. Recovery steps included scaling down the cluster, blocking external API traffic, and temporarily expanding API Server resources to process pending requests.
Takeaways for iLogtail/LoongCollector Stability
Understand the impact of your service on core cluster components and minimize resource consumption.
Avoid circular dependencies between services.
Separate data‑plane and control‑plane responsibilities.
Conduct pressure testing at realistic cluster scales to ensure stability.
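One way to make the circular‑dependency takeaway checkable is to write down what each operation depends on and search the graph for cycles. The sketch below is a plain depth‑first search, and the graph is a hypothetical simplification of the incident: rolling back the telemetry service needed a working API Server, while the API Server could not recover until the telemetry load was removed.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of names, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in depends_on.get(node, ()):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path: close the loop
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in depends_on:
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical dependency graph, loosely modeled on the recovery problem.
incident = {
    "telemetry rollback": ["api server"],
    "api server": ["telemetry rollback"],
    "dns": ["api server"],
}
print(find_cycle(incident))
# ['telemetry rollback', 'api server', 'telemetry rollback']
```

A check like this only finds cycles that someone has written into the graph, so it complements rather than replaces load testing at realistic scale.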
References:
OpenAI incident report
Kubernetes source code
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
