System Architect Go
Dec 19, 2024 · Operations
Why Did OpenAI’s New Telemetry Crash Their Kubernetes Cluster?
On December 11, 2024 OpenAI’s Kubernetes cluster suffered a four‑hour outage after a newly deployed telemetry service generated massive API traffic from every node, overwhelming the kube‑apiserver, breaking DNS‑based service discovery, and exposing gaps in control‑plane monitoring and break‑glass mechanisms, prompting critical questions about component behavior and configuration.
API overloadDNSKubernetes
0 likes · 8 min read
