Why Did Our kube-apiserver OOM? A Deep Dive into Kubernetes Control-Plane Failures
This article details a real-world Kubernetes control-plane outage in which kube-apiserver was repeatedly OOM-killed. It walks through cluster metrics, logs, and heap and goroutine profiles; weighs root-cause hypotheses such as etcd latency and a DeleteCollection memory leak; and offers step-by-step troubleshooting and prevention guidance.
Cluster and environment information:
k8s v1.18.4
3 master nodes, each 8 CPU / 16 GB RAM, 50 GiB SSD
19 heterogeneous minion nodes
Control‑plane components (kube‑apiserver, etcd, kube‑controller‑manager, kube‑scheduler) deployed as static pods
VIP load‑balances traffic to the three kube‑apiserver front‑ends
Tencent Cloud SSD performance ~130 MB/s
Fault description
On the afternoon of 2021‑09‑10, kubectl intermittently hung and could not CRUD standard resources (Pod, Node, etc.). The issue was traced to some kube‑apiserver instances becoming unresponsive.
On‑site information
kube‑apiserver pod details (kube‑system namespace):
```
$ kubectl get pods -n kube-system kube-apiserver-x.x.x.x -o yaml
...
containerStatuses:
- containerID: docker://xxxxx
  ...
  lastState:
    terminated:
      containerID: docker://yyyy
      exitCode: 137
      finishedAt: "2021-09-10T09:29:02Z"
      reason: OOMKilled
      startedAt: "2020-12-09T07:02:23Z"
  name: kube-apiserver
  ready: true
  restartCount: 1
  started: true
  state:
    running:
      startedAt: "2021-09-10T09:29:08Z"
...
```
On 10 September, kube‑apiserver was OOM‑killed (exit code 137) and restarted.
Surrounding monitoring
IaaS layer black‑box monitoring (control‑plane hosts):
Effective information:
Memory, CPU, and disk-read metrics were positively correlated; after 10 September they dropped sharply and returned to normal.
Kube‑apiserver Prometheus monitoring:
Effective information:
kube‑apiserver had I/O problems: Prometheus failed to scrape its metrics for a period.
kube‑apiserver memory grew monotonically, and its internal workqueue ADD rate (adds per second) was very high.
Real‑time debug information:
Effective information:
Two of the three masters’ memory usage reached ~80‑90%.
kube‑apiserver processes consumed most of the memory.
One master’s CPU was saturated, with high I/O wait (wa).
Almost every process on the machines was heavily reading disks, making shells nearly unusable.
The one master with comparatively low memory usage (~8 Gi) was the one whose kube‑apiserver had previously been OOM‑killed.
Some questions and hypotheses
Why does kube‑apiserver consume a lot of memory?
Clients performing full‑list operations on core resources.
etcd being unable to serve causes kube‑controller‑manager and kube‑scheduler to lose their leader‑election leases and restart; each restart re‑runs ListAndWatch against kube‑apiserver, driving repeated full lists.
Potential memory leak in kube‑apiserver code.
Why is the etcd cluster malfunctioning?
Network jitter within the etcd cluster.
Disk performance degradation.
Insufficient CPU/RAM on etcd hosts, starving etcd of CPU time and causing request deadlines to expire.
Why do kube‑controller‑manager and kube‑scheduler read disks heavily?
They read local configuration files.
Under extreme memory pressure the OS evicts pages of large processes; when rescheduled they reload from disk, increasing I/O.
Some logs
kube‑apiserver related logs:
```
I0907 07:04:17.611412 1 trace.go:116] Trace[1140445702]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/... (total time: 976.1773ms):
Trace[1140445702]: [976.164659ms] About to write a response
I0907 07:04:17.611478 1 trace.go:116] Trace[630463685]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/... (total time: 983.823847ms):
Trace[630463685]: [983.812225ms] About to write a response
...
E0907 07:04:37.327057 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, context canceled]
W0907 07:10:39.496915 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://etcd0:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing context deadline exceeded". Reconnecting...
```
etcd operation latency increased dramatically, and the connection was eventually lost.
etcd logs (partial):
```
{"level":"warn","ts":"2021-09-10T17:14:50.559+0800","msg":"rejected connection","remote-addr":"10.0.0.42:49824","error":"read tcp 10.0.0.8:2380->10.0.0.42:49824: i/o timeout"}
{"level":"warn","ts":"2021-09-10T17:14:58.993+0800","msg":"rejected connection","remote-addr":"10.0.0.6:54656","error":"EOF"}
...
```
etcd’s communication with its peers (port 2380) timed out, preventing it from serving.
Deep investigation
A heap profile of kube‑apiserver shows massive memory consumption in `registry.(*Store).DeleteCollection`, which first lists all matching items and then deletes them concurrently, explaining the sudden memory spike. If `e.Delete` fails (as in our etcd error scenario), the worker goroutines exit early, but the dispatcher goroutine stays blocked sending to the `toProcess` channel; the listed items can then never be garbage‑collected, and memory grows until the OOM kill.
kube‑apiserver goroutine profile
```
goroutine 18970952966 [chan send, 429 minutes]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(...)
...
# ... many more ...
```
Most goroutines are blocked on channel send, indicating the dispatcher deadlock.
kube‑controller‑manager logs
```
E1027 15:15:01.016712 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-controller-manager: etcdserver: request timed out
I1027 15:15:01.950682 1 leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F1027 15:15:01.950760 1 controllermanager.go:279] leaderelection lost
```
The controller manager could not renew its lease because kube‑apiserver failed to communicate with etcd, so it logged a fatal error and exited.
kube‑apiserver DeleteCollection implementation
```go
// Abridged from k8s.io/apiserver (v1.18), registry/generic/registry/store.go.
func (e *Store) DeleteCollection(ctx context.Context, deleteValidation rest.ValidateObjectFunc, options *metav1.DeleteOptions, listOptions *metainternalversion.ListOptions) (runtime.Object, error) {
	listObj, err := e.List(ctx, listOptions)
	if err != nil {
		return nil, err
	}
	items, err := meta.ExtractList(listObj)
	if err != nil {
		return nil, err
	}
	wg := sync.WaitGroup{}
	toProcess := make(chan int, 2*workersNumber)
	errs := make(chan error, workersNumber+1)

	// Dispatcher goroutine: feeds every item index into toProcess.
	go func() {
		defer utilruntime.HandleCrash(...)
		for i := 0; i < len(items); i++ {
			toProcess <- i
		}
		close(toProcess)
	}()

	// Worker goroutines: on error they return WITHOUT draining toProcess.
	wg.Add(workersNumber)
	for i := 0; i < workersNumber; i++ {
		go func() {
			defer wg.Done()
			for index := range toProcess {
				accessor, err := meta.Accessor(items[index])
				if err != nil {
					errs <- err
					return
				}
				if _, _, err := e.Delete(ctx, accessor.GetName(), deleteValidation, options.DeepCopy()); err != nil && !apierrors.IsNotFound(err) {
					errs <- err
					return
				}
			}
		}()
	}
	wg.Wait()
	select {
	case err := <-errs:
		return nil, err
	default:
		return listObj, nil
	}
}
```
If e.Delete errors (as with etcd failures), every worker returns early, the dispatcher blocks on sending to toProcess, and the listed items can never be garbage-collected, leading to OOM.
Summary
Define a clear baseline for a healthy cluster (e.g., 100 nodes, 1400 pods, 50 ConfigMaps, 300 events, kube‑apiserver ~2 Gi memory, ~10 % single‑core CPU).
Detect anomalies via monitoring and logs, then pinpoint the failing component.
Correlate timestamps of abnormal CPU, RAM, and disk usage with component logs and profiles.
Form hypotheses, validate with heap and goroutine profiles, and examine source code.
Prevent control‑plane cascade failures by allocating sufficient resources to kube‑apiserver, isolating etcd clusters, and monitoring DeleteCollection activity.
Original article: k8s‑club/kube‑apiserver.md
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.