Why Did Our kube-apiserver OOM? A Deep Dive into Kubernetes Control‑Plane Failures
On September 10, 2021, a Kubernetes cluster experienced intermittent kubectl hangs caused by kube-apiserver OOM kills that cascaded into further control-plane failures. This article details the environment, the observed metrics, log analysis, a code inspection of DeleteCollection, and troubleshooting steps to prevent similar incidents.
Cluster and Environment Information
k8s v1.18.4
3 Master nodes, each 8 CPU / 16 GB RAM, 50 GiB SSD
19 Minion nodes with heterogeneous configurations
Control‑plane components (kube‑apiserver, etcd, kube‑controller‑manager, kube‑scheduler) deployed as static pods
VIP load‑balancing for the three kube‑apiserver front‑ends
Tencent Cloud SSD performance ~130 MB/s
Fault Description
On the afternoon of 2021‑09‑10 the kubectl command occasionally hung and could not CRUD standard resources (Pod, Node, etc.). The issue was traced to some kube‑apiserver instances becoming unresponsive.
On‑Site Information
kube‑apiserver pod details (excerpt):
$ kubectl get pods -n kube-system kube-apiserver-x.x.x.x -o yaml
...
containerStatuses:
- containerID: docker://xxxxx
  lastState:
    terminated:
      containerID: docker://yyyy
      exitCode: 137
      finishedAt: "2021-09-10T09:29:02Z"
      reason: OOMKilled
      startedAt: "2020-12-09T07:02:23Z"
  name: kube-apiserver
  ready: true
  restartCount: 1
  started: true
  state:
    running:
      startedAt: "2021-09-10T09:29:08Z"
...
On September 10, kube-apiserver was OOM-killed (exit code 137, reason OOMKilled).
Surrounding Monitoring
IaaS layer black‑box monitoring (control‑plane hosts):
Effective observations:
Memory, CPU and disk read metrics were positively correlated and dropped sharply on September 10, then returned to normal.
Kube‑apiserver Prometheus metrics:
Effective observations:
kube‑apiserver I/O became problematic; Prometheus failed to scrape metrics for a period.
kube-apiserver memory grew monotonically, and its workqueue ADD rate was very high.
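The two observations above can be checked with standard kube-apiserver metrics. A couple of illustrative PromQL queries (the `job` label is an assumption that depends on your scrape configuration):

```promql
# kube-apiserver resident memory; monotonic growth is the red flag here
process_resident_memory_bytes{job="apiserver"}

# rate of additions across kube-apiserver's internal workqueues
sum(rate(workqueue_adds_total{job="apiserver"}[5m])) by (name)
```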
Real‑time Debug Information
Effective observations:
Two Master nodes used 80‑90% of memory.
Large amounts of memory were consumed by kube‑apiserver processes.
One Master node showed both high memory and CPU usage, with a high I/O-wait (wa) share in the CPU statistics.
Almost every process on the machines was heavily reading disks, making shells nearly unusable.
The only Master with relatively low memory consumption (≈8 GiB) had previously been OOM‑killed.
Questions and Hypotheses
Why does kube‑apiserver consume so much memory?
Clients performing full‑list operations on core resources.
etcd failing to serve requests, so kube-apiserver could not support leader election for the other control-plane components, which then repeatedly re-ran ListAndWatch.
Potential memory leak in kube‑apiserver code.
Why is the etcd cluster malfunctioning?
Network jitter within the etcd cluster.
Disk performance degradation.
Resource starvation (CPU, RAM) on the etcd hosts, leaving etcd too few time slices, so its network I/O deadlines expire.
Why do kube‑controller‑manager and kube‑scheduler read disks heavily?
They read local configuration files.
Under extreme memory pressure the OS evicts clean page-cache pages, including the executable (text) pages of running binaries; when those processes are scheduled again, the pages must be re-read from disk, inflating I/O.
Relevant Logs
kube‑apiserver logs (excerpt):
I0907 07:04:17.611412 1 trace.go:116] Trace[1140445702]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/... (total time: 976.1773ms)
E0907 07:04:37.327057 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, context canceled]
W0907 07:10:39.496915 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://etcd0:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing context deadline exceeded". Reconnecting...
etcd operations became increasingly slow and eventually lost connectivity.
etcd logs (excerpt):
{"level":"warn","ts":"2021-09-10T17:14:50.559+0800","msg":"rejected connection","error":"read tcp 10.0.0.8:2380->10.0.0.42:49824: i/o timeout"}</code>
<code>{"level":"warn","ts":"2021-09-10T17:15:03.961+0800","msg":"rejected connection","error":"EOF"}etcd nodes also experienced connection timeouts and EOF errors.
Deep Investigation
A kube-apiserver heap profile revealed that registry.(*Store).DeleteCollection consumed massive amounts of memory. DeleteCollection performs a List followed by concurrent deletions, which can spike memory usage on large collections.
Potential goroutine leak scenario: if e.Delete fails (e.g., on an etcd error), the worker goroutines exit, but the task-distribution goroutine blocks forever on its send to the toProcess channel. The leaked goroutine keeps the listed items reachable, preventing garbage collection and eventually causing OOM.
kube‑apiserver goroutine‑profile (excerpt)
goroutine 18970952966 [chan send, 429 minutes]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(...)
--
goroutine 18971918521 [chan send, 394 minutes]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(...)
...
All of these goroutines were blocked on a channel send, confirming the leak.
kube‑controller‑manager logs (excerpt)
E1027 15:15:01.016712 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-controller-manager: etcdserver: request timed out
I1027 15:15:01.950682 1 leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F1027 15:15:01.950760 1 controllermanager.go:279] leaderelection lost
The leader-election failures were directly caused by kube-apiserver's inability to communicate with etcd.
DeleteCollection Implementation (excerpt)
func (e *Store) DeleteCollection(ctx context.Context, deleteValidation rest.ValidateObjectFunc, options *metav1.DeleteOptions, listOptions *metainternalversion.ListOptions) (runtime.Object, error) {
	// List every matching object first; for large collections this
	// alone allocates a lot of memory.
	listObj, err := e.List(ctx, listOptions)
	if err != nil {
		return nil, err
	}
	items, err := meta.ExtractList(listObj)
	// ... spawn worker goroutines, distribute item indices via the
	// toProcess channel, and call e.Delete for each item ...
	wg.Wait()
	select {
	case err := <-errs:
		return nil, err
	default:
		return listObj, nil
	}
}
If e.Delete encounters an etcd error, the workers exit, but the distributor goroutine remains blocked on its send to toProcess; that leaked goroutine keeps items reachable, preventing garbage collection and leading to memory exhaustion.
Summary
Before troubleshooting, define the baseline of a healthy control‑plane (e.g., 100 Node, 1400 Pod, 50 ConfigMap, 300 Event; kube‑apiserver typically uses ~2 GiB RAM and ~10 % single‑core CPU).
Investigation steps:
Detect abnormal behavior.
Identify the failing component and gather its information.
Correlate timestamps in monitoring data to extract CPU, RAM, and disk usage.
Form hypotheses about root causes.
Validate hypotheses with component logs and profiles.
Prevent control‑plane chain failures:
Explicitly limit kube‑apiserver CPU and memory resources to avoid starving etcd.
Deploy the etcd cluster separately from other control‑plane components.
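For example, a static-pod manifest for kube-apiserver can carry explicit requests and limits. The excerpt below is a hypothetical illustration; the numbers are not recommendations and should be tuned to your cluster's real baseline:

```yaml
spec:
  containers:
  - name: kube-apiserver
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "4"
        memory: 8Gi
```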
Original article: https://github.com/k8s-club/k8s-club/blob/main/articles/抓虫日志‑kube-apiserver.md
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.