Why Did Our kube-apiserver OOM? A Deep Dive into Kubernetes Control‑Plane Failures
On September 10, 2021, a Kubernetes cluster experienced intermittent kubectl hangs caused by kube-apiserver OOM kills that cascaded into further control-plane failures. This article details the environment, the observed metrics, log analysis, a code inspection of DeleteCollection, and troubleshooting steps to prevent similar incidents.
Cluster and Environment Information
k8s v1.18.4
3 Master nodes, each 8 CPU / 16 GB RAM, 50 GiB SSD
19 Minion nodes with heterogeneous configurations
Control‑plane components (kube‑apiserver, etcd, kube‑controller‑manager, kube‑scheduler) deployed as static pods
VIP load‑balancing for the three kube‑apiserver front‑ends
Tencent Cloud SSD performance ~130 MB/s
Fault Description
On the afternoon of 2021‑09‑10 the kubectl command occasionally hung and could not CRUD standard resources (Pod, Node, etc.). The issue was traced to some kube‑apiserver instances becoming unresponsive.
On‑Site Information
kube‑apiserver pod details (excerpt):
$ kubectl get pods -n kube-system kube-apiserver-x.x.x.x -o yaml
...
containerStatuses:
- containerID: docker://xxxxx
  lastState:
    terminated:
      containerID: docker://yyyy
      exitCode: 137
      finishedAt: "2021-09-10T09:29:02Z"
      reason: OOMKilled
      startedAt: "2020-12-09T07:02:23Z"
  name: kube-apiserver
  ready: true
  restartCount: 1
  started: true
  state:
    running:
      startedAt: "2021-09-10T09:29:08Z"
...
On September 10, kube-apiserver was OOM-killed (exit code 137, reason OOMKilled).
Surrounding Monitoring
IaaS layer black‑box monitoring (control‑plane hosts):
Effective observations:
Memory, CPU and disk read metrics were positively correlated and dropped sharply on September 10, then returned to normal.
Kube‑apiserver Prometheus metrics:
Effective observations:
kube‑apiserver I/O became problematic; Prometheus failed to scrape metrics for a period.
kube-apiserver memory grew monotonically, and its workqueue ADD rate was very high.
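The two observations above can be checked with standard kube-apiserver metrics. A couple of illustrative PromQL queries (the `job` label is an assumption that depends on your scrape configuration):

```promql
# kube-apiserver resident memory; monotonic growth is the red flag here
process_resident_memory_bytes{job="apiserver"}

# rate of additions across kube-apiserver's internal workqueues
sum(rate(workqueue_adds_total{job="apiserver"}[5m])) by (name)
```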
Real‑time Debug Information
Effective observations:
Two Master nodes used 80‑90% of memory.
Large amounts of memory were consumed by kube‑apiserver processes.
One Master node showed both high memory and CPU usage, with a high I/O-wait (wa) share in the CPU statistics.
Almost every process on the machines was heavily reading disks, making shells nearly unusable.
The only Master with relatively low memory consumption (≈8 GiB) had previously been OOM‑killed.
Questions and Hypotheses
Why does kube‑apiserver consume so much memory?
Clients performing full‑list operations on core resources.
etcd failing to serve requests, so kube-apiserver could not support leader election for the other control-plane components, which then repeatedly re-ran ListAndWatch.
Potential memory leak in kube‑apiserver code.
Why is the etcd cluster malfunctioning?
Network jitter within the etcd cluster.
Disk performance degradation.
Resource starvation (CPU, RAM) on the etcd hosts, leaving etcd too few time slices, so its network I/O deadlines expire.
Why do kube‑controller‑manager and kube‑scheduler read disks heavily?
They read local configuration files.
Under extreme memory pressure the OS evicts clean page-cache pages, including the executable (text) pages of running binaries; when those processes are scheduled again, the pages must be re-read from disk, inflating I/O.
Relevant Logs
kube‑apiserver logs (excerpt):
I0907 07:04:17.611412 1 trace.go:116] Trace[1140445702]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/... (total time: 976.1773ms)
E0907 07:04:37.327057 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, context canceled]
W0907 07:10:39.496915 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://etcd0:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing context deadline exceeded". Reconnecting...
etcd operations became increasingly slow and eventually lost connectivity.
etcd logs (excerpt):
{"level":"warn","ts":"2021-09-10T17:14:50.559+0800","msg":"rejected connection","error":"read tcp 10.0.0.8:2380->10.0.0.42:49824: i/o timeout"}</code>
<code>{"level":"warn","ts":"2021-09-10T17:15:03.961+0800","msg":"rejected connection","error":"EOF"}etcd nodes also experienced connection timeouts and EOF errors.
Deep Investigation
A kube-apiserver heap profile revealed that registry.(*Store).DeleteCollection consumed massive amounts of memory. DeleteCollection performs a List followed by concurrent deletions, which can spike memory usage on large collections.
Potential goroutine leak scenario: if e.Delete fails (e.g., on an etcd error), the worker goroutines exit, but the task-distribution goroutine blocks forever on its send to the toProcess channel. The leaked goroutine keeps the listed items reachable, preventing garbage collection and eventually causing OOM.
kube‑apiserver goroutine‑profile (excerpt)
goroutine 18970952966 [chan send, 429 minutes]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(...)
--
goroutine 18971918521 [chan send, 394 minutes]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(...)
...
All of these goroutines were blocked on a channel send, confirming the leak.
kube‑controller‑manager logs (excerpt)
E1027 15:15:01.016712 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-controller-manager: etcdserver: request timed out
I1027 15:15:01.950682 1 leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F1027 15:15:01.950760 1 controllermanager.go:279] leaderelection lost
The leader-election failures were directly caused by kube-apiserver's inability to communicate with etcd.
DeleteCollection Implementation (excerpt)
func (e *Store) DeleteCollection(ctx context.Context, deleteValidation rest.ValidateObjectFunc, options *metav1.DeleteOptions, listOptions *metainternalversion.ListOptions) (runtime.Object, error) {
	// List every matching object first; for large collections this
	// alone allocates a lot of memory.
	listObj, err := e.List(ctx, listOptions)
	if err != nil {
		return nil, err
	}
	items, err := meta.ExtractList(listObj)
	// ... spawn worker goroutines, distribute item indices via the
	// toProcess channel, and call e.Delete for each item ...
	wg.Wait()
	select {
	case err := <-errs:
		return nil, err
	default:
		return listObj, nil
	}
}
If e.Delete encounters an etcd error, the workers exit, but the distributor goroutine remains blocked on its send to toProcess; that leaked goroutine keeps items reachable, preventing garbage collection and leading to memory exhaustion.
Summary
Before troubleshooting, define the baseline of a healthy control‑plane (e.g., 100 Node, 1400 Pod, 50 ConfigMap, 300 Event; kube‑apiserver typically uses ~2 GiB RAM and ~10 % single‑core CPU).
Investigation steps:
Detect abnormal behavior.
Identify the failing component and gather its information.
Correlate timestamps in monitoring data to extract CPU, RAM, and disk usage.
Form hypotheses about root causes.
Validate hypotheses with component logs and profiles.
Prevent control‑plane chain failures:
Explicitly limit kube‑apiserver CPU and memory resources to avoid starving etcd.
Deploy the etcd cluster separately from other control‑plane components.
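For example, a static-pod manifest for kube-apiserver can carry explicit requests and limits. The excerpt below is a hypothetical illustration; the numbers are not recommendations and should be tuned to your cluster's real baseline:

```yaml
spec:
  containers:
  - name: kube-apiserver
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "4"
        memory: 8Gi
```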
Original article: https://github.com/k8s-club/k8s-club/blob/main/articles/抓虫日志‑kube-apiserver.md
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.