How to Fix Kubernetes Memory Leaks and Expired Certificates: Step‑by‑Step Guide

This article walks through diagnosing and resolving two common Kubernetes issues—node memory leaks caused by kmem accounting and expired cluster certificates—by showing how to detect the problems, rebuild runc and kubelet, and extend certificate validity using kubeadm and manifest edits.

dbaplus Community

Apr 22, 2024

Problem 1: Kubernetes Memory Leak

When a Kubernetes cluster has been running for a long time, some nodes may stop creating new Pods and report errors such as

applying cgroup … caused: mkdir … no space left on device

or "cannot allocate memory". These errors point to a memory leak in the cgroup subsystem.

To check whether a node is affected, read /sys/fs/cgroup/memory/kubepods/memory.kmem.slabinfo. If the read fails with an I/O error, kmem accounting is off and the node is not affected; if it returns slabinfo entries, kmem accounting is on and the node is subject to the leak.
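The check can be wrapped in a small helper so it is easy to run on every node. This is a minimal sketch rather than part of the original procedure: the function name kmem_status is hypothetical, and it treats any failed read of the file (such as the I/O error) as the unaffected case.

```shell
#!/usr/bin/env bash
# kmem_status [SLABINFO_PATH]   (hypothetical helper, not an existing tool)
# Prints one of: missing | enabled | disabled
kmem_status() {
  local slabinfo="${1:-/sys/fs/cgroup/memory/kubepods/memory.kmem.slabinfo}"
  if [ ! -e "$slabinfo" ]; then
    echo "missing"     # the kubepods cgroup path does not exist on this host
  elif cat "$slabinfo" >/dev/null 2>&1; then
    echo "enabled"     # file is readable: slabinfo entries are present
  else
    echo "disabled"    # read failed, for example with an Input/output error
  fi
}
```

Calling kmem_status with no argument checks the default kubepods path; a result of "enabled" marks the node as a candidate for the fix described next.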

Solution Overview

The leak originates from kmem accounting: the cgroup entries it creates are not reclaimed after the cgroups are deleted, so the node eventually exhausts the limit of 65535 entries. The fix is to rebuild runc and kubelet with kmem accounting disabled.
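To get a rough sense of how close a node is to that limit, the per-subsystem cgroup counter in /proc/cgroups can be compared against 65535. This is a minimal sketch: memcg_count and memcg_headroom are hypothetical helper names, and whether the counter includes entries that were deleted but never reclaimed depends on the kernel version, so the result is indicative only.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. /proc/cgroups columns are:
#   subsys_name  hierarchy  num_cgroups  enabled
MEMCG_LIMIT=65535   # limit quoted in the text above

# Print num_cgroups for the memory subsystem.
memcg_count() {
  awk '$1 == "memory" { print $3 }' "${1:-/proc/cgroups}"
}

# Print how many memory cgroups can still be created before the limit.
memcg_headroom() {
  local used
  used=$(memcg_count "$1")
  used=${used:-0}   # no memory line found: treat as zero
  echo $(( MEMCG_LIMIT - used ))
}
```

A headroom value that keeps shrinking while the number of running Pods stays flat would be consistent with the leak described here.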

Step‑by‑Step Implementation

Prepare Go environment

wget https://dl.google.com/go/go1.12.9.linux-amd64.tar.gz
tar xf go1.12.9.linux-amd64.tar.gz -C /usr/local/
# Append the following exports to ~/.bashrc, then reload it
export GOPATH="/data/Documents"
export GOROOT="/usr/local/go"
export PATH="$GOROOT/bin:$GOPATH/bin:$PATH"
export GO111MODULE=off
source ~/.bashrc
go env

Clone and compile runc

mkdir -p /data/Documents/src/github.com/opencontainers/
cd /data/Documents/src/github.com/opencontainers/
git clone https://github.com/opencontainers/runc
cd runc
git checkout v1.0.0-rc9
sudo yum install libseccomp-devel
make BUILDTAGS='seccomp nokmem'
# The resulting binary is the new runc executable

Clone and compile kubelet

mkdir -p /root/k8s/
cd /root/k8s/
git clone https://github.com/kubernetes/kubernetes
cd kubernetes
git checkout v1.15.3
# Build a Docker image with Go environment
cat > Dockerfile <<'EOF'
FROM centos:7.3.1611
ENV GOROOT /usr/local/go
ENV GOPATH /usr/local/gopath
ENV PATH /usr/local/go/bin:$PATH
RUN yum install -y rpm-build which where rsync gcc gcc-c++ automake autoconf libtool make \
    && curl -L https://studygolang.com/dl/golang/go1.12.9.linux-amd64.tar.gz | tar zxvf - -C /usr/local
EOF
docker build -t build-k8s:centos-7.3-go-1.12.9-k8s-1.15.3 .
# Compile inside container
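# -w sets the working directory so that make runs inside the mounted source tree
docker run -it --rm -v /root/k8s/kubernetes:/usr/local/gopath/src/k8s.io/kubernetes -w /usr/local/gopath/src/k8s.io/kubernetes build-k8s:centos-7.3-go-1.12.9-k8s-1.15.3 bash -c "GO111MODULE=off KUBE_GIT_TREE_STATE=clean KUBE_GIT_VERSION=v1.15.3 make kubelet GOFLAGS='-tags=nokmem'"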

Replace binaries

# Stop services before touching the binaries
systemctl stop docker
systemctl stop kubelet
# Back up existing binaries
mv /usr/bin/kubelet /home/kubelet
mv /usr/bin/docker-runc /home/docker-runc
# Copy new binaries
cp kubelet /usr/bin/kubelet
cp kubelet /usr/local/bin/kubelet
cp runc /usr/bin/docker-runc
# Start services again so the new binaries take effect
systemctl start docker
systemctl start kubelet
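The backup-and-copy sequence can be made repeatable with a small function that keeps a copy of the old binary before installing the new one. This is a sketch under assumptions: replace_binary is a hypothetical name, and the backup directory is supplied by the caller rather than fixed by the original steps.

```shell
#!/usr/bin/env bash
# replace_binary NEW TARGET BACKUP_DIR   (hypothetical helper)
# Copies TARGET into BACKUP_DIR, then installs NEW over TARGET.
replace_binary() {
  local new="$1" target="$2" backup_dir="$3"
  [ -f "$new" ] || { echo "new binary not found: $new" >&2; return 1; }
  mkdir -p "$backup_dir" || return 1
  if [ -f "$target" ]; then
    # Keep the old binary so the change can be rolled back.
    cp -p "$target" "$backup_dir/$(basename "$target").bak" || return 1
  fi
  cp "$new" "$target" && chmod 755 "$target"
}
```

With docker and kubelet stopped, a call such as replace_binary runc /usr/bin/docker-runc /home would mirror the runc step, leaving the previous binary at /home/docker-runc.bak.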

Verify the fix

cat /sys/fs/cgroup/memory/kubepods/burstable/memory.kmem.usage_in_bytes   # should be 0
cat /sys/fs/cgroup/memory/kubepods/memory.kmem.slabinfo                   # should now fail with an I/O error
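To check every Pod cgroup rather than a single path, the kmem usage counters under kubepods can be scanned for non-zero values. This is a minimal sketch: kmem_nonzero is a hypothetical helper name, and its default root is taken from the paths used in the commands here.

```shell
#!/usr/bin/env bash
# kmem_nonzero [ROOT]   (hypothetical helper)
# Lists memory.kmem.usage_in_bytes files under ROOT whose value is not 0.
kmem_nonzero() {
  local root="${1:-/sys/fs/cgroup/memory/kubepods}" f val
  find "$root" -name memory.kmem.usage_in_bytes 2>/dev/null |
  while read -r f; do
    val=$(cat "$f" 2>/dev/null) || continue   # cgroup vanished mid-scan
    if [ "${val:-0}" != "0" ]; then
      echo "$f $val"
    fi
  done
  return 0
}
```

Empty output suggests that no cgroup under the root is charging kernel memory; any line printed names a cgroup that still has kmem accounting active.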

Problem 2: Kubernetes Certificate Expiration

In a long-running cluster, the API server may become unreachable with the error:

Unable to connect to the server: x509: certificate has expired or is not yet valid

The cause is expired control-plane certificates.

Detection

kubeadm alpha certs check-expiration

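The output lists each certificate with its expiry date and remaining lifetime; entries whose date has passed confirm the expiration. On kubeadm 1.20 and later, the same subcommands are available as kubeadm certs rather than kubeadm alpha certs.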
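Certificates can also be inspected directly with openssl, which helps when kubeadm itself is unavailable on the host. This is a minimal sketch: check_certs is a hypothetical helper name, and the default directory /etc/kubernetes/pki is assumed to be where this cluster keeps its certificates.

```shell
#!/usr/bin/env bash
# check_certs [DIR] [SECONDS]   (hypothetical helper)
# Flags every *.crt in DIR that is expired or expires within SECONDS.
check_certs() {
  local dir="${1:-/etc/kubernetes/pki}" within="${2:-0}" crt
  for crt in "$dir"/*.crt; do
    [ -e "$crt" ] || continue   # glob matched nothing
    # -checkend N exits 0 only if the certificate is still valid N seconds from now
    if openssl x509 -checkend "$within" -noout -in "$crt" >/dev/null 2>&1; then
      echo "OK $crt"
    else
      echo "EXPIRING $crt"
    fi
  done
}
```

Running check_certs /etc/kubernetes/pki 2592000 would flag anything expiring within 30 days. The function only looks at the top level of the directory, so a subdirectory of certificates has to be passed separately.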

Renewal Using kubeadm

# Renew all certificates
kubeadm alpha certs renew all --config=kubeadm.yaml
systemctl restart kubelet
# Regenerate kubeconfig files
kubeadm init phase kubeconfig all --config kubeadm.yaml

After renewal, replace the static Pod manifests for the control-plane components and restart those components so that they load the renewed certificates.

Extending Certificate Validity to 10 Years

To avoid frequent renewals, lengthen the validity of certificates signed by the controller-manager by adding the experimental signing-duration flag to its manifest:

spec:
  containers:
  - command:
    - kube-controller-manager
    - --experimental-cluster-signing-duration=87600h   # 10 years
    - --client-ca-file=/etc/kubernetes/pki/ca.crt

The kubelet watches the static Pod manifest directory, so the controller-manager restarts automatically once the file is saved and picks up the new duration.
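As a quick sanity check, the configured duration can be read back from the manifest and converted to years (87600 h / 24 / 365 = 10). This is a minimal sketch: signing_duration_hours and hours_to_years are hypothetical helper names, and the default path assumes the controller-manager manifest sits alongside the other manifests under /etc/kubernetes/manifests.

```shell
#!/usr/bin/env bash
# signing_duration_hours [MANIFEST]   (hypothetical helper)
# Prints the hour value of --experimental-cluster-signing-duration, if set.
signing_duration_hours() {
  local manifest="${1:-/etc/kubernetes/manifests/kube-controller-manager.yaml}"
  sed -n 's/.*--experimental-cluster-signing-duration=\([0-9][0-9]*\)h.*/\1/p' "$manifest"
}

# Convert hours to whole 365-day years.
hours_to_years() {
  echo $(( $1 / 24 / 365 ))
}
```

Empty output from signing_duration_hours means the flag is not present in the manifest, in which case the controller-manager is still using its default duration.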

Updating Dependent Components

Because etcd and the API server must trust the default cluster CA (/etc/kubernetes/pki/ca.crt) for the new certificates, update their manifests to reference it.

# Update etcd manifest
cp -r /etc/kubernetes/manifests/ /etc/kubernetes/manifests.bak
vi /etc/kubernetes/manifests/etcd.yaml
# Change to use the default CA
- --peer-trusted-ca-file=/etc/kubernetes/pki/ca.crt
- --trusted-ca-file=/etc/kubernetes/pki/ca.crt
# Mount the CA directory into etcd
- mountPath: /etc/kubernetes/pki
  name: etcd-certs

# Update kube‑apiserver manifest
vi /etc/kubernetes/manifests/kube-apiserver.yaml
- --etcd-cafile=/etc/kubernetes/pki/ca.crt

Replace the front‑proxy CA files so that aggregated APIs (e.g., metrics‑server) continue to work:

cp /etc/kubernetes/pki/ca.crt /etc/kubernetes/pki/front-proxy-ca.crt
cp /etc/kubernetes/pki/ca.key /etc/kubernetes/pki/front-proxy-ca.key

After these changes, the cluster runs with a ten‑year certificate validity.
