
How to Diagnose and Fix Node2 cgroup Leak in a Kubernetes Cluster with Ceph

This article walks through diagnosing a Kubernetes node2 cgroup leak caused by Ceph storage inconsistencies, detailing step‑by‑step investigations, Ceph and pod repairs, node maintenance commands, and reflections on preventing similar issues in future clusters.

MaGe Linux Operations

Background

After receiving an alert from the test environment, we logged into the Kubernetes cluster to investigate.

Fault Location

Pod Inspection

Abnormal Calico pods were observed on node2 in the kube-system namespace.

A detailed view of the pods showed that node2 had run out of storage space and had a cgroup leak.
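A cgroup leak usually surfaces in pod events or kubelet logs as a `mkdir ... cannot allocate memory` error while creating the pod's memory cgroup, even though disk space looks fine. As a minimal sketch (the event text below is a fabricated sample, not output from this cluster), the signature can be detected like this:

```shell
# Hypothetical sample of a pod event during a memory cgroup leak;
# exact wording varies by kubelet/runtime version.
event='Failed to create pod sandbox: mkdir /sys/fs/cgroup/memory/kubepods/pod-example: cannot allocate memory'

# "cannot allocate memory" while creating a memory cgroup is the telltale
# sign that the kernel has exhausted cgroup ids, not disk space.
if printf '%s' "$event" | grep -q 'cannot allocate memory'; then
  result='cgroup leak suspected'
else
  result='no cgroup leak signature'
fi
echo "$result"
```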

Node2's local storage was checked and appeared to have ample free space; however, since the cluster uses Ceph for distributed storage, the Ceph cluster status was examined next.
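Checking the Ceph cluster starts with `ceph -s`. As a sketch, the snippet below parses a captured sample of `ceph -s` output (the sample text, including the cluster id, is illustrative rather than taken from this cluster) to extract the health state:

```shell
# Illustrative sample of `ceph -s` output on a cluster with scrub errors;
# on a live cluster you would capture it with:  ceph_status=$(ceph -s)
ceph_status='  cluster:
    id:     0f1e2d3c-example
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent'

# Anything other than HEALTH_OK means the storage layer needs attention
# before touching the pods that depend on it.
health=$(printf '%s\n' "$ceph_status" | awk '/health:/{print $2}')
echo "cluster health: $health"
```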

Operations

Ceph Repair

The Ceph cluster showed anomalies that could cause the node2 cgroup leak, so a manual Ceph repair was performed.

Data inconsistency (incorrect object size or missing objects after recovery) leads to scrub errors.

During scrubbing, Ceph may find object size metadata that does not match the stored data, causing the scrub to fail.

ceph pg repair 1.7c

After repair, the Ceph cluster recovered.
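The repair step above can be sketched end to end: locate the inconsistent placement group in `ceph health detail`, then issue `ceph pg repair` for it. The health output below is a fabricated sample in the usual format; on a live cluster you would capture it from the command itself.

```shell
# Illustrative `ceph health detail` output; in practice:
#   health_detail=$(ceph health detail)
health_detail='HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 1.7c is active+clean+inconsistent, acting [2,0,1]'

# Extract the PG id from the "pg <id> is ...inconsistent" line.
pg_id=$(printf '%s\n' "$health_detail" | awk '/^pg / && /inconsistent/{print $2}')

# Trigger a repair of that placement group (printed here rather than run).
echo "ceph pg repair $pg_id"
```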

Pod Repair

The abnormal pods were deleted, and their controllers automatically recreated fresh pods.
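Deleting the abnormal pods can be scripted. The pod listing below is an illustrative sample of `kubectl -n kube-system get pods` output (the pod names are made up); the filter keeps only pods whose STATUS is not Running so they can be deleted and recreated by their controllers:

```shell
# Illustrative sample; in practice:
#   pods=$(kubectl -n kube-system get pods)
pods='NAME                READY   STATUS             RESTARTS   AGE
calico-node-x2p9k   0/1     CrashLoopBackOff   12         3d
coredns-5644d7b6d9  1/1     Running            0          3d'

# Skip the header row; column 3 is STATUS. Collect non-Running pods.
bad=$(printf '%s\n' "$pods" | awk 'NR>1 && $3!="Running"{print $1}')
echo "$bad"

# Each name would then be deleted, e.g.:
#   kubectl -n kube-system delete pod "$name"
# and its controller (a DaemonSet for calico-node) recreates it.
```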

Further analysis suggested the Ceph issue had triggered the node2 cgroup leak, prompting an investigation of the host kernel.

Similar issues have been reported in GitHub issue #313, typically on hosts running an old Linux kernel (e.g., 3.10.0‑862.el7.x86_64):

The kubelet host's kernel version is too low.

The leak can be mitigated by disabling kmem accounting.

Checking this node confirmed that its kernel version was indeed low.
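Confirming that the kernel is affected is a simple version check; `sort -V` compares version strings. The 4.0.0 threshold below is an illustrative cutoff (the kmem leak is widely reported on 3.10-series CentOS 7 kernels), not an exact boundary:

```shell
# In practice: running=$(uname -r)
running='3.10.0-862.el7.x86_64'
threshold='4.0.0'

# sort -V orders version strings; if the running kernel sorts first,
# it is older than the threshold and likely affected by the kmem leak.
oldest=$(printf '%s\n%s\n' "$running" "$threshold" | sort -V | head -n1)
if [ "$oldest" = "$running" ]; then
  verdict='kernel too low: upgrade or disable kmem accounting'
else
  verdict='kernel version OK'
fi
echo "$verdict"
```

On newer RHEL/CentOS 7 kernels, kmem accounting can reportedly also be disabled with the `cgroup.memory=nokmem` kernel boot parameter; treat that as an assumption to verify against your distribution's documentation.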

Node2 Maintenance

Mark node2 as unschedulable:

kubectl cordon node02

Drain pods from node2:

kubectl drain node02 --delete-local-data --ignore-daemonsets --force

--delete-local-data deletes the pods' local data, including emptyDir volumes. --ignore-daemonsets proceeds with the drain even though DaemonSet-managed pods cannot be evicted (their controller would immediately recreate them on the node). --force evicts pods that are not managed by a controller such as a ReplicaSet, Job, DaemonSet, or StatefulSet.

All pods were successfully evicted.

Reboot node02

After reboot, node02 recovered and was marked schedulable again.

kubectl uncordon node02
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Kubernetes, Ceph, Cluster Troubleshooting, cgroup leak, Node Maintenance
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
