How to Diagnose and Fix Node2 cgroup Leak in a Kubernetes Cluster with Ceph
This article walks through diagnosing a Kubernetes node2 cgroup leak caused by Ceph storage inconsistencies, detailing step‑by‑step investigations, Ceph and pod repairs, node maintenance commands, and reflections on preventing similar issues in future clusters.
Background
We received an alert from the test-environment cluster and logged in to the Kubernetes cluster to investigate.
Fault Location
Pod Inspection
The Calico pods running on node2 in the kube-system namespace were in an abnormal state.
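A minimal sketch of this inspection step, assuming the node is registered as node02 (the hostname used in the commands later in this article). The function names are hypothetical helpers; the kubectl subcommands and flags are standard.

```shell
# List the kube-system pods scheduled on one node, with their status and IPs.
# The node name defaults to node02 (an assumption taken from later commands).
pods_on_node() {
  local node="${1:-node02}"
  kubectl get pods -n kube-system -o wide \
    --field-selector "spec.nodeName=${node}"
}

# Show the details and recent events of one kube-system pod.
# The pod name is supplied by the caller.
describe_pod() {
  kubectl describe pod -n kube-system "$1"
}
```

Calling `pods_on_node node02` and then `describe_pod` with one of the listed pod names reproduces the two views used here: the overview first, then the events of a single failing pod.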
The pod details showed that node2 reported no remaining storage space, together with a cgroup leak.
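One way to look for a cgroup leak on the node itself is to read the memory controller's cgroup count from /proc/cgroups. This is a sketch, not the check the author ran: the helper name is hypothetical, and it assumes the usual column layout (subsys_name, hierarchy, num_cgroups, enabled).

```shell
# Print the number of memory cgroups the kernel currently tracks.
# Reads /proc/cgroups by default; another file can be passed for testing.
memory_cgroup_count() {
  awk '$1 == "memory" { print $3 }' "${1:-/proc/cgroups}"
}
```

A count that keeps growing while the number of pods on the node stays roughly constant may point to leaked memory cgroups, although the number alone is not proof.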
We checked the storage on node2, and the disks still had ample free space. Because the cluster's distributed storage is Ceph, we examined the Ceph cluster status next.
Operations
Ceph Repair
The Ceph cluster showed anomalies that could cause the node2 cgroup leak, so a manual Ceph repair was performed.
Data inconsistency (incorrect object size or missing objects after recovery) leads to scrub errors.
Ceph can record mismatched object size information while storing data, which then causes scrubbing to fail.
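Before repairing, the affected placement group has to be identified. The helper below is a hypothetical sketch that extracts inconsistent PG IDs from the text of `ceph health detail`; it assumes lines of the form `pg <id> is <state>, acting [...]`, which may differ between Ceph releases.

```shell
# Print the IDs of placement groups reported as inconsistent.
# Reads `ceph health detail` output on stdin. Assumed line format:
#   pg 1.7c is active+clean+inconsistent, acting [2,0,1]
inconsistent_pgs() {
  awk '$1 == "pg" && $3 == "is" && $4 ~ /inconsistent/ { print $2 }'
}

# Typical use on a host with Ceph admin credentials:
#   ceph health detail | inconsistent_pgs
#   rados list-inconsistent-obj 1.7c --format=json-pretty
```

`rados list-inconsistent-obj` shows which objects in the PG disagree and why, which is worth reviewing before a repair is started.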
ceph pg repair 1.7c
After the repair, the Ceph cluster recovered.
Pod Repair
Abnormal pods were deleted; the controller automatically recreated the latest pods.
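A sketch of this step, assuming the pods are managed by a controller that recreates them. The function name is a hypothetical helper, and the pod name in the usage line is a placeholder.

```shell
# Delete one pod and then list the namespace's pods so the replacement
# created by the controller can be checked.
recreate_pod() {
  local ns="$1" pod="$2"
  kubectl delete pod -n "$ns" "$pod" &&
    kubectl get pods -n "$ns" -o wide
}

# Example (placeholder pod name):
#   recreate_pod kube-system calico-node-xxxxx
```

Deleting pods one at a time makes it easier to confirm that each replacement reaches Running before the next one is touched.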
Further analysis suggested the Ceph issue caused the node2 cgroup leak, prompting a kernel recompilation.
A similar problem is described in GitHub issue #313, where the usual cause is an old Linux kernel (e.g., 3.10.0-862.el7.x86_64) on the host.
Cause: the kernel on the kubelet host is too old.
Mitigation: disable kmem (kernel memory) accounting.
Checking node2 confirmed that its kernel version was indeed old.
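The kernel check can be scripted roughly as follows. The helper names are hypothetical, the 4.0 threshold is an illustrative assumption rather than a value from the article, and `sort -V` requires GNU coreutils. The last helper looks for the `cgroup.memory=nokmem` boot parameter, which disables kmem accounting only on kernels that support it; very old kernels may not.

```shell
# Succeed if version $1 sorts strictly before version $2 (GNU `sort -V`).
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Succeed if the running (or given) kernel is older than the threshold.
# The default threshold of 4.0 is an assumption chosen for illustration.
kernel_is_old() {
  local current="${1:-$(uname -r)}" threshold="${2:-4.0}"
  version_lt "$current" "$threshold"
}

# Succeed if the kernel command line contains cgroup.memory=nokmem.
# Reads /proc/cmdline by default; another file can be passed for testing.
nokmem_enabled() {
  grep -qwF 'cgroup.memory=nokmem' "${1:-/proc/cmdline}"
}
```

On the affected node, `kernel_is_old && ! nokmem_enabled` returning success would match the situation described above: an old kernel with kmem accounting still active.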
Node2 Maintenance
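Mark node2 as unschedulable:
kubectl cordon node02
Drain the pods from node2:
kubectl drain node02 --delete-local-data --ignore-daemonsets --force
--delete-local-data lets the drain evict pods that use local data such as emptyDir volumes; that data is deleted. Newer kubectl releases name this flag --delete-emptydir-data.
--ignore-daemonsets skips DaemonSet-managed pods, which would otherwise block the drain.
--force also removes pods that are not managed by a controller (ReplicationController, ReplicaSet, Job, DaemonSet, or StatefulSet).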
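The cordon and drain steps can be wrapped in one function so the drain only starts if the cordon succeeded. This is a sketch: the function name is hypothetical, and it defaults to `--delete-emptydir-data`, the current name of the flag; on older kubectl versions pass `--delete-local-data` as the second argument.

```shell
# Cordon a node, then drain it. The second argument selects the emptyDir
# flag, because its name differs between kubectl versions.
drain_node() {
  local node="$1" emptydir_flag="${2:---delete-emptydir-data}"
  kubectl cordon "$node" &&
    kubectl drain "$node" --ignore-daemonsets "$emptydir_flag" --force
}

# Example matching the commands in this article:
#   drain_node node02 --delete-local-data
```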
All pods were successfully evicted.
Reboot node02
After reboot, node02 recovered and was marked schedulable again.
kubectl uncordon node02
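Before a rebooted node is made schedulable again, it is worth confirming that it reports Ready. The helper below is a hypothetical sketch that reads the node's Ready condition through a kubectl JSONPath query.

```shell
# Succeed if the node's Ready condition has status "True".
node_is_ready() {
  local status
  status="$(kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  [ "$status" = "True" ]
}

# Example: uncordon only once the node is Ready again.
#   node_is_ready node02 && kubectl uncordon node02
# To block until the node is Ready instead of polling:
#   kubectl wait --for=condition=Ready node/node02 --timeout=300s
```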
MaGe Linux Operations
