How to Diagnose and Fix Ceph‑Related cgroup Leaks on node2 in a Kubernetes Cluster
This article walks through a real‑world Kubernetes incident where a node ran out of space due to Ceph storage inconsistencies and cgroup leaks, detailing step‑by‑step diagnostics, Ceph repair commands, pod eviction, node reboot, and post‑mortem recommendations for cluster operations.
Background
We received an alert from the test-environment cluster and logged into the Kubernetes cluster to investigate.
Fault Diagnosis
2.1 Check Pods
Observed an abnormal Calico pod in the kube‑system namespace on node2.
Closer inspection revealed that node2 had run out of space ("no space left on device") and was suffering a cgroup leak.
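The abnormal pod can be located with standard kubectl commands; a minimal sketch, where the node name and the Calico pod name are assumptions for illustration:

```shell
# List kube-system pods scheduled on node2 (node name assumed; adjust to your cluster)
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=node02

# Describe the abnormal Calico pod to see its events
# (the exact pod name below is hypothetical)
kubectl describe pod -n kube-system calico-node-xxxxx
```

The Events section of `kubectl describe` is typically where "no space left on device" and cgroup-related errors surface.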
2.2 Check Storage
Logged into node2 to check the server's storage; disk space appeared sufficient.
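A quick sketch of the on-node check: both filesystem space and inodes are worth ruling out, since inode exhaustion also reports as "no space left on device".

```shell
# On node2: check filesystem space usage
df -h

# Check inode usage as well; inode exhaustion produces the same error message
df -i
```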
The cluster uses Ceph distributed storage, so the Ceph cluster status was examined.
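The Ceph cluster state can be inspected with the standard status commands:

```shell
# Overall cluster state: health flag, mon/OSD quorum, PG summary
ceph -s

# Detailed health output; inconsistent PGs and scrub errors are listed here
ceph health detail
```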
Operations
3.1 Ceph Repair
Detected Ceph cluster anomalies that could cause node2 cgroup leaks and performed a manual Ceph repair.
Data inconsistency (incorrect object size or missing objects after recovery) can lead to scrub errors.
During scrubbing, Ceph may find object size metadata that does not match across replicas, causing the scrub to fail.
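On this cluster the problematic PG could be read out of the health output; a sketch (the PG id 1.7c is the one from this incident):

```shell
# Inconsistent PGs appear in the health detail, e.g.
# "pg 1.7c is active+clean+inconsistent, acting [...]"
ceph health detail | grep inconsistent

# Optionally list which objects inside the PG are inconsistent
rados list-inconsistent-obj 1.7c --format=json-pretty
```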
Identified the problematic PG 1.7c and repaired it:
<code>ceph pg repair 1.7c</code>
After the repair, the Ceph cluster recovered.
3.2 Pod Repair
Deleted the abnormal pod; the controller automatically recreated the latest pod.
The recreated pod showed the same anomaly, most likely because the Ceph issues had triggered the cgroup leak on node2. Further research indicated the kernel was too old (Linux 3.10.0‑862.el7.x86_64) and that disabling kmem accounting could help.
3.3 Further Fault Diagnosis
During container startup, runc enables kmem accounting by default, which can cause leaks on kernel 3.10.
Rebooting a server that reports "no space left on device" clears the leaked cgroups and resolves the issue; the leak appears to be triggered by large numbers of pod creations and deletions.
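The leak can be confirmed on the node itself. On the affected 3.10 kernels, memory cgroup ids are limited to 65535, and a leaking node's live cgroup count creeps toward that limit; a sketch, where the cgroup path in the last command is an assumption (it varies by container runtime and cgroup driver):

```shell
# Kernel version; 3.10.x is affected by the kmem-accounting leak
uname -r

# num_cgroups for the memory controller; on a leaking 3.10 node
# this climbs toward the 65535 memory-cgroup id limit
grep memory /proc/cgroups

# Check whether kmem accounting is active for the pods cgroup
# (path is an assumption; adjust for your runtime/driver)
cat /sys/fs/cgroup/memory/kubepods/memory.kmem.usage_in_bytes
```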
3.4 Node2 Maintenance
3.4.1 Mark node2 as unschedulable
<code>kubectl cordon node02</code>
3.4.2 Drain pods from node2
<code>kubectl drain node02 --delete-local-data --ignore-daemonsets --force</code>
Options explained:
--delete-local-data: delete the pods' local data, including emptyDir volumes.
--ignore-daemonsets: skip DaemonSet-managed pods, since the DaemonSet controller would immediately recreate them on the node anyway.
--force: also delete bare pods that are not managed by a ReplicationController, ReplicaSet, DaemonSet, StatefulSet, or Job.
All pods on node2 were successfully evicted.
During the migration, pods are terminated and then recreated on other nodes, so the service interruption equals the rebuild time plus startup time plus readiness-probe time; a workload is considered healthy again only once it reaches 1/1 Running.
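The eviction and rescheduling can be verified from the control plane; a sketch, assuming node02 as the node name:

```shell
# Confirm that only DaemonSet pods remain on node02 after the drain
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=node02

# Watch the rescheduled pods until they reach 1/1 Running
kubectl get pods --all-namespaces -o wide -w
```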
3.4.3 Reboot node02
After reboot, node02 was restored and ready for scheduling.
<code>kubectl uncordon node02</code>
Reflection
Future work includes upgrading the kernel of the Kubernetes cluster.
Pod anomalies may stem from underlying storage issues; precise diagnosis and targeted fixes are essential.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.