Why Kubernetes Nodes Go NotReady: A Deep Dive into Systemd, Docker, and PLEG
An Alibaba Cloud expert walks through a rare Kubernetes node NotReady issue, detailing how PLEG failures, Docker daemon stack traces, runC’s D‑Bus interactions, and a systemd cookie overflow combine to stall nodes, and explains the debugging steps and eventual fix.
An Alibaba Cloud post‑mortem describes a low‑probability Kubernetes node NotReady problem that appears roughly once a month in production clusters.
Background
Kubernetes clusters consist of Master and Worker nodes. The Master runs control‑plane components, while Workers run user workloads. Each node runs a kubelet agent that communicates with the control plane.
Initial Symptom
When a node goes NotReady, the Master stops scheduling new Pods onto it and can no longer retrieve runtime information from it. systemctl status kubelet still shows the service as running, but journalctl -u kubelet reveals errors indicating that the container runtime is unhealthy.
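The first-pass triage described above can be sketched as a few commands run on the node itself (assuming a systemd-managed kubelet; unit names can differ by distro):

```shell
#!/bin/sh
# First-pass triage on a suspect node. Assumes a systemd-managed kubelet.
if [ ! -d /run/systemd/system ]; then
  status="not a systemd host"
else
  # The service usually still reports "active" even though the node is NotReady.
  status=$(systemctl is-active kubelet || true)
  # The telling errors live in the kubelet journal:
  journalctl -u kubelet --since '10 min ago' --no-pager \
    | grep -iE 'PLEG is not healthy|container runtime' || true
fi
echo "kubelet: $status"
```

On an affected node, the journal typically contains lines like "PLEG is not healthy: pleg was last seen active Xm ago", which is what drives the NotReady condition.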
What Is PLEG?
PLEG (Pod Lifecycle Event Generator) is the kubelet component that periodically relists containers from the runtime (here, the Docker daemon) and turns the results into Pod lifecycle events. The kubelet also uses it as a health check: when relisting stalls, it reports "PLEG is not healthy", treats the runtime as non-functional, and the node is marked NotReady.
Docker Daemon Stack Analysis
Since version 1.11, Docker has been split into several components: the docker daemon, containerd, containerd-shim, and runC. To inspect the daemon, send it SIGUSR1 (kill -USR1 &lt;pid&gt;); it dumps its goroutine stacks to a file under /var/run/docker. In a typical goroutine stack, the lower frames show HTTP request routing and the upper frames show the specific handler, which is ultimately blocked waiting on a mutex.
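A minimal sketch of that stack-dump step (the trace file name pattern under /var/run/docker follows the daemon's goroutine-stacks naming; check the daemon log for the exact file it reports):

```shell
#!/bin/sh
# Ask the Docker daemon to dump its goroutine stacks (Docker >= 1.11).
set -- $(pidof dockerd 2>/dev/null || true)
pid=$1
if [ -z "$pid" ]; then
  echo "dockerd is not running"
else
  kill -USR1 "$pid"            # dockerd catches SIGUSR1 and dumps its stacks
  sleep 1
  # The trace lands under /var/run/docker; show the newest dump file:
  ls -t /var/run/docker/goroutine-stacks-* 2>/dev/null | head -n 1
fi
```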
Containerd Stack Analysis
Containerd supports the same trick: kill -SIGUSR1 &lt;pid&gt; makes it dump its stacks, in this case to its log (the messages log) rather than a file. The stacks show a thread created by containerd invoking runC to launch container processes, and some runC processes are still running, meaning runC never finished creating the containers.
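The equivalent dump for containerd, a sketch assuming the classic setup where containerd logs to the system messages log:

```shell
#!/bin/sh
# Ask containerd to dump its stacks; the dump goes to its log, not a file.
set -- $(pidof containerd 2>/dev/null || true)
pid=$1
if [ -z "$pid" ]; then
  echo "containerd is not running"
else
  kill -SIGUSR1 "$pid"
  sleep 1
  # Show the tail of the dump as it appears in the messages log:
  grep -h goroutine /var/log/messages 2>/dev/null | tail -n 5 || true
fi
```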
D‑Bus Interaction
runC communicates with systemd via D‑Bus. strace shows runC hanging on a write to the D‑Bus socket, with org.freedesktop visible in the data being written. The D‑Bus daemon itself appears functional, yet the problem persists even after restarting it.
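Attaching strace to one of the lingering runC processes makes the hang visible; a sketch of that step (interrupt with Ctrl-C once you have seen the blocked write):

```shell
#!/bin/sh
# Attach strace to a lingering runC process to see where it is stuck; on the
# bad nodes it hangs in a write to the D-Bus socket.
set -- $(pidof runc 2>/dev/null || true)
pid=$1
if [ -z "$pid" ]; then
  echo "no runc process found"
elif ! command -v strace >/dev/null 2>&1; then
  echo "strace not installed"
else
  # Expect a blocked write whose buffer mentions org.freedesktop:
  strace -f -e trace=write -p "$pid"
fi
```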
Systemd Investigation
Live debugging of systemd with gdb shows the process failing in sd_bus_message_seal, which returns EOPNOTSUPP. The root cause is the cookie systemd attaches to each outgoing D‑Bus message: the dbus1 wire format only has room for a 32‑bit cookie, but systemd keeps the counter in a wider variable. After processing a large number of unit create/delete operations, the counter climbs past 0xffffffff, sealing fails for every new message, and runC's requests can no longer be answered.
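Shell arithmetic (which is at least 64‑bit on modern systems) illustrates the mismatch: one step past the 32‑bit maximum, the counter still increments happily in memory, but the value no longer fits the 32‑bit on‑wire field.

```shell
#!/bin/sh
# The dbus1 wire format gives the cookie only 32 bits, but systemd tracks it
# in a wider counter. One step past the 32-bit maximum:
max32=$((0xffffffff))
next=$((max32 + 1))
if [ "$next" -gt "$max32" ]; then
  fits=no
else
  fits=yes
fi
printf 'next=0x%x fits_in_32_bits=%s\n' "$next" "$fits"
# prints: next=0x100000000 fits_in_32_bits=no
```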
Diagnosing the Issue on a Problematic Node
Install gdb and the systemd debug symbols, attach gdb to systemd (PID 1), set a breakpoint at sd_bus_send, and inspect the current cookie value. If it exceeds 0xffffffff, the overflow is the cause of the NotReady state.
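A sketch of that check in batch mode (requires systemd debuginfo; the exact name of the counter field inside struct sd_bus is an assumption here and varies across systemd versions, so adjust the print expression to what the symbols show):

```shell
#!/bin/sh
# Break on sd_bus_send in the running systemd and print the bus cookie.
if ! command -v gdb >/dev/null 2>&1; then
  msg="gdb not installed"
elif [ ! -d /run/systemd/system ]; then
  msg="not a systemd host"
else
  msg="inspecting PID 1"
  gdb -p 1 -batch \
      -ex 'break sd_bus_send' \
      -ex 'continue' \
      -ex 'print /x bus->cookie' \
      -ex 'detach' -ex 'quit'
fi
echo "$msg"
```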
Fix
The fix unifies cookie handling so that both dbus1 (32‑bit cookies) and dbus2 (64‑bit cookies) stay within the 32‑bit range. When the counter reaches 0xffffffff, the next value wraps to 0x80000000: keeping the high bit set marks the counter as having overflowed, and the code additionally verifies that a wrapped cookie is not already in use by an outstanding message, avoiding collisions.
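A minimal sketch of the patched increment logic as described above (the function name and exact checks are assumptions for illustration, not the real systemd C code; in particular, the real fix must also confirm a wrapped cookie is not still in use by a pending reply):

```shell
#!/bin/sh
# Wrap-on-overflow cookie increment: stay inside 32 bits, and once the
# counter tops out, restart at 0x80000000 so the high bit marks overflow.
COOKIE_CYCLED=$((0x80000000))
cookie_inc() {
  if [ "$1" -ge $((0xffffffff)) ]; then
    echo "$COOKIE_CYCLED"        # wrapped: high bit set, never back to 0
  else
    echo $(($1 + 1))
  fi
}
printf '0x%x\n' "$(cookie_inc $((0xffffffff)))"   # wraps to 0x80000000
printf '0x%x\n' "$(cookie_inc $COOKIE_CYCLED)"    # keeps counting: 0x80000001
```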
This patch has been accepted by Red Hat and will be delivered in upcoming systemd releases, eliminating the need for manual workarounds.
Conclusion
The rare NotReady problem stems from a systemd D‑Bus cookie overflow, triggered by the heavy stream of D‑Bus calls from runC's UseSystemd code path. By reading the stack traces of the Docker daemon, containerd, runC, and systemd layer by layer, and by applying the upstream fix, clusters can avoid this node‑stalling issue.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.