Operations 8 min read

How to Detect and Auto‑Heal Node Failures in Kubernetes with Node Problem Detector

This article explains why Kubernetes nodes need deeper health monitoring, introduces the Node Problem Detector (NPD) component, outlines its detection methods, and provides step‑by‑step instructions to deploy, configure, and verify NPD for automatic alerts and self‑healing in a cluster.

Linux Ops Smart Journey
Linux Ops Smart Journey
Linux Ops Smart Journey
How to Detect and Auto‑Heal Node Failures in Kubernetes with Node Problem Detector

In a Kubernetes cluster, nodes are the fundamental units that run workloads, and hidden system‑level issues such as disk failures or kernel deadlocks can cause pod scheduling problems or service outages.

What is Node Problem Detector?

Node Problem Detector (NPD) is an open‑source DaemonSet maintained by the Kubernetes SIG‑Node that watches system logs, metrics, or custom scripts on each node and translates detected problems into Kubernetes‑recognizable NodeCondition objects or Event s, enabling alerts, automatic eviction, and self‑healing.

Detection Methods

Log‑based monitoring : scans kernel OOM messages, read‑only filesystem alerts, hardware errors, etc.

Custom plugin : runs arbitrary shell scripts and evaluates exit codes or output to determine anomalies.

Metrics‑based (experimental) : collects metrics and triggers when thresholds are exceeded.

Actions When a Problem Is Detected

Creates a corresponding NodeCondition (e.g., KernelDeadlock, ReadonlyFilesystem).

Sends an Event to the API server for auditing and alerting.

Deploying NPD

1. Download the NPD Helm chart:

$ helm repo add deliveryhero https://charts.deliveryhero.io/
$ helm pull deliveryhero/node-problem-detector
$ helm push node-problem-detector-2.3.14.tgz oci://core.jiaxzeng.com/plugins
$ sudo helm pull --untar --untardir /etc/kubernetes/addons oci://core.jiaxzeng.com/plugins/node-problem-detector --version 2.3.14

2. Pull the NPD container image:

$ docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19
$ docker tag swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19 core.jiaxzeng.com/library/node-problem-detector:v0.8.19
$ docker push core.jiaxzeng.com/library/node-problem-detector:v0.8.19

3. Create the values file for the deployment:

$ sudo cat /etc/kubernetes/addons/node-problem-detector-values.yaml
fullnameOverride: "node-problem-detector"

image:
  repository: core.jiaxzeng.com/library/node-problem-detector
  tag: v0.8.19
  pullPolicy: IfNotPresent

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: monitor
  prometheusRule:
    enabled: true

4. Install NPD into the kube-system namespace:

$ helm -n kube-system install node-problem-detector -f /etc/kubernetes/addons/node-problem-detector-values.yaml /etc/kubernetes/addons/node-problem-detector
NAME: node-problem-detector
LAST DEPLOYED: Fri Oct 10 14:14:33 2025
NAMESPACE: kube-system
STATUS: deployed

Validating the NPD Service

Check that the pods are running:

$ kubectl --namespace=kube-system get pods -l "app.kubernetes.io/name=node-problem-detector,app.kubernetes.io/instance=node-problem-detector"
NAME                     READY   STATUS    RESTARTS   AGE
node-problem-detector-4r95l   1/1   Running   0   85s
...

Simulate a kernel NULL‑pointer dereference:

$ sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"
[sudo] password for ops:

Simulate a Docker hang situation:

sudo sh -c "echo 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.' >> /dev/kmsg"

Conclusion

In the cloud‑native era, system stability relies on automation, observability, and self‑healing rather than manual checks. Although small, Node Problem Detector is a crucial piece of the Kubernetes observability puzzle, turning silent node failures into visible events and giving clusters the ability to “feel pain” and recover automatically.

KubernetesDevOpsself-healingNode Problem DetectorCluster Monitoring
Linux Ops Smart Journey
Written by

Linux Ops Smart Journey

The operations journey never stops—pursuing excellence endlessly.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.