
Detect and Visualize Node-Level Failures in Kubernetes with NPD and Grafana

Learn how to proactively detect node‑level system anomalies in Kubernetes using the Node Problem Detector, expose its metrics to Prometheus, and visualize alerts in Grafana, including step‑by‑step commands for pod inspection, ServiceMonitor creation, and dashboard import.

Linux Ops Smart Journey

In daily Kubernetes cluster operations, we rely on Prometheus + Grafana to monitor CPU, memory, and disk, and use log systems like ELK to troubleshoot containers. However, node‑level system exceptions are often overlooked, for example:

Kernel deadlock (kernel: BUG: soft lockup)

Disk I/O error (end_request: I/O error)

Filesystem read‑only (EXT4-fs error: remounting filesystem read-only)

Hardware failure (MCE: Hardware error)

The Kubernetes community introduced Node Problem Detector (NPD) to actively detect these system anomalies and, combined with Grafana, provide visualized alerts for early detection and warning.
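NPD's kernel monitor works by matching kernel log lines against configurable regex rules. A quick way to sanity‑check such a rule locally (a sketch: the pattern below is illustrative and may differ from the exact rules in your kernel-monitor.json) is to run it against a sample log line:

```shell
# Sanity-check a kernel-monitor-style pattern locally (no cluster needed).
# The real patterns live in NPD's kernel-monitor.json; this one is illustrative.
msg="kernel: BUG: soft lockup - CPU#0 stuck for 22s!"
if echo "$msg" | grep -qE 'BUG: soft lockup - CPU#[0-9]+ stuck for [0-9]+s'; then
  echo "matched: would surface as a KernelDeadlock-style problem"
fi
```

On a live node, writing a matching line to the kernel log (e.g. via /dev/kmsg, with root) is a common way to verify that NPD picks it up end to end.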

Node Problem Detector overview

Node Problem Detector runs as a DaemonSet, one pod per node, and exposes Prometheus metrics by default. After installation, you can list its pods:

$ kubectl -n kube-system get pod -o wide -l app=node-problem-detector
NAME                          READY   STATUS    RESTARTS        AGE   IP               NODE           NOMINATED NODE   READINESS GATES
node-problem-detector-4r95l   1/1     Running   4 (5h49m ago)   25h   10.244.135.175   k8s-node03     <none>           <none>
node-problem-detector-6gqtk   1/1     Running   4 (5h37m ago)   25h   10.244.195.55    k8s-master03   <none>           <none>
node-problem-detector-9bv6x   1/1     Running   4 (5h37m ago)   25h   10.244.122.187   k8s-master02   <none>           <none>
node-problem-detector-gfzcm   1/1     Running   4 (5h37m ago)   25h   10.244.58.204    k8s-node02     <none>           <none>
node-problem-detector-ghhw6   1/1     Running   4 (5h49m ago)   25h   10.244.217.146   k8s-node04     <none>           <none>
node-problem-detector-pxxdx   1/1     Running   4 (5h49m ago)   25h   10.244.85.240    k8s-node01     <none>           <none>
node-problem-detector-ttlxv   1/1     Running   4 (5h37m ago)   25h   10.244.32.184    k8s-master01   <none>           <none>

Then fetch its metrics endpoint (NPD serves Prometheus metrics on port 20257 by default):

$ curl 10.244.32.184:20257/metrics
# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
problem_counter{reason="CorruptDockerImage"} 0
problem_counter{reason="CorruptDockerOverlay2"} 0
... (additional metric lines omitted for brevity) ...
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="CorruptDockerOverlay2",type="CorruptDockerOverlay2"} 0
problem_gauge{reason="DockerHung",type="KernelDeadlock"} 0
problem_gauge{reason="FilesystemIsReadOnly",type="ReadonlyFilesystem"} 0
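Once these metrics land in Prometheus, you can alert whenever any problem_gauge flips to 1. A minimal PrometheusRule sketch (the release: monitor label mirrors the ServiceMonitor below; the group, alert name, and thresholds are my own assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    release: monitor
  name: node-problem-detector-rules
  namespace: kube-system
spec:
  groups:
  - name: node-problem-detector
    rules:
    - alert: NodeProblemDetected
      # problem_gauge is 1 while the problem condition is active on the node
      expr: problem_gauge > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.node }} reports problem {{ $labels.reason }}"
```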

To let Prometheus scrape these metrics, create a ServiceMonitor:

cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: monitor
  name: node-problem-detector
  namespace: kube-system
spec:
  endpoints:
  - interval: 60s
    path: /metrics
    port: exporter
    relabelings:
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: node
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_host_ip
      targetLabel: host_ip
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app: node-problem-detector
EOF

Verify that Prometheus has picked up the target (Prometheus target view screenshot omitted).

Import the Grafana dashboard with ID 15549 (Node Problem Detector) to visualize the metrics.
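The dashboard panels boil down to PromQL over the two metric families shown earlier; queries along these lines (illustrative sketches, not necessarily the dashboard's exact expressions) are also useful on their own in the Prometheus UI:

```
# Nodes currently affected by any problem type
sum by (node, reason) (problem_gauge) > 0

# How often each problem fired per node over the last 24h
sum by (node, reason) (increase(problem_counter[24h]))
```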

Grafana dashboard example (screenshot omitted)

Node Problem Detector fills the "event awareness" gap in the Kubernetes monitoring stack, while Grafana makes these events visible and searchable. Deploy NPD in production clusters and tailor its detection rules and dashboards to catch node-level faults before they cascade.

Tags: cloud native, Kubernetes, Prometheus, Grafana, Node Problem Detector
Written by Linux Ops Smart Journey

The operations journey never stops, pursuing excellence endlessly.