Detect and Visualize Node-Level Failures in Kubernetes with NPD and Grafana
Learn how to proactively detect node‑level system anomalies in Kubernetes using the Node Problem Detector, expose its metrics to Prometheus, and visualize alerts in Grafana, including step‑by‑step commands for pod inspection, ServiceMonitor creation, and dashboard import.
In daily Kubernetes cluster operations we rely on Prometheus + Grafana to monitor CPU, memory, and disk, and on log stacks such as ELK to troubleshoot containers. Node-level system faults, however, are often overlooked, for example:
- Kernel deadlock (kernel: BUG: soft lockup)
- Disk I/O error (end_request: I/O error)
- Read-only filesystem (EXT4-fs error: remounting filesystem read-only)
- Hardware failure (MCE: Hardware error)
The Kubernetes community introduced Node Problem Detector (NPD) to actively detect these system anomalies and, combined with Grafana, provide visualized alerts for early detection and warning.
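The article assumes NPD is already installed. For reference, one common way to install it is to apply the example DaemonSet manifest from the upstream kubernetes/node-problem-detector repository (the exact path may change between releases, so treat this as a sketch):

```shell
# Deploy NPD as a DaemonSet so one detector pod runs per node.
# The raw URL points at the upstream example manifest; pin a release
# tag instead of master for production use.
kubectl apply -f https://raw.githubusercontent.com/kubernetes/node-problem-detector/master/deployment/node-problem-detector.yaml

# Watch the DaemonSet roll out across all nodes
kubectl -n kube-system rollout status daemonset/node-problem-detector
```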
Node Problem Detector exposes metrics by default. After installation, you can list its pods:
$ kubectl -n kube-system get pod -o wide -l app=node-problem-detector
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-problem-detector-4r95l 1/1 Running 4 (5h49m ago) 25h 10.244.135.175 k8s-node03 <none> <none>
node-problem-detector-6gqtk 1/1 Running 4 (5h37m ago) 25h 10.244.195.55 k8s-master03 <none> <none>
node-problem-detector-9bv6x 1/1 Running 4 (5h37m ago) 25h 10.244.122.187 k8s-master02 <none> <none>
node-problem-detector-gfzcm 1/1 Running 4 (5h37m ago) 25h 10.244.58.204 k8s-node02 <none> <none>
node-problem-detector-ghhw6 1/1 Running 4 (5h49m ago) 25h 10.244.217.146 k8s-node04 <none> <none>
node-problem-detector-pxxdx 1/1 Running 4 (5h49m ago) 25h 10.244.85.240 k8s-node01 <none> <none>
node-problem-detector-ttlxv 1/1 Running 4 (5h37m ago) 25h 10.244.32.184 k8s-master01 <none> <none>

Then fetch the metrics endpoint of any pod:
$ curl 10.244.32.184:20257/metrics
# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
problem_counter{reason="CorruptDockerImage"} 0
problem_counter{reason="CorruptDockerOverlay2"} 0
... (additional metric lines omitted for brevity) ...
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="CorruptDockerOverlay2",type="CorruptDockerOverlay2"} 0
problem_gauge{reason="DockerHung",type="KernelDeadlock"} 0
problem_gauge{reason="FilesystemIsReadOnly",type="ReadonlyFilesystem"} 0

To let Prometheus scrape these metrics, create a ServiceMonitor:
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: monitor
  name: node-problem-detector
  namespace: kube-system
spec:
  endpoints:
  - interval: 60s
    path: /metrics
    port: exporter
    relabelings:
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: node
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_host_ip
      targetLabel: host_ip
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app: node-problem-detector
EOF

Verify that Prometheus has picked up the new target (screenshot omitted).
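Note that a ServiceMonitor discovers scrape targets through a Service, not through pods directly, so the NPD pods need a Service whose port is named exporter. If your NPD manifests do not already include one, a minimal sketch might look like this (the label selector and port 20257, NPD's default exporter port, are assumed from the output above):

```yaml
# Hypothetical Service fronting the NPD DaemonSet pods; the
# ServiceMonitor matches it via the app label, and the port name
# "exporter" matches the ServiceMonitor's endpoint port.
apiVersion: v1
kind: Service
metadata:
  name: node-problem-detector
  namespace: kube-system
  labels:
    app: node-problem-detector
spec:
  selector:
    app: node-problem-detector
  ports:
  - name: exporter
    port: 20257
    targetPort: 20257
```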
Import the Grafana dashboard with ID 15549 (Node Problem Detector) to visualize the metrics.
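Dashboards aside, a quick command-line check also works: since the exposition format is plain text, filtering for non-zero problem_gauge series is enough to spot active problems. A small sketch (the sample input below is hardcoded for illustration; in practice you would pipe curl against a pod's :20257/metrics endpoint instead):

```shell
# Sample problem_gauge lines, as scraped from NPD above; the
# DockerHung value is set to 1 here to simulate an active problem.
cat <<'EOF' > /tmp/npd-metrics.txt
problem_gauge{reason="CorruptDockerOverlay2",type="CorruptDockerOverlay2"} 0
problem_gauge{reason="DockerHung",type="KernelDeadlock"} 1
problem_gauge{reason="FilesystemIsReadOnly",type="ReadonlyFilesystem"} 0
EOF

# Print only the series whose value is non-zero
awk '/^problem_gauge/ && $NF != 0 {print $1}' /tmp/npd-metrics.txt
# → problem_gauge{reason="DockerHung",type="KernelDeadlock"}
```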
Node Problem Detector fills the "event awareness" gap in the Kubernetes monitoring stack, while Grafana makes these events visible and searchable so faults can be handled pre-emptively. Deploy NPD in production clusters and tailor its detection rules and dashboards to achieve early fault awareness.
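On the "tailor detection rules" point: NPD's system-log-monitor configuration is plain JSON, with conditions describing node health states and rules mapping log patterns onto them. A sketch of the shape (the condition, reason, and pattern below are illustrative, adapted from the style of the stock kernel-monitor config, not copied from it):

```json
{
  "plugin": "kmsg",
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "source": "kernel-monitor",
  "conditions": [
    {
      "type": "KernelDeadlock",
      "reason": "KernelHasNoDeadlock",
      "message": "kernel has no deadlock"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "KernelDeadlock",
      "reason": "DockerHung",
      "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
    }
  ]
}
```

A "permanent" rule flips the named node condition (surfacing in kubectl describe node and in problem_gauge), while a "temporary" rule only emits an event and increments problem_counter.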
Linux Ops Smart Journey