16 Must‑Watch Kubernetes Metrics to Keep Your Cluster Healthy
This article identifies the 16 most critical Kubernetes metrics—ranging from crash loops and CPU utilization to etcd leader status and application metrics—that you should monitor and alert on to maintain a healthy, performant cluster.
Kubernetes generates millions of new metrics daily, making it challenging to filter the important ones for cluster health monitoring.
1. Crash Loops
Crash loops occur when a pod repeatedly crashes and restarts, preventing the application from running.
Often caused by application crashes inside the pod.
May result from misconfigurations in the pod or deployment process.
Logs must be examined to resolve the issue.
The open-source component kube-eventer can be used to push events.
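A crash-loop check can be sketched in a few lines. The sketch below is a hypothetical example, not the article's tooling: it assumes pod data shaped like the Kubernetes API's pod objects (as returned by `kubectl get pods -o json`), and the restart threshold of 5 is illustrative.

```python
# Hypothetical sketch: flag crash-looping pods from data shaped like the
# Kubernetes Pod API object (field names follow the real API; the input
# itself and the threshold are illustrative assumptions).

def crash_looping(pods, restart_threshold=5):
    """Return names of pods stuck in CrashLoopBackOff or restarting heavily."""
    flagged = []
    for pod in pods:
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting", {})
            if (waiting.get("reason") == "CrashLoopBackOff"
                    or cs.get("restartCount", 0) >= restart_threshold):
                flagged.append(pod["metadata"]["name"])
                break  # one bad container is enough to flag the pod
    return flagged
```

Once a pod is flagged, its container logs (`kubectl logs --previous`) are the natural next step, as noted above.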
2. CPU Utilization
CPU utilization measures the percentage of CPU used by a node. Monitoring is important for two reasons:
Applications must not exhaust their allocated CPU; if an application is hitting its limit, increase its CPU allocation or its pod count, possibly adding more servers.
Idle CPUs indicate over‑provisioned resources and wasted cost.
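The two concerns above amount to alerting on both ends of a utilization range. The function below is a minimal illustrative sketch; the 20% and 80% thresholds are assumptions, not values from the article, and real alerting would average utilization over a window rather than use a single sample.

```python
# Illustrative thresholds: alert on both over- and under-utilization.

def classify_cpu(utilization, low=0.2, high=0.8):
    """Classify a node's CPU utilization (0.0-1.0) against two thresholds."""
    if utilization >= high:
        return "scale-up"           # nearing limits: add CPU, pods, or nodes
    if utilization <= low:
        return "over-provisioned"   # idle capacity: wasted cost
    return "healthy"
```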
3. Disk Pressure
Disk pressure signals that a node is using too much disk space or consuming it too quickly, based on configured thresholds.
If the application legitimately needs more space, additional disk capacity may be required.
Unexpected rapid disk consumption may indicate abnormal application behavior.
4. Memory Pressure
Memory pressure indicates insufficient memory on a node.
It may reveal memory leaks within applications.
5. PID Pressure
PID pressure is a rare condition where pods or containers generate excessive processes, exhausting the node’s available process IDs.
Each node has a finite number of PIDs to allocate.
When IDs are exhausted, new processes cannot start.
Kubernetes allows setting PID thresholds for pods; PID pressure means one or more pods have exhausted their allocated PIDs and need investigation.
6. Network Unavailable
All nodes require network connectivity; this condition indicates network connection problems on a node.
Caused by misconfiguration (e.g., exhausted route tables) or physical network hardware issues.
The open-source component KubeNurse can be used for cluster network monitoring.
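Disk pressure, memory pressure, PID pressure, and network unavailability are all surfaced the same way: as conditions in each node's status. This hypothetical sketch scans node objects shaped like the output of `kubectl get nodes -o json` for any of these conditions firing; the condition type names follow the real API.

```python
# Node conditions that signal trouble (type names match the Kubernetes API).
PRESSURE_CONDITIONS = {"DiskPressure", "MemoryPressure",
                       "PIDPressure", "NetworkUnavailable"}

def unhealthy_nodes(nodes):
    """Map node name -> list of firing pressure conditions."""
    result = {}
    for node in nodes:
        firing = [c["type"]
                  for c in node.get("status", {}).get("conditions", [])
                  if c["type"] in PRESSURE_CONDITIONS and c.get("status") == "True"]
        if firing:
            result[node["metadata"]["name"]] = firing
    return result
```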
7. Job Failures
Jobs run pods for a limited time and release them after completing their intended function.
Failures may occur due to node crashes, restarts, or resource exhaustion.
A failed job does not always mean the application is unreachable, but unresolved failures can cause problems later.
The open-source component kube-eventer can be used to push events.
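A job failure is recorded as a `Failed` condition in the job's status. The sketch below is an illustrative check over job objects shaped like the Kubernetes Job API (the condition type and status fields follow the real API; the input data is assumed).

```python
def failed_jobs(jobs):
    """Return names of jobs carrying a Failed=True condition."""
    bad = []
    for job in jobs:
        conditions = job.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Failed" and c.get("status") == "True"
               for c in conditions):
            bad.append(job["metadata"]["name"])
    return bad
```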
8. Persistent Volume Failures
Persistent volumes provide storage for pods; a volume stays bound for as long as a pod's claim needs it and is reclaimed when no longer needed.
If reclamation fails for any reason, it signals problems with persistent storage.
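A volume whose reclamation failed shows up in its status phase. The check below is a hedged sketch over objects shaped like the PersistentVolume API: `Failed` means automated reclamation failed outright, while a long-lived `Released` phase can also indicate a volume stuck waiting for reclamation (with a `Retain` policy, `Released` may be expected, so treat it as a prompt to investigate rather than a hard failure).

```python
def stuck_volumes(pvs):
    """Persistent volumes whose reclamation failed or may be stuck.
    Phase names follow the Kubernetes PersistentVolume API."""
    return [pv["metadata"]["name"] for pv in pvs
            if pv.get("status", {}).get("phase") in ("Failed", "Released")]
```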
9. Pod Pending Delays
A pod in the "pending" state is waiting to be scheduled on a node; prolonged pending usually means insufficient resources.
May require adjusting CPU/memory allocations, deleting pods, or adding more nodes.
The open-source component kube-eventer can be used to push events.
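"Prolonged pending" can be made concrete by comparing a pod's creation time against a grace period. This sketch is illustrative: the 300-second threshold is an assumption, and the timestamp format follows the Kubernetes API's RFC 3339 convention.

```python
from datetime import datetime

def pending_too_long(pods, now, max_seconds=300):
    """Pods sitting in Pending longer than max_seconds (threshold illustrative)."""
    late = []
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        # Kubernetes timestamps look like "2024-01-01T11:00:00Z" (RFC 3339).
        created = datetime.fromisoformat(
            pod["metadata"]["creationTimestamp"].replace("Z", "+00:00"))
        if (now - created).total_seconds() > max_seconds:
            late.append(pod["metadata"]["name"])
    return late
```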
10. Deployment Glitches
Deployments manage stateless applications where pods are interchangeable.
Monitor deployments to ensure rollouts complete correctly; a persistent mismatch between observed and desired replica counts indicates a failed deployment.
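The observed-vs-desired comparison can be expressed directly. This is an illustrative sketch over objects shaped like the Deployment API (`spec.replicas` vs `status.readyReplicas`); in practice a mismatch only counts as a failure once it persists past the rollout window.

```python
def failed_rollouts(deployments):
    """Deployments whose ready replicas lag the desired count."""
    bad = []
    for d in deployments:
        desired = d.get("spec", {}).get("replicas", 1)  # API default is 1
        ready = d.get("status", {}).get("readyReplicas", 0)
        if ready < desired:
            bad.append(d["metadata"]["name"])
    return bad
```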
11. StatefulSets Not Ready
StatefulSets manage stateful applications with pods that have specific roles.
Ensure the observed replica count matches the desired count; mismatches indicate failures.
The open-source component kube-eventer can be used to push events.
12. DaemonSets Not Ready
DaemonSets run services or applications on every node in the cluster.
Ensure the observed pod count matches the desired count on every node; mismatches indicate failures.
The open-source component kube-eventer can be used to push events.
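Both the StatefulSet and DaemonSet checks follow the same observed-vs-desired pattern as deployments. The sketch below is illustrative; the status field names (`readyReplicas`, `desiredNumberScheduled`, `numberReady`) follow the real Kubernetes API, but the input objects are assumed.

```python
def unready_statefulsets(statefulsets):
    """StatefulSets whose ready replicas lag the desired count."""
    return [s["metadata"]["name"] for s in statefulsets
            if s.get("status", {}).get("readyReplicas", 0)
               < s.get("spec", {}).get("replicas", 1)]

def unready_daemonsets(daemonsets):
    """DaemonSets not yet ready on every node they should run on."""
    return [d["metadata"]["name"] for d in daemonsets
            if d.get("status", {}).get("numberReady", 0)
               < d.get("status", {}).get("desiredNumberScheduled", 0)]
```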
13. etcd Leaders
An etcd cluster should always have a leader, except during leader transitions.
The metric etcd_server_has_leader indicates whether a leader exists.
The metric etcd_server_leader_changes_seen_total counts leader changes; frequent changes may signal connectivity or resource issues.
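The two etcd metrics translate into two alert conditions: no leader right now, or too many leader changes within a window. The evaluation below is a minimal sketch; the metric names are real etcd metrics, but the alert names and the threshold of 3 changes per window are assumptions.

```python
def etcd_alerts(has_leader, leader_changes, prev_leader_changes, max_changes=3):
    """Evaluate the two etcd leader metrics.

    has_leader: latest etcd_server_has_leader sample (1 = a leader exists).
    leader_changes / prev_leader_changes: etcd_server_leader_changes_seen_total
    now and one evaluation window ago; max_changes is illustrative.
    """
    alerts = []
    if has_leader != 1:
        alerts.append("NoEtcdLeader")
    if leader_changes - prev_leader_changes > max_changes:
        alerts.append("FrequentLeaderChanges")
    return alerts
```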
14. Scheduler Problems
The scheduler has two aspects to monitor.
Track scheduler_schedule_attempts_total{result="unschedulable"}; an increase in unschedulable pods may indicate resource problems.
Monitor scheduler latency metrics; rising scheduling delays can also point to cluster resource issues.
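Because scheduler_schedule_attempts_total is a monotonically increasing counter, what matters is its rate of increase between scrapes, not its absolute value. This illustrative sketch computes per-interval deltas from a series of counter samples (assumed input; counter resets are ignored for brevity).

```python
def unschedulable_rate(samples):
    """Per-interval increase of a counter sampled at successive scrapes.
    Counters only go up, so any positive delta means new unschedulable
    attempts occurred in that interval."""
    return [b - a for a, b in zip(samples, samples[1:])]
```

A sustained positive rate here, or rising scheduling latency, is the signal to look at cluster capacity.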
15. Events
Beyond numeric metrics, collecting and tracking cluster events is valuable for monitoring pod lifecycle and major failures; sudden changes in event rate can serve as early warning signs.
The open-source component kube-eventer can be used to push events.
16. Application Metrics
Unlike the previous metrics, application metrics are emitted by workloads running in the cluster, covering error responses, request latency, processing time, etc.
Traditional approach: applications "push" metrics to a collection endpoint.
Modern approach: collectors "pull" metrics from applications (e.g., OpenMetrics), simplifying application code and enabling powerful discovery when combined with service discovery.
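In the pull model, the application's only job is to expose its current metric values in a format the collector understands. The sketch below renders counters in the Prometheus/OpenMetrics text exposition format; the rendering function and the metric name in the usage are illustrative, not a library API.

```python
def render_exposition(counters):
    """Render counters in Prometheus/OpenMetrics text exposition format,
    so a collector can scrape them. Input is a dict of name -> value."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Serving this text from an HTTP endpoint (conventionally `/metrics`) is all an application needs for a collector to discover and scrape it.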
Summary
Monitoring Kubernetes health can be complex and overwhelming, but focusing on these high‑value metrics helps filter noise, prioritize issues, and maintain a reliable cluster experience.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.