Which Kubernetes Health Metrics Really Matter? A Practical Guide
This article explains the most critical Kubernetes health metrics: resource utilization, object state, control-plane, event, and application metrics. It gives concrete metric names, explains why they matter, and shows how to monitor and alert on them to keep clusters reliable and performant.
Resource and Utilization Metrics
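These metrics come from the Kubernetes Metrics API, whose data is collected by the kubelet on each node, and describe node and pod resource consumption.
CPU usage (usageNanoCores) : CPU used by a node or pod, in nanocores (billionths of a core), averaged over the sampling window.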
CPU capacity (capacity_cpu) : total number of CPU cores available on a node.
Memory usage (used{resource:memory,units:bytes}) : bytes of memory currently used by a node or pod.
Memory capacity (capacity_memory{units:bytes}) : total bytes of memory available on a node.
Network traffic (rx{resource:network,units:bytes} / tx{resource:network,units:bytes}) : inbound and outbound bytes observed on a node or pod.
CPU usage relative to capacity is the primary health indicator in this group: sustained high utilization may call for larger CPU allocations or more nodes, while consistently low utilization suggests over-provisioning.
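As a rough illustration of comparing usage with capacity, the sketch below converts a usageNanoCores reading into a fraction of node capacity and labels it. The function names and the 85% / 20% thresholds are hypothetical choices for this example, not values from Kubernetes or from the source article.

```python
NANOCORES_PER_CORE = 1_000_000_000  # 1 core = 10^9 nanocores


def cpu_utilization(usage_nanocores: int, capacity_cores: float) -> float:
    """Return CPU usage as a fraction of capacity (0.5 means 50%)."""
    if capacity_cores <= 0:
        raise ValueError("capacity_cores must be positive")
    return usage_nanocores / (capacity_cores * NANOCORES_PER_CORE)


def classify(utilization: float, high: float = 0.85, low: float = 0.20) -> str:
    """Label a utilization value; the thresholds are illustrative assumptions."""
    if utilization >= high:
        return "high"
    if utilization <= low:
        return "low"
    return "normal"
```

In practice the label would be computed over a sustained window rather than a single sample, since short CPU bursts are normal.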
State Metrics (kube‑state‑metrics)
kube-state-metrics reports the current status of Kubernetes objects.
Node conditions (kube_node_status_condition) : reports whether a node is under MemoryPressure, PIDPressure, DiskPressure, or NetworkUnavailable (and OutOfDisk on older Kubernetes versions).
CrashLoopBackOff (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) : indicates a container repeatedly crashing.
Job failures (kube_job_status_failed) : count of failed batch jobs.
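PersistentVolume failures (kube_persistentvolume_status_phase{phase="Failed"}) : volumes currently in the Failed phase.
Pod pending (kube_pod_status_phase{phase="Pending"}) : pods that have been accepted but are not yet running.
Deployment generation (kube_deployment_metadata_generation) and observed generation (kube_deployment_status_observed_generation) : a persistent mismatch means the controller has not yet acted on the latest Deployment spec.
DaemonSet desired nodes (kube_daemonset_status_desired_number_scheduled) and current nodes (kube_daemonset_status_current_number_scheduled) : the two values should match when the DaemonSet is healthy.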
StatefulSet desired replicas (kube_statefulset_replicas) and ready replicas (kube_statefulset_status_replicas_ready) : fewer ready than desired replicas indicates a readiness problem.
Alert on crash loops, any node pressure condition, job failures, PV failures, pods stuck in Pending, mismatched Deployment generations, and DaemonSet/StatefulSet readiness problems.
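The alert conditions above can be sketched as simple comparisons over one snapshot of metric values. This is a minimal sketch, assuming the samples have already been scraped into a flat dictionary keyed by metric name for a single object; the dictionary layout, the function name, and the alert messages are hypothetical and not part of kube-state-metrics.

```python
from typing import Dict, List

CRASH_LOOP = 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}'


def state_alerts(samples: Dict[str, float]) -> List[str]:
    """Return alert messages for one snapshot of kube-state-metrics values."""
    alerts = []
    if samples.get(CRASH_LOOP, 0) > 0:
        alerts.append("container in CrashLoopBackOff")
    if samples.get("kube_job_status_failed", 0) > 0:
        alerts.append("job has failed pods")
    # A Deployment whose observed generation lags its metadata generation
    # has a spec change the controller has not processed yet.
    generation = samples.get("kube_deployment_metadata_generation", 0)
    observed = samples.get("kube_deployment_status_observed_generation", 0)
    if generation != observed:
        alerts.append("deployment generation mismatch")
    desired = samples.get("kube_daemonset_status_desired_number_scheduled", 0)
    current = samples.get("kube_daemonset_status_current_number_scheduled", 0)
    if current < desired:
        alerts.append("daemonset missing pods")
    return alerts
```

A real alerting rule would normally also require the condition to hold for some minutes, so that a rollout in progress does not page anyone.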
Control‑Plane Metrics
etcd leader presence (etcd_server_has_leader) : 1 if the member knows its leader.
etcd leader changes (etcd_server_leader_changes_seen_total) : total number of leader elections; a high rate may indicate connectivity or resource issues.
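API request count (apiserver_request_latencies_count) and total latency (apiserver_request_latencies_sum) : average latency = sum / count. Newer Kubernetes releases expose the same data as apiserver_request_duration_seconds_count and apiserver_request_duration_seconds_sum.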
Workqueue queue duration (workqueue_queue_duration_seconds) : time items spend waiting in controller manager queues.
Workqueue processing duration (workqueue_work_duration_seconds) : time spent processing items.
Scheduler unschedulable attempts (scheduler_schedule_attempts_total{result="unschedulable"}) : number of pods the scheduler could not place.
Pod scheduling latency (scheduler_e2e_scheduling_duration_seconds) (or scheduler_e2e_scheduling_latency_microseconds for versions < v1.14): end-to-end time the scheduler takes to place a pod, from picking it up to binding it to a node.
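Because the _sum and _count series are cumulative counters, dividing them directly gives the average since the process started. To get the average latency over a recent window, divide the change in the sum by the change in the count between two scrapes. The helper below is a minimal sketch of that arithmetic; its name and signature are hypothetical, and the result is in whatever unit the underlying metric uses.

```python
from typing import Optional


def average_latency(sum_prev: float, count_prev: float,
                    sum_now: float, count_now: float) -> Optional[float]:
    """Average latency per request between two scrapes of _sum and _count.

    Returns None when no requests were observed or a counter was reset
    (for example after an API server restart).
    """
    delta_count = count_now - count_prev
    delta_sum = sum_now - sum_prev
    if delta_count <= 0 or delta_sum < 0:
        return None
    return delta_sum / delta_count
```

For example, if the sum grew from 1000 to 4000 while the count grew from 10 to 20, the windowed average is 3000 / 10 = 300 per request.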
Control‑Plane Health Checks
Continuously monitor etcd leadership: a member without a leader cannot process writes, so a lost leader degrades or halts the control plane. Frequent leader changes suggest cluster instability. Track average API latency to detect API-server overload. Watch workqueue delays for controller bottlenecks. Rising unschedulable attempts or scheduling latency are early signs of resource exhaustion.
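The etcd leadership checks can be sketched as follows, assuming one etcd_server_has_leader value per member and two readings of etcd_server_leader_changes_seen_total taken at the start and end of a window. The function name, the member names, and the default threshold of 3 changes per window are hypothetical.

```python
from typing import Dict, List


def etcd_leader_alerts(has_leader_by_member: Dict[str, int],
                       leader_changes_prev: int,
                       leader_changes_now: int,
                       max_changes: int = 3) -> List[str]:
    """Flag members without a leader and too many leader changes in a window."""
    alerts = [f"{member} has no leader"
              for member, value in sorted(has_leader_by_member.items())
              if value != 1]
    changes = leader_changes_now - leader_changes_prev
    if changes < 0:
        # The counter was reset (member restart); count from zero.
        changes = leader_changes_now
    if changes > max_changes:
        alerts.append(f"{changes} leader changes in window")
    return alerts
```

A call such as etcd_leader_alerts({"etcd-0": 1, "etcd-1": 0}, 2, 7) would report both the member without a leader and the five leader changes.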
Cluster Events
In addition to numeric metrics, ingest Kubernetes events (e.g., pod lifecycle events). A sudden spike in event rate often precedes larger failures and can be used as a warning signal.
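One simple way to turn event rate into a warning signal is to compare the most recent count with a baseline built from earlier counts. The sketch below assumes events have already been bucketed into per-minute counts; the function name, the factor of 3, and the minimum baseline are illustrative assumptions rather than recommended values.

```python
from statistics import mean
from typing import Sequence


def is_event_spike(counts_per_minute: Sequence[int],
                   factor: float = 3.0,
                   min_baseline: float = 1.0) -> bool:
    """Return True when the latest count exceeds factor x the earlier average."""
    if len(counts_per_minute) < 2:
        return False  # not enough history to form a baseline
    *history, latest = counts_per_minute
    # Floor the baseline so a quiet cluster does not alert on a single event.
    baseline = max(mean(history), min_baseline)
    return latest > factor * baseline
```

With counts of 10, 12, 11 and then 50, the baseline is 11 and the threshold is 33, so the last minute is flagged as a spike.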
Application Metrics
Application‑level metrics are emitted by workloads, not by Kubernetes itself. They can be collected via push (e.g., StatsD) or pull (OpenMetrics). Pull‑based collection is preferred because the application only needs to expose an /metrics endpoint, while the collector handles discovery and scraping.
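To show how little the application has to do in the pull model, here is a minimal sketch of a /metrics endpoint built only on the Python standard library. It renders counters in the plain-text exposition format that Prometheus-style scrapers read. The metric name app_requests_total, the COUNTERS dictionary, and port 8000 are hypothetical; a real service would more likely use an existing client library than hand-rolled rendering.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict

# Hypothetical in-process counters that the application would increment.
COUNTERS: Dict[str, int] = {"app_requests_total": 0}


def render_metrics(counters: Dict[str, int]) -> str:
    """Render counters in the text exposition format, one sample per line."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever serving /metrics; the collector handles the scraping."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; discovery, scrape intervals, and retries all stay on the collector side, which is the main appeal of the pull approach.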
Summary
Kubernetes can emit millions of new metrics per day. Monitoring systems must be able to ingest this volume, filter noise, and surface the most critical signals. Effective monitoring therefore requires:
Collecting core resource, state, and control‑plane metrics listed above.
Alerting on the specific conditions described.
Using a pull‑based collector (e.g., Prometheus) to handle large metric cardinality and to integrate with service discovery.
Reference: https://www.kubernetes.org.cn/8752.html
