
Which Kubernetes Health Metrics Really Matter? A Practical Guide

This article covers the Kubernetes health metrics that matter most: resource utilization, object state, control-plane, events, and application metrics. For each group it gives concrete metric names, explains why they matter, and describes how to monitor and alert on them to keep clusters reliable and performant.

Full-Stack DevOps & Kubernetes

Resource and Utilization Metrics

These metrics come from the built-in metrics API, which the kubelet exposes, and describe node and pod resource consumption.

CPU usage (usageNanoCores) : CPU consumed by a node or pod, in nanocores (one nanocore is one billionth of a core), averaged over the sampling window.

CPU capacity (capacity_cpu) : total number of CPU cores available on a node.

Memory usage (used{resource:memory,units:bytes}) : bytes of memory currently used by a node or pod.

Memory capacity (capacity_memory{units:bytes}) : total bytes of memory available on a node.

Network traffic (rx{resource:network,units:bytes} / tx{resource:network,units:bytes}) : inbound and outbound bytes observed on a node or pod.

CPU usage is the primary health indicator; sustained high usage may require additional CPU allocation or nodes, while consistently low usage indicates over‑provisioning.
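
To make the comparison between usage and capacity concrete, here is a minimal Python sketch that converts a usageNanoCores reading into a fraction of node capacity and labels a series of readings. The function names and the 85% / 20% thresholds are illustrative assumptions, not values from any Kubernetes component.

```python
NANOCORES_PER_CORE = 1_000_000_000  # one core = 1e9 nanocores


def cpu_utilization(usage_nanocores, capacity_cores):
    """Return CPU usage as a fraction of capacity (0.5 means 50%)."""
    if capacity_cores <= 0:
        raise ValueError("capacity_cores must be positive")
    return usage_nanocores / (capacity_cores * NANOCORES_PER_CORE)


def classify(samples, high=0.85, low=0.20):
    """Label a series of utilization samples.

    The thresholds are assumptions for illustration; tune them per workload.
    """
    if not samples:
        return "no-data"
    if all(s >= high for s in samples):
        return "sustained-high"  # candidate for more CPU or more nodes
    if all(s <= low for s in samples):
        return "sustained-low"   # candidate for downsizing
    return "normal"
```

Requiring every sample in the window to cross the threshold is one simple way to express "sustained"; a single spike does not change the label.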

State Metrics (kube‑state‑metrics)

kube-state-metrics reports the current status of Kubernetes objects.

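Node conditions (kube_node_status_condition) : reports, per node and condition, whether the status is true, false, or unknown. The pressure conditions to watch are MemoryPressure, PIDPressure, DiskPressure, and NetworkUnavailable (plus OutOfDisk on older Kubernetes versions that still report it).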

CrashLoopBackOff (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) : indicates a container repeatedly crashing.

Job failures (kube_job_status_failed) : count of failed batch jobs.

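PersistentVolume failures (kube_persistentvolume_status_phase{phase="Failed"}) : set to 1 for each persistent volume currently in the Failed phase.

Pod pending (kube_pod_status_phase{phase="Pending"}) : set to 1 for each pod currently in the Pending phase.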

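Deployment generation (kube_deployment_metadata_generation) and observed generation (kube_deployment_status_observed_generation) : the two should match; an observed generation that lags behind means the latest spec change has not been processed yet.

DaemonSet desired nodes (kube_daemonset_status_desired_number_scheduled) and current nodes (kube_daemonset_status_current_number_scheduled) : a gap means the DaemonSet pod is missing from some nodes that should run it.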

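StatefulSet desired replicas (kube_statefulset_replicas), current replicas (kube_statefulset_status_replicas), and ready replicas (kube_statefulset_status_replicas_ready) : fewer ready replicas than desired means some pods are missing or not ready.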

Alert on crash loops, any node pressure condition, job failures, PV failures, pods stuck in Pending, mismatched Deployment generations, and DaemonSet/StatefulSet readiness problems.
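
A few of these conditions can be sketched as simple comparisons. The Python example below assumes a hypothetical, flattened snapshot in which each metric name maps to a single number for one object; real kube-state-metrics series carry labels (namespace, object name, and so on), and in practice these checks would usually live in the monitoring system's alert rules rather than in application code.

```python
def state_alerts(snapshot):
    """Return alert messages for one flattened snapshot of state metrics.

    `snapshot` is a hypothetical dict of metric name -> value for a single
    object; missing keys are treated as 0.
    """
    get = lambda name: snapshot.get(name, 0)
    alerts = []

    if get("kube_job_status_failed") > 0:
        alerts.append("job failures")

    # The controller has not yet observed the latest spec generation.
    if (get("kube_deployment_status_observed_generation")
            < get("kube_deployment_metadata_generation")):
        alerts.append("deployment generation mismatch")

    # Fewer nodes run the DaemonSet pod than should.
    if (get("kube_daemonset_status_current_number_scheduled")
            < get("kube_daemonset_status_desired_number_scheduled")):
        alerts.append("daemonset pods missing")

    return alerts
```

For example, a snapshot with a metadata generation of 4 and an observed generation of 3 produces a single "deployment generation mismatch" alert.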

Control‑Plane Metrics

etcd leader presence (etcd_server_has_leader) : 1 if the member knows its leader.

etcd leader changes (etcd_server_leader_changes_seen_total) : total number of leader changes the member has seen; a high rate may indicate connectivity or resource issues.

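API request count (apiserver_request_latencies_count) and total latency (apiserver_request_latencies_sum) : average latency = sum / count. From Kubernetes v1.14 onward the equivalent series are apiserver_request_duration_seconds_count and apiserver_request_duration_seconds_sum.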

Workqueue queue duration (workqueue_queue_duration_seconds) : time items spend waiting in controller manager queues.

Workqueue processing duration (workqueue_work_duration_seconds) : time spent processing items.

Scheduler unschedulable attempts (scheduler_schedule_attempts_total{result="unschedulable"}) : number of pods the scheduler could not place.

Pod scheduling latency (scheduler_e2e_scheduling_duration_seconds, or scheduler_e2e_scheduling_delay_microseconds for versions < v1.14) : total time from pod creation to being bound to a node.
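
The "average latency = sum / count" relation for the API server applies to cumulative counters, so dividing the raw values gives an average over the whole process lifetime. To see recent behavior, divide the change in sum by the change in count between two scrapes. The following Python sketch shows that calculation; the function name and the sample numbers are assumptions for illustration.

```python
def average_latency(prev_sum, prev_count, curr_sum, curr_count):
    """Average request latency between two scrapes of cumulative counters.

    Returns None when no new requests were recorded or when the counters
    went backwards (for example after an API server restart).
    The result is in the same unit as the *_sum series.
    """
    delta_count = curr_count - prev_count
    delta_sum = curr_sum - prev_sum
    if delta_count <= 0 or delta_sum < 0:
        return None
    return delta_sum / delta_count


# Two hypothetical scrapes: 30 new requests added 60.0 units of latency.
recent = average_latency(prev_sum=100.0, prev_count=50,
                         curr_sum=160.0, curr_count=80)
```

Returning None on a counter reset avoids reporting a negative or misleading average for that interval.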

Control‑Plane Health Checks

Continuously monitor etcd leadership: a member that reports no leader cannot commit writes, so alert whenever etcd_server_has_leader is 0. Frequent leader changes suggest cluster instability. Track average API latency to detect API-server overload. Watch workqueue delays for controller bottlenecks. Rising unschedulable attempts or scheduling latency are early signs of resource exhaustion.

Cluster Events

In addition to numeric metrics, ingest Kubernetes events (e.g., pod lifecycle events). A sudden spike in event rate often precedes larger failures and can be used as a warning signal.
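
One simple way to turn event rate into a warning signal is to compare the latest interval against a short rolling baseline. The Python sketch below assumes events have already been counted per minute; the window size, the 3x factor, and the minimum baseline of 1 are illustrative assumptions.

```python
from statistics import mean


def is_event_spike(counts_per_minute, window=5, factor=3.0):
    """Return True if the latest count exceeds `factor` times the mean
    of the `window` counts that precede it.

    Returns False when there is not enough history to form a baseline.
    """
    if len(counts_per_minute) < window + 1:
        return False
    baseline = mean(counts_per_minute[-(window + 1):-1])
    latest = counts_per_minute[-1]
    # Floor the baseline at 1 so a near-silent cluster does not alert
    # on a handful of events.
    return latest > factor * max(baseline, 1.0)
```

With a history of 10, 12, 9, 11, and 10 events per minute, a reading of 60 counts as a spike, while a reading of 14 does not.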

Application Metrics

Application‑level metrics are emitted by workloads, not by Kubernetes itself. They can be collected via push (e.g., StatsD) or pull (OpenMetrics). Pull‑based collection is preferred because the application only needs to expose an /metrics endpoint, while the collector handles discovery and scraping.
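
As a rough illustration of the pull model, the Python sketch below exposes a /metrics endpoint using only the standard library. The metric name app_requests_handled_total, the port, and the in-memory counter are hypothetical; a real service would more likely use an existing client library than hand-render the text format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter that the application would increment.
REQUESTS_HANDLED = {"value": 0}


def render_metrics(requests_handled):
    """Render one counter in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_handled_total Requests handled by this process.\n"
        "# TYPE app_requests_handled_total counter\n"
        f"app_requests_handled_total {requests_handled}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_HANDLED["value"]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type",
                         "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given port (assumed value)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. The application only renders its current values on request, and the collector decides when and how often to scrape.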

Summary

Kubernetes can emit millions of new metrics per day. Monitoring systems must be able to ingest this volume, filter noise, and surface the most critical signals. Effective monitoring therefore requires:

Collecting core resource, state, and control‑plane metrics listed above.

Alerting on the specific conditions described.

Using a pull‑based collector (e.g., Prometheus) to handle large metric cardinality and to integrate with service discovery.

Reference: https://www.kubernetes.org.cn/8752.html

