
16 Must‑Watch Kubernetes Metrics to Keep Your Cluster Healthy

This article identifies the 16 most critical Kubernetes metrics—ranging from crash loops and CPU utilization to etcd leader status and application metrics—that you should monitor and alert on to maintain a healthy, performant cluster.


Kubernetes generates millions of new metrics daily, making it challenging to filter the important ones for cluster health monitoring.

1. Crash Loops

Crash loops occur when a pod repeatedly crashes and restarts, preventing the application from running.

Often caused by application crashes inside the pod.

May result from misconfigurations in the pod or deployment process.

Logs must be examined to resolve the issue.

The open‑source component kube-eventer can be used to push these events.
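As an illustrative sketch (assuming Prometheus scraping kube-state-metrics, a stack the article does not specify), a crash-loop alert could be expressed like this:

```yaml
# Hypothetical Prometheus alerting rule. kube_pod_container_status_restarts_total
# is exported by kube-state-metrics; the threshold and window are illustrative.
groups:
  - name: crash-loops
    rules:
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```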

2. CPU Utilization

CPU utilization measures the percentage of CPU used by a node. Monitoring is important for two reasons:

Applications must not exhaust their allocated CPU; if they are hitting their limits, increase the CPU allocation or the number of pods, which may mean adding more nodes.

Idle CPUs indicate over‑provisioned resources and wasted cost.
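Both failure modes above can be watched with a pair of alert rules. A sketch, assuming node_exporter metrics in Prometheus (not mentioned in the article; thresholds are illustrative):

```yaml
# Node CPU usage = 1 - idle fraction, from node_exporter's node_cpu_seconds_total.
- alert: NodeCPUSaturated
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
  for: 15m
- alert: NodeCPUUnderused   # sustained idle CPU suggests over-provisioning
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2
  for: 1d
```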

3. Disk Pressure

Disk pressure signals that a node is using too much disk space or consuming it too quickly, based on configured thresholds.

If the application legitimately needs more space, additional disk capacity may be required.

Unexpected rapid disk consumption may indicate abnormal application behavior.
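Node conditions like this one can be alerted on directly. A minimal sketch, assuming kube-state-metrics is scraped by Prometheus (the same pattern works for the other node conditions by swapping the condition label):

```yaml
# kube_node_status_condition is exported by kube-state-metrics.
- alert: NodeDiskPressure
  expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
  for: 5m
```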

4. Memory Pressure

Memory pressure indicates insufficient memory on a node.

It may reveal memory leaks within applications.

5. PID Pressure

PID pressure is a rare condition where pods or containers generate excessive processes, exhausting the node’s available process IDs.

Each node has a finite number of PIDs to allocate.

When IDs are exhausted, new processes cannot start.

Kubernetes allows setting PID thresholds for pods; PID pressure means one or more pods have exhausted their allocated PIDs and need investigation.
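A sketch of both sides of this: alerting on the PIDPressure node condition (assuming kube-state-metrics in Prometheus, not specified in the article) and capping per-pod PIDs via the kubelet's podPidsLimit field (a real KubeletConfiguration field; the value is illustrative):

```yaml
# Prometheus alert on the node condition exported by kube-state-metrics.
- alert: NodePIDPressure
  expr: kube_node_status_condition{condition="PIDPressure",status="true"} == 1
  for: 5m
---
# KubeletConfiguration fragment setting a per-pod process-ID cap.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 4096
```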

6. Network Unavailable

All nodes require network connectivity; this condition indicates network connection problems on a node.

Causes include misconfiguration (e.g., exhausted route tables) or physical network hardware issues.

The open‑source component KubeNurse can be used for cluster network monitoring.

7. Job Failures

Jobs run pods for a limited time and release them after completing their intended function.

Failures may occur due to node crashes, restarts, or resource exhaustion.

While not always indicating application inaccessibility, unresolved failures can cause future issues.

The open‑source component kube-eventer can be used to push these events.
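A minimal alert on failed Jobs, assuming kube-state-metrics is scraped by Prometheus (an assumption; the article names no metrics backend):

```yaml
# kube_job_status_failed counts failed pods per Job (kube-state-metrics).
- alert: JobFailed
  expr: kube_job_status_failed > 0
  for: 5m
```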

8. Persistent Volume Failures

Persistent volumes provide storage that outlives individual pods; a volume is bound while a workload uses it and reclaimed when no longer needed.

If reclamation fails for any reason, it signals problems with persistent storage.
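Failed volumes surface as a phase that can be alerted on. A sketch, assuming kube-state-metrics in Prometheus (not specified in the article):

```yaml
# kube_persistentvolume_status_phase exposes one series per phase.
- alert: PersistentVolumeFailed
  expr: kube_persistentvolume_status_phase{phase="Failed"} == 1
  for: 5m
```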

9. Pod Pending Delays

A pod in the "pending" state is waiting to be scheduled on a node; prolonged pending usually means insufficient resources.

May require adjusting CPU/memory allocations, deleting pods, or adding more nodes.

The open‑source component kube-eventer can be used to push these events.
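"Prolonged pending" translates naturally into an alert with a `for` clause. A sketch, assuming kube-state-metrics in Prometheus (an assumption; the window is illustrative):

```yaml
# Fires only if a pod stays Pending for 15 minutes.
- alert: PodStuckPending
  expr: sum by (namespace, pod) (kube_pod_status_phase{phase="Pending"}) > 0
  for: 15m
```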

10. Deployment Glitches

Deployments manage stateless applications where pods are interchangeable.

Monitor deployments to ensure rollouts complete correctly; the observed replica count must match the desired count, and a persistent mismatch indicates a failed deployment.
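The desired-versus-observed comparison can be written directly as a rule. A sketch, assuming kube-state-metrics in Prometheus (not specified in the article):

```yaml
# Desired replicas vs. replicas actually available (kube-state-metrics).
- alert: DeploymentReplicasMismatch
  expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
  for: 10m
```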

11. StatefulSets Not Ready

StatefulSets manage stateful applications with pods that have specific roles.

Ensure observed StatefulSet count matches the desired count; mismatches indicate failures.

The open‑source component kube-eventer can be used to push these events.
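The same desired-versus-ready check for StatefulSets, sketched as a rule assuming kube-state-metrics in Prometheus (an assumption):

```yaml
- alert: StatefulSetReplicasMismatch
  expr: kube_statefulset_replicas != kube_statefulset_status_replicas_ready
  for: 10m
```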

12. DaemonSets Not Ready

DaemonSets run services or applications on every node in the cluster.

Ensure observed DaemonSet count matches the desired count; mismatches indicate failures.

The open‑source component kube-eventer can be used to push these events.
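And the equivalent check for DaemonSets, again assuming kube-state-metrics in Prometheus (an assumption):

```yaml
- alert: DaemonSetNotReady
  expr: kube_daemonset_status_desired_number_scheduled != kube_daemonset_status_number_ready
  for: 10m
```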

13. etcd Leaders

An etcd cluster should always have a leader, except during leader transitions.

The metric etcd_server_has_leader indicates whether a leader exists.

The metric etcd_server_leader_changes_seen_total counts leader changes; frequent changes may signal connectivity or resource issues.
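Both etcd metrics named above can back alert rules. A sketch (thresholds illustrative; assumes etcd is scraped by Prometheus):

```yaml
- alert: EtcdNoLeader
  expr: etcd_server_has_leader == 0
  for: 1m
- alert: EtcdFrequentLeaderChanges
  expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
```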

14. Scheduler Problems

The scheduler has two aspects to monitor.

Track scheduler_schedule_attempts_total{result="unschedulable"}; an increase in unschedulable pods may indicate resource problems.

Monitor scheduler latency metrics; rising scheduling delays can also point to cluster resource issues.
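Both scheduler aspects can be alerted on. A sketch, assuming the scheduler's metrics endpoint is scraped by Prometheus; note that the latency histogram's name varies by Kubernetes version, and scheduler_scheduling_attempt_duration_seconds here is one recent name (verify against your release):

```yaml
- alert: PodsUnschedulable
  expr: rate(scheduler_schedule_attempts_total{result="unschedulable"}[15m]) > 0
  for: 15m
- alert: SchedulerLatencyHigh
  # p99 scheduling attempt latency above 1s (threshold illustrative).
  expr: >
    histogram_quantile(0.99,
      sum by (le) (rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m]))) > 1
  for: 10m
```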

15. Events

Beyond numeric metrics, collecting and tracking cluster events is valuable for monitoring pod lifecycle and major failures; sudden changes in event rate can serve as early warning signs.

The open‑source component kube-eventer can be used to push these events.

16. Application Metrics

Unlike the previous metrics, application metrics are emitted by workloads running in the cluster, covering error responses, request latency, processing time, etc.

Traditional approach: applications "push" metrics to a collection endpoint.

Modern approach: collectors "pull" metrics from applications (e.g., OpenMetrics), simplifying application code and enabling powerful discovery when combined with service discovery.
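The "pull" model boils down to exposing a plain-text endpoint a collector can scrape. A minimal, self-contained sketch in Python (metric names, labels, and the port are hypothetical, not from the article):

```python
# Minimal sketch of the pull model: render in-process counters in the
# Prometheus text exposition format and serve them at /metrics.
from http.server import BaseHTTPRequestHandler

# Illustrative in-process counters the application would increment.
REQUESTS_TOTAL = {"GET": 0, "POST": 0}

def render_exposition(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Total HTTP requests handled.",
        "# TYPE app_requests_total counter",
    ]
    for method, value in sorted(counters.items()):
        lines.append(f'app_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the exposition text so a collector can pull it."""
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_exposition(REQUESTS_TOTAL).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To serve: http.server.HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

In production you would use a client library rather than hand-rolling the format, but the shape of the contract is the same: the application holds its counters and the collector pulls them on its own schedule.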

Summary

Monitoring Kubernetes health can be complex and overwhelming, but focusing on these high‑value metrics helps filter noise, prioritize issues, and maintain a reliable cluster experience.

Tags: Monitoring, Cloud Native, Operations, Kubernetes, Metrics
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
