Essential Prometheus Operator Metrics for Kubernetes: Prevent Alert Overload
This guide explains the most common Prometheus Operator metrics for Kubernetes, detailing each metric's purpose, the PromQL expression to monitor it, and the related underlying metrics, helping you fine‑tune alerts and avoid unnecessary noise in your cluster monitoring.
A fresh Prometheus Operator installation ships with many default alerting rules, and left untuned they can generate a large volume of alerts. Understanding what each rule measures makes it easier to adjust thresholds to fit your cluster.
1. Kubernetes Resource Metrics
1.1 CPUThrottlingHigh – Flags containers whose CPU limits may be too tight: fires when more than 25% of a container's CFS periods were throttled over the last 5 minutes.
sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
/
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
> (25 / 100)
Related metrics:
container_cpu_cfs_periods_total – total number of CPU periods in a container's lifetime
container_cpu_cfs_throttled_periods_total – total number of throttled CPU periods in a container's lifetime
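The ratio behind CPUThrottlingHigh can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and sample counter values are hypothetical, and unlike PromQL's increase() the sketch does no extrapolation or counter-reset handling.

```python
def throttled_ratio(throttled_start, throttled_end, periods_start, periods_end):
    """Fraction of CFS periods that were throttled between two counter samples."""
    periods = periods_end - periods_start
    if periods <= 0:
        return None  # no CFS periods elapsed, so the ratio is undefined
    return (throttled_end - throttled_start) / periods


def is_throttling_high(ratio, threshold=25 / 100):
    """Mirror of the rule's comparison: ratio > 25 / 100."""
    return ratio is not None and ratio > threshold


# Hypothetical counter values sampled 5 minutes apart.
ratio = throttled_ratio(1_200, 1_500, 40_000, 41_000)
print(ratio, is_throttling_high(ratio))
```

In this made-up case 300 of 1,000 periods were throttled, so the container would exceed the 25% threshold.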
1.2 KubeCPUOvercommit – Detects cluster‑wide CPU over‑commitment that could make the cluster intolerant to node failures.
sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{})
/
sum(kube_node_status_allocatable_cpu_cores)
> (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)
Related metrics:
kube_pod_container_resource_requests_cpu_cores – CPU cores requested by pods
kube_node_status_allocatable_cpu_cores – total allocatable CPU cores per node
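The right-hand side of the KubeCPUOvercommit expression is (N-1)/N for a cluster of N nodes, which is the share of capacity that would remain if one node were lost (assuming equally sized nodes). A minimal Python sketch of that comparison follows; the function names and the three-node example are hypothetical.

```python
def overcommit_threshold(node_count):
    """(N - 1) / N: share of capacity left after losing one of N equal nodes."""
    return (node_count - 1) / node_count


def is_overcommitted(requested_cores, allocatable_per_node):
    """True when total requests would not fit if one node disappeared."""
    total = sum(allocatable_per_node)
    return requested_cores / total > overcommit_threshold(len(allocatable_per_node))


# Hypothetical cluster: three nodes with 8 allocatable cores each.
nodes = [8, 8, 8]
print(overcommit_threshold(len(nodes)))
print(is_overcommitted(18, nodes))  # 18 / 24 = 0.75
print(is_overcommitted(15, nodes))  # 15 / 24 = 0.625
```

The same shape applies to the memory variant of the rule, with bytes in place of cores.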
1.3 KubeMemoryOvercommit – Detects memory over‑commitment that could affect cluster stability.
sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum{})
/
sum(kube_node_status_allocatable_memory_bytes)
> (count(kube_node_status_allocatable_memory_bytes)-1) / count(kube_node_status_allocatable_memory_bytes)
Related metrics:
kube_pod_container_resource_requests_memory_bytes – memory requested by pods
kube_node_status_allocatable_memory_bytes – total allocatable memory per node
1.4 KubeCPUQuotaOvercommit – Alerts when the sum of container CPU limits exceeds 110% of the cluster's total allocatable CPU.
sum(kube_pod_container_resource_limits_cpu_cores{job="kube-state-metrics"})
/
sum(kube_node_status_allocatable_cpu_cores)
> 1.1
Related metrics:
kube_pod_container_resource_limits_cpu_cores – CPU limits set for pods
kube_node_status_allocatable_cpu_cores – total allocatable CPU cores per node
1.5 KubeMemoryQuotaOvercommit – Alerts when the sum of container memory limits exceeds 110% of the cluster's total allocatable memory.
sum(kube_pod_container_resource_limits_memory_bytes{job="kube-state-metrics"})
/
sum(kube_node_status_allocatable_memory_bytes{job="kube-state-metrics"})
> 1.1
Related metrics:
kube_pod_container_resource_limits_memory_bytes – memory limits set for pods
kube_node_status_allocatable_memory_bytes – total allocatable memory per node
1.6 KubeMEMQuotaExceeded – Per‑namespace ratio of memory requests to memory limits; alerts when requests exceed 80% of limits.
sum(kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"}) by (namespace)
/ (sum(kube_pod_container_resource_limits_memory_bytes{job="kube-state-metrics"}) by (namespace))
> 0.8
Related metrics:
kube_pod_container_resource_requests_memory_bytes – memory requested
kube_pod_container_resource_limits_memory_bytes – memory limit
1.7 KubeCPUQuotaExceeded – Per‑namespace ratio of CPU requests to CPU limits; alerts when requests exceed 80% of limits.
sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}) by (namespace)
/ (sum(kube_pod_container_resource_limits_cpu_cores{job="kube-state-metrics"}) by (namespace))
> 0.8
Related metrics:
kube_pod_container_resource_requests_cpu_cores – CPU requested
kube_pod_container_resource_limits_cpu_cores – CPU limit
2. Kubernetes Storage Metrics
2.1 KubePersistentVolumeFillingUp – Monitors PVC capacity usage; alerts when available space falls below 30% of total.
kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics"}
/
kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics"}
< 0.3
Related metrics:
kubelet_volume_stats_available_bytes – free space
kubelet_volume_stats_capacity_bytes – total capacity
2.2 KubePersistentVolumeFillingUp (prediction) – Fits a linear trend to the last 6 hours of free‑space samples; alerts when less than 40% of the volume is free and the trend projects free space dropping below zero within 4 days.
(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics"}
/
kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics"}) < 0.4
and
predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics"}[6h], 4*24*3600) < 0
Related metrics:
kubelet_volume_stats_available_bytes – free space
kubelet_volume_stats_capacity_bytes – total capacity
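To make the prediction rule more concrete, here is a rough Python sketch of a least‑squares linear forecast over a window of samples. It approximates the idea behind predict_linear() rather than reproducing Prometheus exactly (for instance, it projects from the last sample instead of the query evaluation time). The function name and the sample data are hypothetical.

```python
def predict_linear_sketch(samples, horizon_seconds):
    """Least-squares line through (timestamp, value) samples, evaluated
    horizon_seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_seconds)


# Hypothetical 100 GiB volume losing 10 GiB of free space per hour for 6 hours.
capacity_gib = 100
samples = [(hour * 3600, 90 - 10 * hour) for hour in range(7)]
free_now = samples[-1][1]

projected = predict_linear_sketch(samples, 4 * 24 * 3600)
fires = (free_now / capacity_gib) < 0.4 and projected < 0
print(projected, fires)
```

Both conditions must hold for the rule to fire: the volume is already under 40% free, and the fitted line crosses zero within the 4‑day horizon.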
2.3 KubePersistentVolumeErrors – Detects PersistentVolumes in Failed or Pending phases.
kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"}
Related metrics:
kube_persistentvolume_status_phase – PV status
3. Kubernetes System Metrics
3.1 KubeVersionMismatch – Counts the distinct version strings reported by cluster components (DNS jobs excluded); more than one distinct version indicates a mismatch.
count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1
Related metrics:
kubernetes_build_info – component version information
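The label_replace() step in this rule trims vendor suffixes from gitVersion before the distinct values are counted. The following Python sketch mimics that normalization with the same regular expression; the helper names and the example version strings are hypothetical, and re.fullmatch is used as a stand‑in for PromQL's fully anchored matching.

```python
import re

# Same pattern as in the rule; note the dots are unescaped there too.
VERSION_PATTERN = re.compile(r"(v[0-9]*.[0-9]*.[0-9]*).*")


def normalize_git_version(git_version):
    """Keep the leading vMAJOR.MINOR.PATCH part; leave non-matching values as-is."""
    match = VERSION_PATTERN.fullmatch(git_version)
    return match.group(1) if match else git_version


def distinct_versions(git_versions):
    """Set of normalized versions, analogous to count by (gitVersion)."""
    return {normalize_git_version(v) for v in git_versions}


# Hypothetical values reported by different components.
reported = ["v1.18.3", "v1.18.3-gke.1", "v1.17.9"]
versions = distinct_versions(reported)
print(sorted(versions), len(versions) > 1)
```

Here the suffixed build collapses into v1.18.3, leaving two distinct versions, which counts as a mismatch.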
3.2 KubeClientErrors – Client request error rate (5xx responses) over the last 5 minutes.
(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job)
/
sum(rate(rest_client_requests_total[5m])) by (instance, job))
> 0.01
Related metrics:
rest_client_requests_total – HTTP status codes of client requests
4. APIServer Metrics
4.1 KubeAPIErrorsHigh – API server request error rate (5xx) over 5 minutes.
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb)
/
sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb)
> 0.05
Related metrics:
apiserver_request_total – total API server requests
4.2 KubeClientCertificateExpiration – Alerts when client certificates expire within 30 days or 7 days.
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 2592000
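apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800
Related metrics:
apiserver_client_certificate_expiration_seconds_count – number of observations in the histogram of remaining client certificate lifetimes
apiserver_client_certificate_expiration_seconds_bucket – histogram buckets of remaining client certificate lifetime, in seconds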
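The two numeric thresholds in these expressions are simply 30 days and 7 days expressed in seconds. A short Python check of that arithmetic is shown below; the constant and function names are hypothetical, chosen only for this illustration.

```python
from datetime import timedelta

THIRTY_DAYS_SECONDS = int(timedelta(days=30).total_seconds())
SEVEN_DAYS_SECONDS = int(timedelta(days=7).total_seconds())


def matched_threshold(remaining_seconds):
    """Return the tightest threshold (in days) that the remaining lifetime is under."""
    if remaining_seconds < SEVEN_DAYS_SECONDS:
        return 7
    if remaining_seconds < THIRTY_DAYS_SECONDS:
        return 30
    return None


print(THIRTY_DAYS_SECONDS, SEVEN_DAYS_SECONDS)
print(matched_threshold(10 * 24 * 3600))  # a certificate with 10 days left
```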
4.3 AggregatedAPIErrors – Monitors custom‑registered API services; alerts if unavailable count exceeds 2 in 5 minutes.
sum by (name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2
Related metrics:
aggregator_unavailable_apiservice_count – unavailable custom APIService occurrences
4.4 KubeAPIDown – Detects when the API server is down or unreachable.
absent(up{job="apiserver"} == 1)
5. Kubelet Metrics
5.1 KubeNodeNotReady – Checks if a node is not in Ready condition.
kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
Related metrics:
kube_node_status_condition – node readiness status
5.2 KubeNodeUnreachable – Detects nodes marked as unschedulable.
kube_node_spec_unschedulable{job="kube-state-metrics"} == 1
5.3 KubeletTooManyPods – Alerts when a node is running more than 95% of its pod capacity.
max(max(kubelet_running_pod_count{job="kubelet",metrics_path="/metrics"}) by (instance) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) by (node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) by (node) > 0.95
Related metrics:
kubelet_running_pod_count – number of pods running on a node
kubelet_node_name – node name
kube_node_status_capacity_pods – maximum pod capacity per node
5.4 KubeNodeReadinessFlapping – Monitors frequency of node readiness state changes.
sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (node) > 2
5.5 KubeletDown – Detects when the kubelet service is down or unreachable.
absent(up{job="kubelet",metrics_path="/metrics"} == 1)
6. Cluster Component Metrics
6.1 KubeSchedulerDown – Checks if the scheduler is down.
absent(up{job="kube-scheduler"} == 1)
6.2 KubeControllerManagerDown – Checks if the controller‑manager is down.
absent(up{job="kube-controller-manager"} == 1)
7. Application‑Level Metrics
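7.1 KubePodCrashLooping – Alerts when a container's restart counter has increased at all during the last 5 minutes. The factor 60 * 3 rescales the per‑second rate to restarts per 3 minutes and does not change the outcome of the > 0 comparison.
rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[5m]) * 60 * 3 > 0
Related metrics: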
kube_pod_container_status_restarts_total – restart count per container
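As a rough illustration of how the restart expression behaves, the Python sketch below converts a counter increase over a 5‑minute range into a per‑second rate and then rescales it by 60 * 3. The function name and numbers are hypothetical, and the sketch ignores the extrapolation that PromQL's rate() performs.

```python
def scaled_restart_rate(restarts_start, restarts_end, range_seconds=5 * 60, scale=60 * 3):
    """Per-second restart rate over the range, rescaled to restarts per 3 minutes."""
    per_second = (restarts_end - restarts_start) / range_seconds
    return per_second * scale


# A single restart inside the 5-minute range is already enough to exceed 0.
print(scaled_restart_rate(4, 5) > 0)
# No restarts: the expression evaluates to 0 and the alert stays silent.
print(scaled_restart_rate(4, 4) > 0)
```

Because the comparison is against zero, raising the threshold (rather than the scale factor) is the lever to pull if single restarts should not alert.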
7.2 KubePodNotReady – Detects pods that are not ready (Pending or Unknown).
sum by (namespace, pod) (max by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) max by (namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0
Related metrics:
kube_pod_status_phase – pod phase status
7.3 KubeDeploymentGenerationMismatch – Detects when the observed generation of a Deployment differs from its metadata generation.
kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"}
Related metrics:
kube_deployment_status_observed_generation – observed generation of a Deployment
kube_deployment_metadata_generation – desired generation of a Deployment
7.4 KubeDeploymentReplicasMismatch – Alerts when the number of available replicas does not match the desired replica count.
(kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"})
and (changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[3m]) == 0)
Related metrics:
kube_deployment_spec_replicas – desired replica count
kube_deployment_status_replicas_available – currently available replicas
kube_deployment_status_replicas_updated – updated replica count
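The second half of the KubeDeploymentReplicasMismatch expression suppresses the alert while a rollout is still making progress. A small Python sketch of that logic follows, assuming a list of recent samples of the updated‑replica count; the helper names and sample values are hypothetical.

```python
def changes(values):
    """Number of times consecutive samples differ, similar in spirit to changes()."""
    return sum(1 for previous, current in zip(values, values[1:]) if current != previous)


def replicas_mismatch(spec_replicas, available_replicas, updated_history):
    """Mismatch only counts when the updated-replica count has stopped moving."""
    return spec_replicas != available_replicas and changes(updated_history) == 0


# Rollout still progressing: updated replicas changed within the window.
print(replicas_mismatch(5, 3, [1, 2, 3]))
# Rollout stalled: 3 of 5 available and no movement in the window.
print(replicas_mismatch(5, 3, [3, 3, 3]))
```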
7.5 KubeStatefulSetReplicasMismatch – Alerts when ready replicas differ from the desired replica count.
(kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"})
and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0)
Related metrics:
kube_statefulset_status_replicas_ready – ready replicas
kube_statefulset_status_replicas – current replicas
kube_statefulset_status_replicas_updated – updated replicas
7.6 KubeStatefulSetUpdateNotRolledOut – Detects StatefulSet updates that have not been fully rolled out: the current revision differs from the update revision and the updated replica count does not match the desired count.
max without (revision) (kube_statefulset_status_current_revision{job="kube-state-metrics"} unless kube_statefulset_status_update_revision{job="kube-state-metrics"})
* (kube_statefulset_replicas{job="kube-state-metrics"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics"})
Related metrics:
kube_statefulset_status_current_revision – current revision of the StatefulSet
kube_statefulset_status_update_revision – update revision of the StatefulSet
kube_statefulset_replicas – desired replica count
kube_statefulset_status_replicas_updated – updated replica count
7.7 KubeDaemonSetRolloutStuck – Checks if a DaemonSet has fewer ready pods than desired.
kube_daemonset_status_number_ready{job="kube-state-metrics"}
/
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00
Related metrics:
kube_daemonset_status_number_ready – ready DaemonSet pods
kube_daemonset_status_desired_number_scheduled – desired DaemonSet pods
7.8 KubeDaemonSetMisScheduled – Detects DaemonSet pods running on nodes where they should not be scheduled.
kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
Related metrics:
kube_daemonset_status_number_misscheduled – misscheduled DaemonSet pods
7.9 KubeContainerWaiting – Lists containers that are in a waiting state.
sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0
Related metrics:
kube_pod_container_status_waiting_reason – waiting reason for containers
8. Node‑Level Metrics
8.1 NodeClockNotSynchronising – Detects loss of synchronization with time servers.
min_over_time(node_timex_sync_status[5m]) == 0
Related metrics:
node_timex_sync_status – time synchronization status
8.2 NodeClockSkewDetected – Detects significant local clock offset.
(node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0)
or
(node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
Related metrics:
node_timex_offset_seconds – clock offset in seconds
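The clock‑skew expression only fires when the offset is beyond 0.05 seconds and is not shrinking back toward zero. The Python sketch below restates that condition; the function name and test values are hypothetical, and the derivative is passed in directly rather than estimated from samples as deriv() would do.

```python
def clock_skew_detected(offset_seconds, offset_derivative, limit=0.05):
    """True when the offset exceeds the limit and is holding or growing."""
    ahead_and_not_recovering = offset_seconds > limit and offset_derivative >= 0
    behind_and_not_recovering = offset_seconds < -limit and offset_derivative <= 0
    return ahead_and_not_recovering or behind_and_not_recovering


print(clock_skew_detected(0.08, 0.001))   # ahead and still drifting
print(clock_skew_detected(0.08, -0.001))  # ahead but converging
print(clock_skew_detected(-0.06, 0.0))    # behind and not improving
```

The derivative check keeps the rule quiet for a node whose clock is already being corrected.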
8.3 NodeHighNumberConntrackEntriesUsed – Alerts when conntrack usage exceeds 75% of its limit.
(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75
Related metrics:
node_nf_conntrack_entries – allocated conntrack entries
node_nf_conntrack_entries_limit – total conntrack capacity
8.4 NodeNetworkReceiveErrs – Detects a surge in network receive errors.
increase(node_network_receive_errs_total[2m]) > 10
Related metrics:
node_network_receive_errs_total – total receive errors
8.5 NodeNetworkTransmitErrs – Detects a surge in network transmit errors.
increase(node_network_transmit_errs_total[2m]) > 10
Related metrics:
node_network_transmit_errs_total – total transmit errors
8.6 NodeFilesystemAlmostOutOfFiles – Alerts when free inodes drop below 5%.
(node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100) < 5
and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
Related metrics:
node_filesystem_files_free – free inodes
node_filesystem_files – total inodes
8.7 NodeFilesystemFilesFillingUp – Predicts inode exhaustion: alerts when fewer than 20% of inodes are free and a linear forecast over the last 6 hours projects none left within 4 hours. Read‑only filesystems are excluded.
(node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100) < 20
and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
Related metrics:
node_filesystem_files_free – free inodes
node_filesystem_files – total inodes
8.8 NodeFilesystemAlmostOutOfSpace – Alerts when filesystem free space falls below 10%.
(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100) < 10
and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
Related metrics:
node_filesystem_avail_bytes – free bytes
node_filesystem_size_bytes – total bytes
8.9 NodeFilesystemSpaceFillingUp – Predicts filesystem space exhaustion: alerts when less than 15% of space is available and a linear forecast over the last 6 hours projects none left within 4 hours. Read‑only filesystems are excluded.
(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100) < 15
and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
Related metrics:
node_filesystem_avail_bytes – free bytes
node_filesystem_size_bytes – total bytes
9. Etcd Metrics
9.1 EtcdLive – Checks if etcd instances are up.
up{job="etcd"} < 1
9.2 EtcdClusterUnavailable – Alerts when the number of down etcd members exceeds the tolerated fault count.
count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
9.3 EtcdLeaderCheck – Ensures a leader exists.
max(etcd_server_has_leader) != 1
9.4 EtcdBackendFsync – Monitors backend disk commit latency; alerts if 99th percentile exceeds 100 seconds.
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le)) > 100
9.5 EtcdWalFsync – Monitors WAL fsync latency; alerts if 99th percentile exceeds 100 seconds.
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)) > 100
9.6 EtcdDbSize – Alerts when etcd database size exceeds 1 GiB.
etcd_debugging_mvcc_db_total_size_in_bytes / 1024 / 1024 > 1024
9.7 EtcdGrpc – Monitors gRPC request rate; alerts if rate exceeds 100 requests per second.
sum(rate(grpc_server_handled_total{grpc_type="unary"}[1m])) > 100
10. CoreDNS Metrics
10.1 DnsRequest – Alerts when the instantaneous DNS query rate exceeds 100 queries per second (irate returns a per‑second value).
sum(irate(coredns_dns_request_count_total{zone != "dropped"}[1m])) > 100
Related metrics:
coredns_dns_request_count_total – total DNS queries
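Since irate() is computed from the last two samples in the range, its result is a per‑second figure. The Python sketch below shows that calculation for a single series under simplifying assumptions; the function name and sample values are hypothetical.

```python
def irate_sketch(samples):
    """Per-second rate from the last two (timestamp, counter_value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t_prev, v_prev), (t_last, v_last) = samples[-2:]
    # After a counter reset the last value is taken as the increase.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


# Hypothetical scrapes 15 seconds apart.
samples = [(0, 10_000), (15, 11_200), (30, 13_000)]
queries_per_second = irate_sketch(samples)
print(queries_per_second, queries_per_second > 100)
```

Only the final two scrapes matter here: 1,800 queries in 15 seconds gives 120 queries per second.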
10.2 DnsRequestFailed – Alerts on DNS responses with error codes other than NOERROR.
irate(coredns_dns_response_rcode_count_total{rcode != "NOERROR"}[1m]) > 0
Related metrics:
coredns_dns_response_rcode_count_total – DNS response status codes
10.3 DnsPanic – Detects potential DNS attacks by monitoring panic count.
irate(coredns_panic_count_total[1m]) > 100
Reference Links
https://my.oschina.net/54188zz/blog/4305978
https://github.com/coreos/kube-prometheus
https://github.com/kubernetes-monitoring/kubernetes-mixin
MaGe Linux Operations
