
Essential Prometheus Operator Metrics for Kubernetes: Prevent Alert Overload

This guide explains the most common Prometheus Operator metrics for Kubernetes, detailing each metric's purpose, the PromQL expression to monitor it, and the related underlying metrics, helping you fine‑tune alerts and avoid unnecessary noise in your cluster monitoring.

MaGe Linux Operations

Oct 6, 2020


The Prometheus Operator ships with many default alerting rules. Left untuned, they can generate a large number of alerts, so it helps to understand what each common rule measures and adjust its threshold as needed.

1. Kubernetes Resource Metrics

1.1 CPUThrottlingHigh – Checks the reasonableness of CPU limits by finding containers where more than 25% of CPU cycles were throttled in the last 5 minutes.

sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
  /
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
  > (25 / 100)

Related metrics:

container_cpu_cfs_periods_total – total number of CPU periods in a container's lifetime

container_cpu_cfs_throttled_periods_total – total number of throttled CPU periods in a container's lifetime
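
To make the ratio concrete, the following Python sketch reproduces the arithmetic of the expression above for a single container. The function name and the sample counter values are hypothetical; only the 25% threshold comes from the rule.

```python
def throttled_fraction(throttled_delta: float, periods_delta: float) -> float:
    """Fraction of CFS periods that were throttled within one window.

    Both arguments are counter increases over the same window, which is what
    increase(...[5m]) yields for the two container_cpu_cfs_* counters.
    """
    if periods_delta <= 0:
        # No CFS periods elapsed in the window; this sketch reports 0 in that case.
        return 0.0
    return throttled_delta / periods_delta


# Hypothetical sample: 3000 periods elapsed in 5 minutes, 900 of them throttled.
fraction = throttled_fraction(900, 3000)
fires = fraction > 25 / 100
```

With these numbers 30% of the periods were throttled, so the rule would fire. Raising the container's CPU limit or relaxing the threshold are the usual ways to quiet it.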

1.2 KubeCPUOvercommit – Detects cluster‑wide CPU over‑commitment that could make the cluster intolerant to node failures.

sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{})
  /
sum(kube_node_status_allocatable_cpu_cores)
  > (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)

Related metrics:

kube_pod_container_resource_requests_cpu_cores – CPU cores requested by pods

kube_node_status_allocatable_cpu_cores – total allocatable CPU cores per node
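
The right-hand side of the rule, (n - 1) / n, is the share of allocatable CPU that would still be available after losing one node, assuming nodes of roughly equal size. A minimal Python sketch of that check, with hypothetical function names and cluster sizes:

```python
def overcommit_threshold(node_count: int) -> float:
    """Share of capacity left after losing one of node_count equal nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return (node_count - 1) / node_count


def is_overcommitted(requested_cores: float, allocatable_cores: float, node_count: int) -> bool:
    """Mirror of the rule: requests / allocatable > (n - 1) / n."""
    return requested_cores / allocatable_cores > overcommit_threshold(node_count)


# Hypothetical cluster: 4 nodes with 8 allocatable cores each (32 in total).
threshold = overcommit_threshold(4)            # 0.75
tight = is_overcommitted(26, 32, 4)            # 0.8125 > 0.75
comfortable = is_overcommitted(20, 32, 4)      # 0.625 <= 0.75
```

The same arithmetic applies to KubeMemoryOvercommit, with bytes in place of cores.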

1.3 KubeMemoryOvercommit – Detects memory over‑commitment that could affect cluster stability.

sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum{})
  /
sum(kube_node_status_allocatable_memory_bytes)
  > (count(kube_node_status_allocatable_memory_bytes)-1) / count(kube_node_status_allocatable_memory_bytes)

Related metrics:

kube_pod_container_resource_requests_memory_bytes – memory requested by pods

kube_node_status_allocatable_memory_bytes – total allocatable memory per node

1.4 KubeCPUQuotaOvercommit – Checks whether the total CPU limits exceed the cluster’s total CPU capacity.

sum(kube_pod_container_resource_limits_cpu_cores{job="kube-state-metrics"})
  /
sum(kube_node_status_allocatable_cpu_cores)
  > 1.1

Related metrics:

kube_pod_container_resource_limits_cpu_cores – CPU limits set for pods

kube_node_status_allocatable_cpu_cores – total allocatable CPU cores per node

1.5 KubeMemoryQuotaOvercommit – Checks whether the total memory limits exceed the cluster’s total memory capacity.

sum(kube_pod_container_resource_limits_memory_bytes{job="kube-state-metrics"})
  /
sum(kube_node_status_allocatable_memory_bytes{job="kube-state-metrics"})
  > 1.1

Related metrics:

kube_pod_container_resource_limits_memory_bytes – memory limits set for pods

kube_node_status_allocatable_memory_bytes – total allocatable memory per node

1.6 KubeMEMQuotaExceeded – Namespace‑level ratio of memory requests to memory limits; alerts when total requests exceed 80% of total limits.

sum(kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"}) by (namespace)
  / (sum(kube_pod_container_resource_limits_memory_bytes{job="kube-state-metrics"}) by (namespace))
  > 0.8

Related metrics:

kube_pod_container_resource_requests_memory_bytes – memory requested

kube_pod_container_resource_limits_memory_bytes – memory limit

1.7 KubeCPUQuotaExceeded – Namespace‑level ratio of CPU requests to CPU limits; alerts when total requests exceed 80% of total limits.

sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}) by (namespace)
  / (sum(kube_pod_container_resource_limits_cpu_cores{job="kube-state-metrics"}) by (namespace))
  > 0.8

Related metrics:

kube_pod_container_resource_requests_cpu_cores – CPU requested

kube_pod_container_resource_limits_cpu_cores – CPU limit

2. Kubernetes Storage Metrics

2.1 KubePersistentVolumeFillingUp – Monitors PVC capacity usage; alerts when available space falls below 30% of total.

kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics"}
  /
kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics"}
  < 0.3

Related metrics:

kubelet_volume_stats_available_bytes – free space

kubelet_volume_stats_capacity_bytes – total capacity

2.2 KubePersistentVolumeFillingUp (prediction) – Predicts disk exhaustion: alerts when less than 40% of the volume is free and a linear forecast based on the last 6 hours of data shows free space dropping below zero within 4 days.

(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics"}
  /
kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics"}) < 0.4
  and
predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics"}[6h], 4*24*3600) < 0

Related metrics:

kubelet_volume_stats_available_bytes – free space

kubelet_volume_stats_capacity_bytes – total capacity
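
The forecast half of the rule can be illustrated with an ordinary least-squares fit. The Python sketch below is a simplified stand-in for predict_linear, not Prometheus's actual implementation; the function signature and the sample values (free space in GiB, one sample per hour) are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]],
                   eval_time: float,
                   horizon_seconds: float) -> float:
    """Fit a least-squares line through (timestamp, value) samples and
    return its value at eval_time + horizon_seconds."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + horizon_seconds)


# Hypothetical volume of 100 GiB losing 1 GiB of free space per hour for 6 hours.
capacity_gib = 100.0
samples = [(h * 3600.0, 40.0 - h) for h in range(7)]   # 40 GiB down to 34 GiB
now, free_now = samples[-1]

forecast = predict_linear(samples, now, 4 * 24 * 3600)  # 4 days ahead
fires = (free_now / capacity_gib) < 0.4 and forecast < 0
```

Here the forecast is about -62 GiB and only 34% of the volume is free, so both conditions hold. A volume that is 34% free but stable would not alert, which is what keeps this rule quieter than a plain threshold.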

2.3 KubePersistentVolumeErrors – Detects PersistentVolumes in Failed or Pending phases.

kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"}

Related metrics:

kube_persistentvolume_status_phase – PV status

3. Kubernetes System Metrics

3.1 KubeVersionMismatch – Checks if component versions differ from the cluster version.

count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1

Related metrics:

kubernetes_build_info – component version information
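
The label_replace call strips any suffix from gitVersion so that only the vMAJOR.MINOR.PATCH part is compared; a mismatch means more than one distinct value remains after that normalization. The Python sketch below applies the same pattern to a few hypothetical version strings. The helper names are illustrative only.

```python
import re

# Same pattern as in the label_replace call. Prometheus anchors the regex to the
# whole label value, which re.fullmatch reproduces here.
VERSION_PATTERN = re.compile(r"(v[0-9]*.[0-9]*.[0-9]*).*")


def normalize(git_version: str) -> str:
    """Return the captured version prefix, or the input if the pattern does not match."""
    match = VERSION_PATTERN.fullmatch(git_version)
    return match.group(1) if match else git_version


def distinct_versions(git_versions) -> int:
    """Number of different versions left after normalization."""
    return len({normalize(v) for v in git_versions})


# Hypothetical gitVersion values reported by three components.
reported = ["v1.18.3", "v1.18.3-eks-abc123", "v1.18.6"]
count = distinct_versions(reported)
mismatch = count > 1
```

The first two strings collapse to v1.18.3, so two distinct versions remain and the sketch reports a mismatch.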

3.2 KubeClientErrors – Client request error rate (5xx responses) over the last 5 minutes.

(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job)
  /
sum(rate(rest_client_requests_total[5m])) by (instance, job))
  > 0.01

Related metrics:

rest_client_requests_total – HTTP status codes of client requests

4. APIServer Metrics

4.1 KubeAPIErrorsHigh – API server request error rate (5xx) over 5 minutes.

sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb)
  /
sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb)
  > 0.05

Related metrics:

apiserver_request_total – total API server requests

4.2 KubeClientCertificateExpiration – Alerts when client certificates expire within 30 days or 7 days.

apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 2592000
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800

Related metrics:

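apiserver_client_certificate_expiration_seconds_count – number of observations in the client certificate expiration histogram; the remaining validity itself is estimated from apiserver_client_certificate_expiration_seconds_bucket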
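
The two thresholds are plain second counts for 30 days and 7 days. The short Python sketch below spells out that conversion; the function and the level names it returns are hypothetical and are not taken from the rules themselves.

```python
SECONDS_PER_DAY = 24 * 60 * 60

THIRTY_DAYS = 30 * SECONDS_PER_DAY   # 2592000, threshold of the first expression
SEVEN_DAYS = 7 * SECONDS_PER_DAY     # 604800, threshold of the second expression


def expiry_level(remaining_seconds: float) -> str:
    """Classify the estimated remaining certificate lifetime.

    Below 7 days both expressions are true; this sketch reports the tighter one.
    """
    if remaining_seconds < SEVEN_DAYS:
        return "under_7_days"
    if remaining_seconds < THIRTY_DAYS:
        return "under_30_days"
    return "ok"


# Hypothetical certificate with 10 days left.
level = expiry_level(10 * SECONDS_PER_DAY)
```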

4.3 AggregatedAPIErrors – Monitors custom‑registered API services; alerts if unavailable count exceeds 2 in 5 minutes.

sum by (name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2

Related metrics:

aggregator_unavailable_apiservice_count – unavailable custom APIService occurrences

4.4 KubeAPIDown – Detects when the API server is down or unreachable.

absent(up{job="apiserver"} == 1)

5. Kubelet Metrics

5.1 KubeNodeNotReady – Checks if a node is not in Ready condition.

kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0

Related metrics:

kube_node_status_condition – node readiness status

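5.2 KubeNodeUnreachable – Despite the name, this expression flags nodes marked as unschedulable (for example, cordoned nodes), not nodes that have lost contact with the control plane.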

kube_node_spec_unschedulable{job="kube-state-metrics"} == 1

5.3 KubeletTooManyPods – Alerts when a node runs close to its pod capacity (95%+).

max(max(kubelet_running_pod_count{job="kubelet",metrics_path="/metrics"}) by (instance) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) by (node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) by (node) > 0.95

Related metrics:

kubelet_running_pod_count – number of pods running on a node

kubelet_node_name – node name

kube_node_status_capacity_pods – maximum pod capacity per node

5.4 KubeNodeReadinessFlapping – Monitors frequency of node readiness state changes.

sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (node) > 2

5.5 KubeletDown – Detects when the kubelet service is down or unreachable.

absent(up{job="kubelet",metrics_path="/metrics"} == 1)

6. Cluster Component Metrics

6.1 KubeSchedulerDown – Checks if the scheduler is down.

absent(up{job="kube-scheduler"} == 1)

6.2 KubeControllerManagerDown – Checks if the controller‑manager is down.

absent(up{job="kube-controller-manager"} == 1)

7. Application‑Level Metrics

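7.1 KubePodCrashLooping – Alerts when a container has restarted at least once in the last 5 minutes. The expression scales the per‑second restart rate to restarts per 3 minutes, so any value above 0 fires.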

rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[5m]) * 60 * 3 > 0

Related metrics:

kube_pod_container_status_restarts_total – restart count per container
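
The multiplication by 60 * 3 only changes the unit of the rate. The Python sketch below shows the conversion for a hypothetical container; it approximates rate() as increase divided by window length and ignores the extrapolation Prometheus applies at window edges.

```python
def restarts_per_3_minutes(restarts_in_window: float, window_seconds: float = 300.0) -> float:
    """Convert restarts seen in a window to restarts per 3 minutes."""
    per_second = restarts_in_window / window_seconds   # rough equivalent of rate(...[5m])
    return per_second * 60 * 3


# Hypothetical container that restarted once in the last 5 minutes.
value = restarts_per_3_minutes(1)
fires = value > 0
```

A single restart already yields 0.6 restarts per 3 minutes, so the rule fires. To reduce noise, a threshold above 0 or a longer evaluation period is needed.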

7.2 KubePodNotReady – Detects pods stuck in the Pending or Unknown phase, excluding pods owned by Jobs.

sum by (namespace, pod) (max by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) max by (namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0

Related metrics:

kube_pod_status_phase – pod phase status

7.3 KubeDeploymentGenerationMismatch – Detects when the observed generation of a Deployment differs from its metadata generation.

kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"}

Related metrics:

kube_deployment_status_observed_generation – observed generation of a Deployment

kube_deployment_metadata_generation – desired generation of a Deployment

7.4 KubeDeploymentReplicasMismatch – Alerts when the number of available replicas does not match the desired replica count and the updated replica count has not changed in the last 3 minutes.

(kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"})
  and (changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[3m]) == 0)

Related metrics:

kube_deployment_spec_replicas – desired replica count

kube_deployment_status_replicas_available – currently available replicas

kube_deployment_status_replicas_updated – updated replica count

7.5 KubeStatefulSetReplicasMismatch – Alerts when ready replicas differ from the current replica count and the updated replica count has not changed in the last 5 minutes.

(kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"})
  and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0)

Related metrics:

kube_statefulset_status_replicas_ready – ready replicas

kube_statefulset_status_replicas – current replicas

kube_statefulset_status_replicas_updated – updated replicas

7.6 KubeStatefulSetUpdateNotRolledOut – Detects StatefulSet updates that have not been rolled out: the current and update revisions differ, and the updated replica count does not match the desired count.

max without (revision) (kube_statefulset_status_current_revision{job="kube-state-metrics"} unless kube_statefulset_status_update_revision{job="kube-state-metrics"})
  * (kube_statefulset_replicas{job="kube-state-metrics"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics"})

Related metrics:

kube_statefulset_status_current_revision – current revision of the StatefulSet

kube_statefulset_status_update_revision – update revision of the StatefulSet

kube_statefulset_replicas – desired replica count

kube_statefulset_status_replicas_updated – updated replica count

7.7 KubeDaemonSetRolloutStuck – Checks if a DaemonSet has fewer ready pods than desired.

kube_daemonset_status_number_ready{job="kube-state-metrics"}
  /
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00

Related metrics:

kube_daemonset_status_number_ready – ready DaemonSet pods

kube_daemonset_status_desired_number_scheduled – desired DaemonSet pods

7.8 KubeDaemonSetMisScheduled – Detects DaemonSet pods running on nodes where they should not be scheduled.

kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0

Related metrics:

kube_daemonset_status_number_misscheduled – misscheduled DaemonSet pods

7.9 KubeContainerWaiting – Lists containers that are in a waiting state.

sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0

Related metrics:

kube_pod_container_status_waiting_reason – waiting reason for containers

8. Node‑Level Metrics

8.1 NodeClockNotSynchronising – Detects loss of synchronization with time servers.

min_over_time(node_timex_sync_status[5m]) == 0

Related metrics:

node_timex_sync_status – time synchronization status

8.2 NodeClockSkewDetected – Detects significant local clock offset.

(node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0)
  or
(node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)

Related metrics:

node_timex_offset_seconds – clock offset in seconds
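
The rule combines the size of the offset with its trend, so a clock that is off but already converging does not alert. A minimal Python sketch of that logic follows; the function name and sample values are hypothetical, and the derivative argument stands in for deriv(node_timex_offset_seconds[5m]).

```python
def clock_skew_detected(offset_seconds: float,
                        offset_derivative: float,
                        limit: float = 0.05) -> bool:
    """True when the offset is beyond the limit and not moving back toward zero."""
    ahead_and_not_recovering = offset_seconds > limit and offset_derivative >= 0
    behind_and_not_recovering = offset_seconds < -limit and offset_derivative <= 0
    return ahead_and_not_recovering or behind_and_not_recovering


# Hypothetical readings: (offset in seconds, trend in seconds per second).
growing = clock_skew_detected(0.08, 0.001)       # ahead and still drifting
recovering = clock_skew_detected(0.08, -0.002)   # ahead but converging
behind = clock_skew_detected(-0.07, -0.001)      # behind and still drifting
small = clock_skew_detected(0.01, 0.5)           # within the 50 ms limit
```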

8.3 NodeHighNumberConntrackEntriesUsed – Alerts when conntrack usage exceeds 75% of its limit.

(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75

Related metrics:

node_nf_conntrack_entries – allocated conntrack entries

node_nf_conntrack_entries_limit – total conntrack capacity

8.4 NodeNetworkReceiveErrs – Detects a surge in network receive errors.

increase(node_network_receive_errs_total[2m]) > 10

Related metrics:

node_network_receive_errs_total – total receive errors

8.5 NodeNetworkTransmitErrs – Detects a surge in network transmit errors.

increase(node_network_transmit_errs_total[2m]) > 10

Related metrics:

node_network_transmit_errs_total – total transmit errors

8.6 NodeFilesystemAlmostOutOfFiles – Alerts when free inodes drop below 5%.

(node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100) < 5
  and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0

Related metrics:

node_filesystem_files_free – free inodes

node_filesystem_files – total inodes

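8.7 NodeFilesystemFilesFillingUp – Predicts inode exhaustion: alerts when fewer than 20% of inodes are free and a linear forecast based on the last 6 hours of data shows free inodes dropping below zero within 4 hours.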

(node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100) < 20
  and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
  and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0

Related metrics:

node_filesystem_files_free – free inodes

node_filesystem_files – total inodes

8.8 NodeFilesystemAlmostOutOfSpace – Alerts when filesystem free space falls below 10%.

(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100) < 10
  and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0

Related metrics:

node_filesystem_avail_bytes – free bytes

node_filesystem_size_bytes – total bytes

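8.9 NodeFilesystemSpaceFillingUp – Predicts filesystem space exhaustion: alerts when less than 15% of space is free and a linear forecast based on the last 6 hours of data shows free space dropping below zero within 4 hours.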

(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100) < 15
  and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
  and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0

Related metrics:

node_filesystem_avail_bytes – free bytes

node_filesystem_size_bytes – total bytes

9. Etcd Metrics

9.1 EtcdLive – Checks if etcd instances are up.

up{job="etcd"} < 1

9.2 EtcdClusterUnavailable – Alerts when the number of down etcd members exceeds half the member count minus one, that is, when the cluster is at or beyond its fault tolerance.

count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
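
To see where this threshold sits relative to quorum, the Python sketch below evaluates the same inequality for a few cluster sizes and compares it with the number of failures a majority quorum can absorb. The function names are hypothetical.

```python
def fires(down: int, members: int) -> bool:
    """Mirror of the rule: count(up == 0) > count(up) / 2 - 1."""
    if down == 0:
        # count(up == 0) returns no series when nothing is down, so the rule cannot fire.
        return False
    return down > members / 2 - 1


def tolerated_failures(members: int) -> int:
    """Failures a majority quorum survives: members minus quorum size."""
    quorum = members // 2 + 1
    return members - quorum


def first_firing_count(members: int) -> int:
    """Smallest number of down members for which the rule fires."""
    return next(d for d in range(1, members + 1) if fires(d, members))


three = (first_firing_count(3), tolerated_failures(3))   # (1, 1)
four = (first_firing_count(4), tolerated_failures(4))    # (2, 1)
five = (first_firing_count(5), tolerated_failures(5))    # (2, 2)
```

For odd-sized clusters the rule fires as soon as the tolerated number of members is down, which is one failure before quorum is actually lost. For a 4-member cluster it fires only once quorum is already gone.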

9.3 EtcdLeaderCheck – Ensures a leader exists.

max(etcd_server_has_leader) != 1

9.4 EtcdBackendFsync – Monitors backend disk commit latency; alerts if 99th percentile exceeds 100 seconds.

histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]) by (instance, le))) > 100

9.5 EtcdWalFsync – Monitors WAL fsync latency; alerts if 99th percentile exceeds 100 seconds.

histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]) by (instance, le))) > 100

9.6 EtcdDbSize – Alerts when etcd database size exceeds 1 GiB.

etcd_debugging_mvcc_db_total_size_in_bytes / 1024 / 1024 > 1024

9.7 EtcdGrpc – Monitors gRPC request rate; alerts if rate exceeds 100 requests per second.

sum(rate(grpc_server_handled_total{grpc_type="unary"}[1m])) > 100

10. CoreDNS Metrics

10.1 DnsRequest – Alerts when the DNS query rate exceeds 100 queries per second (irate returns a per‑second rate, not a per‑minute one).

sum(irate(coredns_dns_request_count_total{zone != "dropped"}[1m])) > 100

Related metrics:

coredns_dns_request_count_total – total DNS queries
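
Because irate looks only at the last two samples in the window and divides by the time between them, its result is always per second. The Python sketch below shows that calculation in simplified form; the function is an illustration rather than Prometheus's implementation, and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def irate(samples: Sequence[Tuple[float, float]]) -> float:
    """Per-second rate from the last two (timestamp_seconds, counter_value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        raise ValueError("timestamps must increase")
    # On a counter reset the counter restarted from zero, so the last value is the increase.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


# Hypothetical scrapes 30 seconds apart within a 1-minute window.
scrapes = [(0.0, 1000.0), (30.0, 2500.0), (60.0, 7000.0)]
queries_per_second = irate(scrapes)       # (7000 - 2500) / 30
fires = queries_per_second > 100
```

In this example the result is 150 queries per second, or 9000 per minute, so a threshold of 100 is far higher than "100 queries per minute" would suggest.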

10.2 DnsRequestFailed – Alerts on DNS responses with error codes other than NOERROR.

irate(coredns_dns_response_rcode_count_total{rcode != "NOERROR"}[1m]) > 0

Related metrics:

coredns_dns_response_rcode_count_total – DNS response status codes

10.3 DnsPanic – Alerts when the CoreDNS panic counter rises faster than 100 per second, which this rule treats as a sign of a potential DNS attack.

irate(coredns_panic_count_total[1m]) > 100

Reference Links

https://my.oschina.net/54188zz/blog/4305978

https://github.com/coreos/kube-prometheus

https://github.com/kubernetes-monitoring/kubernetes-mixin
