
10 Essential PromQL Queries Every Ops Engineer Should Master

This article walks through ten practical PromQL query examples covering CPU, memory, disk, network, application, database, Kubernetes, and business metrics. Each case explains the underlying concepts and suggests alert thresholds and best-practice tips, and the closing sections add guidance on query optimization and alert-rule design for reliable monitoring.


Why PromQL Is a Must‑Have Skill for Operations Engineers

In the cloud-native era, Prometheus has become the de facto standard for monitoring. Its query language, PromQL, offers powerful time-series processing, flexible label selectors, a rich set of built-in functions, and real-time query capabilities. Mastering it dramatically speeds up troubleshooting and helps prevent production incidents.

Case 1: CPU Usage Monitoring and Alerting

Scenario

CPU utilization is a fundamental metric; we need to watch each node and trigger alerts when usage spikes.

Query

# Query single instance CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

# Query cluster‑wide CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Query per‑core usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance, cpu) * 100)

# Alert when usage exceeds 80%
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80

Key Points

rate() computes the per-second increase of a counter metric. node_cpu_seconds_total is cumulative, so applying rate() yields the current usage rate, and by (instance) aggregates the result per host.

Subtracting the idle percentage from 100 gives the active CPU usage.

Practical Advice

Warning level: CPU > 70 % for 5 minutes.

Critical level: CPU > 85 % for 2 minutes.

Emergency level: CPU > 95 % for 1 minute.
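
A minimal sketch of how these three tiers might be expressed as Prometheus alerting rules, reusing the CPU expression above (rule names, severity labels, and group name are illustrative):

groups:
  - name: cpu_alerts
    rules:
      # Warning: CPU > 70% sustained for 5 minutes
      - alert: HighCPUUsageWarning
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 70
        for: 5m
        labels:
          severity: warning
      # Critical: CPU > 85% sustained for 2 minutes
      - alert: HighCPUUsageCritical
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 85
        for: 2m
        labels:
          severity: critical
      # Emergency: CPU > 95% sustained for 1 minute
      - alert: HighCPUUsageEmergency
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 95
        for: 1m
        labels:
          severity: emergency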

Case 2: Accurate Memory Usage Calculation

Scenario

Simple "total‑minus‑available" formulas are inaccurate; we must consider caches and buffers.

Query

# Precise memory usage for Linux
(
  (node_memory_MemTotal_bytes - node_memory_MemFree_bytes -
   node_memory_Buffers_bytes - node_memory_Cached_bytes) /
  node_memory_MemTotal_bytes
) * 100

# Available memory percentage
(node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Swap usage
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100

# Memory pressure alert (usage>90% && swap>50%)
(
  (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
) and (
  (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
)

Key Points

node_memory_MemAvailable_bytes is the most reliable free-memory metric.

Cache and buffer memory can be reclaimed; they should not be counted as used.

High swap usage usually indicates memory pressure.

Practical Advice

Pre‑warning: memory usage > 80 %.

Alert: memory usage > 90 % or swap usage > 30 %.

Severe alert: memory usage > 95 % and swap usage > 50 %.
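
Following the pattern of the memory-pressure query above, the middle tier ("usage > 90 % or swap usage > 30 %") could be sketched with PromQL's or operator; the thresholds come straight from the list above:

# Alert tier: memory usage > 90% OR swap usage > 30%
(
  (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
) or (
  (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 30
)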

Case 3: Disk Space and I/O Monitoring

Scenario

Disk exhaustion and high I/O are common production failures.

Query

# Disk space usage
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100

# Exclude virtual filesystems
(node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"} - node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"}) / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"} * 100

# Disk I/O utilization
rate(node_disk_io_time_seconds_total[5m]) * 100

# IOPS (reads + writes)
rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])

# Bandwidth (bytes per second)
rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])

Key Points

Use fstype!~"tmpfs|fuse.lxcfs|squashfs" to filter out virtual filesystems. node_disk_io_time_seconds_total represents the time the disk is busy.

IOPS and bandwidth are essential performance indicators.

Practical Advice

Space alert: usage > 85 %.

Performance alert: I/O utilization > 80 % for 5 minutes.

Predictive alert: forecast when space will run out based on historical trends.
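
One way to sketch the predictive alert is with predict_linear(), which extrapolates a linear trend from recent samples; the 6-hour lookback and 4-hour horizon here are illustrative choices, not fixed recommendations:

# Fire if the linear trend of free space over the last 6 hours
# predicts the filesystem will be full within the next 4 hours
predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"}[6h], 4*3600) <= 0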

Case 4: Network Traffic and Connection Count Monitoring

Scenario

Network anomalies are often the first sign of service unavailability in micro‑service architectures.

Query

# Incoming traffic (bytes/s)
rate(node_network_receive_bytes_total{device!~"lo|docker.*|veth.*"}[5m])

# Outgoing traffic (bytes/s)
rate(node_network_transmit_bytes_total{device!~"lo|docker.*|veth.*"}[5m])

# Packet rate
rate(node_network_receive_packets_total{device!~"lo|docker.*|veth.*"}[5m])

# Error packet rate
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])

# TCP current connections
node_netstat_Tcp_CurrEstab

# TCP connection establishment rate
rate(node_netstat_Tcp_PassiveOpens[5m]) + rate(node_netstat_Tcp_ActiveOpens[5m])

Key Points

Filter out loopback and container interfaces with device!~"lo|docker.*|veth.*".

High error‑packet rates usually indicate hardware or driver problems.

Excessive TCP connections can exhaust ports.

Practical Advice

Monitor traffic spikes and abnormal patterns.

Watch connection counts to prevent pool exhaustion.

Keep network error rate close to zero.
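
As a sketch, the "close to zero" goal can be turned into a simple threshold alert on the error-packet rate from the query above; the rule name, the > 0 threshold, and the 10-minute hold are illustrative assumptions:

# Fire if any interface reports a sustained nonzero error-packet rate
- alert: NetworkInterfaceErrors
  expr: rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m]) > 0
  for: 10m
  labels:
    severity: warning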

Case 5: Application Service Availability Monitoring

Scenario

Web services need to track availability, response time, and error rate to ensure good user experience.

Query

# HTTP success rate
sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# HTTP error rate
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Requests per status code
sum(rate(http_requests_total[5m])) by (status)

# Average response time
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

# P95 response time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Top‑10 slow endpoints
topk(10, sum(rate(http_request_duration_seconds_sum[5m])) by (endpoint) / sum(rate(http_request_duration_seconds_count[5m])) by (endpoint))

Key Points

histogram_quantile() calculates percentile latencies from histogram buckets; topk() returns the highest-ranking results.

Monitoring status‑code distribution helps locate problems quickly.

Practical Advice

Availability SLA: 99.9 % (annual downtime < 8.77 h).

P95 latency < 500 ms, P99 < 1 s.

Error rate < 0.1 %.

Case 6: Database Performance Monitoring

Scenario

Databases are core components; we must watch connections, query throughput, and lock waits.

Query

# MySQL connection usage
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100

# MySQL QPS
rate(mysql_global_status_queries[5m])

# MySQL slow‑query rate
rate(mysql_global_status_slow_queries[5m]) / rate(mysql_global_status_queries[5m]) * 100

# InnoDB buffer‑pool hit rate
(mysql_global_status_innodb_buffer_pool_read_requests - mysql_global_status_innodb_buffer_pool_reads) / mysql_global_status_innodb_buffer_pool_read_requests * 100

# MySQL replication lag
mysql_slave_lag_seconds

# PostgreSQL active connections
pg_stat_activity_count{state="active"}

# PostgreSQL cache hit rate
pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) * 100

Key Points

Database connections should stay below 80 % of the max limit.

Slow‑query rate should be under 1 %.

Buffer‑pool hit rate should exceed 95 %.

Practical Advice

Alert when connections > 80 % of max.

Watch QPS spikes or drops.

Regularly analyse slow‑query logs.
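
The connection-usage advice above could be encoded as an alerting rule along these lines (a sketch; the rule name and hold duration are assumptions):

- alert: MySQLConnectionsHigh
  expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "MySQL connection usage above 80% on {{ $labels.instance }}"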

Case 7: Container & Kubernetes Monitoring

Scenario

In containerised environments we need to monitor pod resource usage, container health, and cluster state.

Query

# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{pod!=""}[5m])) by (pod) / sum(container_spec_cpu_quota{pod!=""}/container_spec_cpu_period{pod!=""}) by (pod) * 100

# Pod memory usage
sum(container_memory_usage_bytes{pod!=""}) by (pod) / sum(container_spec_memory_limit_bytes{pod!=""}) by (pod) * 100

# Pods per namespace
count(kube_pod_info) by (namespace)

# Unhealthy pods count (kube_pod_status_phase exposes 0/1 per phase, so sum the non-Running series)
sum(kube_pod_status_phase{phase!="Running"})

# Schedulable pods per node
kube_node_status_allocatable{resource="pods"}

# Deployment replica availability
kube_deployment_status_replicas_available / kube_deployment_spec_replicas

# PVC usage
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100

Key Points

Filter out system containers when querying container metrics.

Kube‑state‑metrics provides cluster‑level state information.

Monitoring resource quotas and usage is crucial for cluster stability.

Practical Advice

Resource monitoring to avoid contention.

Pod status monitoring for rapid anomaly detection.

Capacity planning based on historical usage.
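
For the pod-status point, a hedged sketch of an alert that fires when a Deployment runs with fewer available replicas than desired, based on the replica-availability query above (rule name and hold duration are illustrative):

- alert: DeploymentReplicasUnavailable
  expr: kube_deployment_status_replicas_available / kube_deployment_spec_replicas < 1
  for: 10m
  labels:
    severity: warning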

Case 8: Application‑Level (APM) Monitoring

Scenario

Beyond infrastructure we must observe JVM metrics, garbage‑collection behavior, thread pools, etc.

Query

# JVM heap usage
jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} * 100

# GC frequency
rate(jvm_gc_collection_seconds_count[5m])

# Avg GC duration
rate(jvm_gc_collection_seconds_sum[5m]) / rate(jvm_gc_collection_seconds_count[5m])

# Runnable threads
jvm_threads_current{state="runnable"}

# Process start time
process_start_time_seconds

# Loaded classes count
jvm_classes_loaded

# Application throughput (TPS)
sum(rate(method_timed_count[5m])) by (application)

Key Points

JVM metrics require the application to expose a metrics library.

Long GC pauses degrade response time.

Too many threads increase context‑switch overhead.

Practical Advice

Keep heap usage between 70 %‑80 %.

Control GC frequency and keep single GC pause < 100 ms.
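
A sketch of enforcing the GC-pause target, using the average GC duration query above as an approximation of pause time (0.1 s = 100 ms; the rule name and hold duration are assumptions):

- alert: LongGCPauses
  expr: rate(jvm_gc_collection_seconds_sum[5m]) / rate(jvm_gc_collection_seconds_count[5m]) > 0.1
  for: 5m
  labels:
    severity: warning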

Case 9: Business‑Metric Monitoring & Anomaly Detection

Scenario

Technical metrics alone do not reflect true system health; business‑level KPIs are needed.

Query

# User registration success rate
sum(rate(user_registration_total{status="success"}[5m])) / sum(rate(user_registration_total[5m])) * 100

# Order payment success rate
sum(rate(payment_total{status="success"}[5m])) / sum(rate(payment_total[5m])) * 100

# API traffic anomaly (based on historical baseline)
(
  sum(rate(http_requests_total[5m])) -
  avg_over_time(sum(rate(http_requests_total[5m]))[1d:5m])
) / avg_over_time(sum(rate(http_requests_total[5m]))[1d:5m]) * 100 > 50

# Active users (hourly increase)
increase(active_users_total[1h])

# Error‑log growth rate
rate(log_messages_total{level="error"}[5m])

# Revenue trend prediction
predict_linear(revenue_total[1h], 3600)

Key Points

Business metrics must be emitted by the application. predict_linear() provides simple trend forecasting. avg_over_time() calculates a moving average for baseline comparison.

Practical Advice

Instrument every critical business step.

Build baselines from historical data for anomaly detection.

Trigger immediate alerts when business KPIs deviate.
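
As an illustrative example of the last point, a business-KPI alert on the payment success rate from the query above; the 99 % threshold and rule name are assumptions to adapt to real baselines:

- alert: PaymentSuccessRateLow
  expr: sum(rate(payment_total{status="success"}[5m])) / sum(rate(payment_total[5m])) * 100 < 99
  for: 5m
  labels:
    severity: critical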

Case 10: Unified Dashboard & SLI/SLO Monitoring

Scenario

Combine all metrics into a single health‑score dashboard and define service‑level objectives.

Query

# Overall system health score (weighted)
(
  # CPU weight 0.3
  (100 - avg(100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100))) * 0.3 +
  # Memory weight 0.3
  (100 - avg((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * 0.3 +
  # Network weight 0.2
  (100 - avg(rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])) * 100) * 0.2 +
  # Application weight 0.2
  (avg(sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])) * 100)) * 0.2
)

# SLI – availability (30‑day window)
avg_over_time((sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) [30d:5m]) * 100

# SLI – latency (P99 < 1 s proportion)
avg_over_time((sum(rate(http_request_duration_seconds_bucket{le="1.0"}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))) [30d:5m]) * 100

# Error‑budget consumption rate
(1 - sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) / (1 - 0.999) * 100

# Alert fatigue monitoring (time spent firing per alert over the last day, via the built-in ALERTS series)
sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[1d]))

Key Points

Health scores require sensible weighting of each sub‑metric.

SLO targets must be based on business needs and historical data.

Error‑budget tracking balances reliability with development velocity.

Practical Advice

Start with a few core SLIs before expanding.

Set challenging yet achievable SLOs.

Review and adjust SLOs regularly as the product evolves.
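
One hedged way to act on the error-budget query above is to alert when the budget is being consumed faster than it accrues, i.e. when the instantaneous consumption rate exceeds 100 %; the rule name and 15-minute hold are illustrative choices:

- alert: ErrorBudgetBurnHigh
  expr: (1 - sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) / (1 - 0.999) * 100 > 100
  for: 15m
  labels:
    severity: critical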

PromQL Advanced Tips & Best Practices

1. Query Optimization Techniques

# Pre‑compute complex metrics with recording rules
groups:
  - name: cpu_utilization
    rules:
    - record: instance:node_cpu_utilization:rate5m
      expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

# Narrow query range with label selectors
sum(rate(http_requests_total{service="api", environment="prod"}[5m]))

# Use aggregation wisely
sum by (instance)(rate(node_cpu_seconds_total[5m]))

2. Alert‑Rule Design Principles

# Include a "for" clause to avoid flapping alerts
- alert: HighCPUUsage
  expr: instance:node_cpu_utilization:rate5m > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage is {{ $value }}% for more than 5 minutes"

# Use predict_linear() for trend-based (predictive) alerts
- alert: DiskSpaceRunningOut
  expr: predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
  for: 5m
  labels:
    severity: critical

3. Performance Optimization Suggestions

Use recording rules to pre‑compute expensive expressions.

Adjust scrape intervals according to metric volatility.

Apply label filters to reduce the number of time series.

Avoid high‑cardinality labels to keep memory usage low.

By following the above ten real‑world cases and the advanced tips, operations engineers can build a comprehensive, reliable, and scalable monitoring system that turns raw metrics into actionable insights.

Tags: Monitoring, Observability, Metrics, Alerting, Prometheus, PromQL
Written by Raymond Ops

Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
