10 Essential PromQL Queries Every Ops Engineer Must Master

This article presents ten practical PromQL query examples covering CPU, memory, disk, network, application, database, Kubernetes, and business metrics, along with key concepts, alerting thresholds, and best‑practice tips to help operations engineers build a comprehensive monitoring system in cloud‑native environments.

MaGe Linux Operations

PromQL in Practice: 10 Essential Query Cases Every Ops Engineer Should Know

Preface: In the cloud‑native era, Prometheus has become the de‑facto standard for monitoring. As a seasoned operations engineer, I have seen many teams stumble on PromQL queries and suffer production incidents due to insufficient monitoring. Here are ten real‑world PromQL queries, each distilled from hard‑won experience.

🚀 Why PromQL Is a Must-Have Skill for Ops Engineers

With micro-service architectures, the number of moving parts grows quickly, and traditional monitoring can no longer keep up with modern operational needs. PromQL, the query language of Prometheus, offers powerful time-series processing, flexible label selectors, rich built-in functions, and real-time queries.

Mastering PromQL not only improves work efficiency but also enables rapid problem isolation, preventing production incidents from escalating.

📊 Case 1: CPU Usage Monitoring & Alerting

Scenario

CPU usage is one of the most basic and important monitoring metrics. In production we need to monitor each node's CPU usage in real time and alert when usage is too high.

Query

# Query CPU usage per instance (one result for each node)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

# Query cluster‑wide CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Query per‑core usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance, cpu) * 100)

# Alert when CPU usage > 80%
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80

Key Points

rate() calculates the per-second rate of change of a Counter-type metric.

node_cpu_seconds_total is a cumulative counter; it must be wrapped in rate() to obtain current usage.

by (instance) groups results by instance.

mode="idle" selects idle time; subtracting the idle percentage from 100 % yields usage.
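The idle-based queries only say how busy a node is, not where the time goes. A minimal sketch of a per-mode breakdown follows; it assumes the usual node_exporter mode values (such as user, system, iowait, and steal), and the exact set depends on the kernel and exporter version.

```promql
# Share of CPU time spent in each non-idle mode, averaged over all cores of an instance (percent)
avg by (instance, mode) (
  rate(node_cpu_seconds_total{mode!="idle"}[5m])
) * 100
```

A high iowait share usually points at storage rather than compute, while a high steal share suggests contention on the hypervisor.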

Practical Advice

Warning level: CPU > 70 % for 5 minutes.

Severe level: CPU > 85 % for 2 minutes.

Emergency level: CPU > 95 % for 1 minute.

🧠 Case 2: Precise Memory Usage Calculation

Scenario

Memory monitoring is more complex than CPU monitoring because Linux uses otherwise idle memory for cache and buffers, so a simple (total - free)/total calculation overstates usage. We need to account for cache, buffers, and other reclaimable memory.

Query

# Memory usage on Linux, excluding buffers and page cache
(
  (node_memory_MemTotal_bytes - node_memory_MemFree_bytes -
   node_memory_Buffers_bytes - node_memory_Cached_bytes) /
  node_memory_MemTotal_bytes
) * 100

# Memory available rate
(
  node_memory_MemAvailable_bytes /
  node_memory_MemTotal_bytes
) * 100

# Swap usage
(
  (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) /
  node_memory_SwapTotal_bytes
) * 100

# Memory pressure alert (usage>90% && Swap>50%)
(
  (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
  node_memory_MemTotal_bytes * 100 > 90
) and (
  (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) /
  node_memory_SwapTotal_bytes * 100 > 50
)

Key Points

node_memory_MemAvailable_bytes is the most accurate available memory metric.

Cache and buffer memory can be reclaimed under pressure and should not be counted as used.

High swap usage usually indicates memory pressure.
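One caveat with the swap queries: on hosts with swap disabled, node_memory_SwapTotal_bytes is 0, so the division yields NaN instead of a usable percentage. A hedged sketch of one way to guard against this is to filter the denominator so that only hosts with swap configured are evaluated.

```promql
# Swap usage, evaluated only on hosts where swap is configured (percent)
(
  (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)
  / (node_memory_SwapTotal_bytes > 0)
) * 100
```

The `> 0` comparison drops series whose value is zero, so swapless hosts simply do not appear in the result.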

Practical Advice

Pre‑alert: memory usage > 80 %.

Alert: memory usage > 90 % or swap usage > 30 %.

Severe alert: memory usage > 95 % and swap usage > 50 %.

💾 Case 3: Disk Space & I/O Monitoring

Scenario

Disk problems are one of the most common causes of production failures. Insufficient disk space prevents writes; high I/O degrades performance.

Query

# Disk space usage
(
  (node_filesystem_size_bytes - node_filesystem_free_bytes) /
  node_filesystem_size_bytes
) * 100

# Exclude virtual filesystems
(
  (node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"} -
   node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"}) /
  node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"}
) * 100

# Disk I/O usage
rate(node_disk_io_time_seconds_total[5m]) * 100

# Disk read/write IOPS
rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])

# Disk read/write bandwidth
rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])

Key Points

Use fstype!~"tmpfs|fuse.lxcfs|squashfs" to filter out virtual filesystems.

node_disk_io_time_seconds_total represents disk busy time.

IOPS and bandwidth are key performance indicators for disks.

Practical Advice

Space alert: usage > 85 %.

Performance alert: I/O usage > 80 % for 5 minutes.

Predictive alert: forecast when space will run out based on historical trends.
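The predictive alert can be sketched with predict_linear(). The 6-hour lookback and 24-hour horizon below are illustrative assumptions, not recommendations from the original; node_filesystem_avail_bytes is used here because it reflects the space available to non-root users.

```promql
# Filesystems projected to run out of space within 24 hours,
# extrapolating linearly from the last 6 hours of samples
predict_linear(
  node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"}[6h],
  24 * 3600
) < 0
```

A linear fit reacts strongly to short bursts such as log rotation or batch jobs, so a longer lookback window tends to give steadier predictions.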

🌐 Case 4: Network Traffic & Connection Monitoring

Scenario

Network monitoring is crucial for web applications and micro‑service architectures. Network anomalies are often the first sign of service unavailability.

Query

# Incoming traffic (bytes/s)
rate(node_network_receive_bytes_total{device!~"lo|docker.*|veth.*"}[5m])

# Outgoing traffic (bytes/s)
rate(node_network_transmit_bytes_total{device!~"lo|docker.*|veth.*"}[5m])

# Incoming packet rate
rate(node_network_receive_packets_total{device!~"lo|docker.*|veth.*"}[5m])

# Network error rate
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])

# TCP current connections
node_netstat_Tcp_CurrEstab

# TCP connection establishment rate
rate(node_netstat_Tcp_PassiveOpens[5m]) + rate(node_netstat_Tcp_ActiveOpens[5m])

Key Points

Filter out loopback and container interfaces with device!~"lo|docker.*|veth.*".

High network error rate often points to hardware or driver issues.

Excessive TCP connections can exhaust ports.

Practical Advice

Traffic monitoring: watch for spikes and abnormal patterns.

Connection count monitoring: prevent connection‑pool exhaustion.

Error rate monitoring: keep error rate near zero.
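An absolute error count is hard to judge without knowing how much traffic an interface carries. As a rough sketch, the receive error rate can be expressed relative to the receive packet rate, reusing the same device filter as above.

```promql
# Share of received packets that were errors, per interface (percent)
rate(node_network_receive_errs_total{device!~"lo|docker.*|veth.*"}[5m])
/
rate(node_network_receive_packets_total{device!~"lo|docker.*|veth.*"}[5m])
* 100
```

An interface that received no packets in the window produces NaN (0 divided by 0), which a threshold comparison silently ignores.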

🔄 Case 5: Application Service Availability Monitoring

Scenario

For web applications we need to monitor service availability, response time, and error rate—metrics that directly affect user experience.

Query

# HTTP success rate
sum(rate(http_requests_total{status=~"2.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# HTTP error rate
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# Request count by status code
sum(rate(http_requests_total[5m])) by (status)

# Average response time
sum(rate(http_request_duration_seconds_sum[5m])) /
sum(rate(http_request_duration_seconds_count[5m]))

# P95 response time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Top-10 slowest endpoints by average response time
topk(10,
  sum(rate(http_request_duration_seconds_sum[5m])) by (endpoint) /
  sum(rate(http_request_duration_seconds_count[5m])) by (endpoint)
)

Key Points

histogram_quantile() calculates percentiles from histogram buckets.

topk() returns the top-K results.

Monitoring request distribution by status helps quickly locate issues.

Practical Advice

Availability SLA: 99.9 % (annual downtime < 8.77 h).

Response time: P95 < 500 ms, P99 < 1 s.

Error rate: < 0.1 %.
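The P99 target can be checked per endpoint with the same histogram used for the P95 query. This sketch assumes the histogram carries the endpoint label used in the top-10 query; the le label must be kept in the aggregation, because histogram_quantile() needs the bucket boundaries.

```promql
# P99 latency per endpoint, keeping only endpoints above the 1 s target
histogram_quantile(
  0.99,
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
) > 1
```

The result is an estimate interpolated within buckets, so its precision depends on how the bucket boundaries were chosen around the 1 s mark.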

🗄️ Case 6: Database Performance Monitoring

Scenario

Databases are core components; their performance directly impacts the whole system. We need to monitor connection count, query performance, lock wait, etc.

Query

# MySQL connection usage
mysql_global_status_threads_connected /
mysql_global_variables_max_connections * 100

# MySQL QPS
rate(mysql_global_status_queries[5m])

# MySQL slow query rate
rate(mysql_global_status_slow_queries[5m]) /
rate(mysql_global_status_queries[5m]) * 100

# MySQL buffer pool hit rate
(mysql_global_status_innodb_buffer_pool_read_requests -
 mysql_global_status_innodb_buffer_pool_reads) /
mysql_global_status_innodb_buffer_pool_read_requests * 100

# MySQL replication lag (seconds behind the source, from mysqld_exporter's slave status collector)
mysql_slave_status_seconds_behind_master

# PostgreSQL active connections
pg_stat_activity_count{state="active"}

# PostgreSQL cache hit rate
pg_stat_database_blks_hit /
(pg_stat_database_blks_hit + pg_stat_database_blks_read) * 100

Key Points

Database connections should stay below 80 % of the max connections.

Slow query rate should be kept under 1 %.

Buffer pool hit rate should be above 95 %.

Practical Advice

Connection alert: > 80 % of max connections.

QPS monitoring: watch for sudden spikes or drops.

Slow query optimization: regularly analyze slow‑query logs.
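The two hit-rate queries in this case divide raw counters, so they report the average since the server started and react slowly to recent changes. A possible windowed variant, sketched with the same metrics, wraps each counter in rate().

```promql
# InnoDB buffer pool hit rate over the last 5 minutes (percent)
(
  1 -
  rate(mysql_global_status_innodb_buffer_pool_reads[5m]) /
  rate(mysql_global_status_innodb_buffer_pool_read_requests[5m])
) * 100

# PostgreSQL cache hit rate over the last 5 minutes, per database (percent)
rate(pg_stat_database_blks_hit[5m]) /
(rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m])) * 100
```

The lifetime version remains useful as a long-term baseline; the windowed version is the better fit for alerting on a sudden drop.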

📱 Case 7: Container & Kubernetes Monitoring

Scenario

In containerized environments we need to monitor pod resource usage, container status, and cluster health.

Query

# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{pod!=""}[5m])) by (pod) /
sum(container_spec_cpu_quota{pod!=""}/container_spec_cpu_period{pod!=""}) by (pod) * 100

# Pod memory usage
sum(container_memory_usage_bytes{pod!=""}) by (pod) /
sum(container_spec_memory_limit_bytes{pod!=""}) by (pod) * 100

# Number of pods per namespace
count(kube_pod_info) by (namespace)

# Unhealthy pod count (kube_pod_status_phase is 1 for a pod's current phase and 0 otherwise)
sum(kube_pod_status_phase{phase=~"Pending|Failed|Unknown"})

# Node allocatable pod slots
kube_node_status_allocatable{resource="pods"}

# Deployment replica availability
kube_deployment_status_replicas_available /
kube_deployment_spec_replicas

# PVC usage rate
(kubelet_volume_stats_used_bytes /
kubelet_volume_stats_capacity_bytes) * 100

Key Points

Container metrics need to filter out system containers.

Kubernetes state metrics are collected via kube‑state‑metrics.

Monitoring resource quotas and usage is vital for cluster stability.
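Filtering matters because cAdvisor also exposes pod-level cgroup series, so selecting only on pod!="" can count the same usage twice. The sketch below assumes the kubelet's cAdvisor metrics carry a container label, that pod-level series have an empty container value, and that pause containers appear as container="POD" (this varies by runtime). It also uses the working set rather than total usage, since the latter includes page cache.

```promql
# Pod memory usage against limits, counting only real containers (percent)
sum by (namespace, pod) (
  container_memory_working_set_bytes{container!="", container!="POD"}
)
/
sum by (namespace, pod) (
  container_spec_memory_limit_bytes{container!="", container!="POD"} > 0
)
* 100
```

The `> 0` filter drops containers without a memory limit. Pods where no container has a limit disappear from the result, and pods with a mix of limited and unlimited containers will read higher than they should.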

Practical Advice

Resource monitoring: prevent resource contention.

Pod status monitoring: detect abnormal pods promptly.

Cluster capacity planning: forecast resource needs based on history.

⚡ Case 8: Application Performance Metrics (APM) Monitoring

Scenario

Beyond infrastructure, we need to monitor application‑level metrics such as JVM performance, garbage collection, thread pools, etc.

Query

# JVM heap usage
jvm_memory_bytes_used{area="heap"} /
jvm_memory_bytes_max{area="heap"} * 100

# JVM GC frequency
rate(jvm_gc_collection_seconds_count[5m])

# JVM GC average duration
rate(jvm_gc_collection_seconds_sum[5m]) /
rate(jvm_gc_collection_seconds_count[5m])

# Runnable thread count (jvm_threads_current is the total and has no state label)
jvm_threads_state{state="RUNNABLE"}

# Application start time
process_start_time_seconds

# Classes loaded
jvm_classes_loaded

# Application throughput (TPS), taken from the timer's count series rather than its sum
sum(rate(method_timed_count[5m])) by (application)

Key Points

JVM metrics require the application to expose a metrics library.

Long GC pauses degrade response time.

Too many threads can cause context‑switch overhead.
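Average GC duration can hide how much total time a JVM loses to collection. One way to make that visible, using the same summary metric as the queries above, is the fraction of wall-clock time spent in GC.

```promql
# Share of wall-clock time spent in garbage collection, per instance (percent)
sum by (instance) (rate(jvm_gc_collection_seconds_sum[5m])) * 100
```

Because rate() of a seconds counter yields seconds per second, the result reads directly as a percentage after multiplying by 100. With concurrent collectors the value can overstate the actual pause time.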

Practical Advice

Heap usage: keep steady-state usage below 70 %–80 %.

GC frequency: stay within reasonable limits.

GC pause: single pause < 100 ms.

🔍 Case 9: Business Metric Monitoring & Anomaly Detection

Scenario

Technical metrics are only part of monitoring; business metrics reflect the true health of the system. We need to define suitable business metrics per scenario.

Query

# User registration success rate
sum(rate(user_registration_total{status="success"}[5m])) /
sum(rate(user_registration_total[5m])) * 100

# Order payment success rate
sum(rate(payment_total{status="success"}[5m])) /
sum(rate(payment_total[5m])) * 100

# API traffic anomaly detection (based on historical data)
(
 sum(rate(http_requests_total[5m])) -
 avg_over_time(sum(rate(http_requests_total[5m]))[1d:5m])
) / avg_over_time(sum(rate(http_requests_total[5m]))[1d:5m]) * 100 > 50

# User activity monitoring
increase(active_users_total[1h])

# Error log growth rate
rate(log_messages_total{level="error"}[5m])

# Business metric trend prediction (using predict_linear)
predict_linear(revenue_total[1h], 3600)

Key Points

Business metrics must be reported by the application.

predict_linear() can be used for simple trend prediction.

avg_over_time() calculates the average over a time window.

Practical Advice

Core business processes: each critical step needs monitoring.

Anomaly detection: establish baselines from historical data.

Real‑time alerts: business metric anomalies require immediate response.
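A daily average baseline ignores the usual day and week cycles in traffic. As an alternative sketch, the offset modifier compares the current rate with the same moment one week earlier; the one-week period is an assumption that fits workloads with a weekly rhythm.

```promql
# Current request rate as a percentage of the rate at the same time one week ago
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 1w))
* 100
```

A value near 100 means traffic matches last week; thresholds such as below 50 or above 200 would need tuning against real history.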

🎯 Case 10: Comprehensive Dashboard & SLI/SLO Monitoring

Scenario

Integrate various metrics into a comprehensive dashboard for end‑to‑end monitoring, and establish SLI/SLO‑based service quality guarantees.

Query

# Overall system health score
(
 # CPU weight 0.3
 (100 - avg(100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100))) * 0.3 +
 # Memory weight 0.3
 (100 - avg((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * 0.3 +
 # Network weight 0.2
 (100 - avg(rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])) * 100) * 0.2 +
 # Application weight 0.2
 (avg(sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])) * 100)) * 0.2
)

# SLI: availability (30-day average of the non-5xx request ratio)
avg_over_time(
  (
    sum(rate(http_requests_total{status!~"5.."}[5m])) /
    sum(rate(http_requests_total[5m]))
  )[30d:5m]
) * 100

# SLI: latency (30-day average share of requests completing within 1 s)
avg_over_time(
  (
    sum(rate(http_request_duration_seconds_bucket{le="1.0"}[5m])) /
    sum(rate(http_request_duration_seconds_count[5m]))
  )[30d:5m]
) * 100

# Error budget consumption rate
(1 -
 sum(rate(http_requests_total{status!~"5.."}[5m])) /
 sum(rate(http_requests_total[5m]))) / (1 - 0.999) * 100

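# Alert fatigue monitoring: samples each alert spent firing over the last day
# (ALERTS is the built-in series Prometheus writes for pending and firing alerts)
sum(count_over_time(ALERTS{alertstate="firing"}[1d])) by (alertname)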

Key Points

Overall health score requires reasonable weighting of each metric.

SLO settings should be based on business needs and historical data.

Error budget is a crucial tool to balance reliability and development speed.
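The error budget query above can be turned into a paging condition by checking the burn rate over two windows at once. This is a sketch of a common multi-window convention, not something prescribed by the original article: the 14.4 threshold is an assumption that corresponds to spending 2 % of a 30-day budget within one hour.

```promql
# Fast-burn condition for a 99.9 % availability SLO:
# the error ratio exceeds 14.4x the budgeted rate over both 1 h and 5 m
(
  (1 - sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))
  / (1 - 0.999) > 14.4
)
and
(
  (1 - sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))
  / (1 - 0.999) > 14.4
)
```

The long window shows that a meaningful share of the budget is already gone, and the short window confirms the problem is still happening, which helps the alert clear quickly after recovery.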

Practical Advice

Start simple: define core SLIs first.

Set realistic targets: SLOs should be challenging yet achievable.

Review regularly: adjust SLOs as the business evolves.

📈 Monitoring System Construction Summary

Through the ten practical cases above, we have built a complete monitoring system:

Infrastructure monitoring: CPU, memory, disk, network.

Application layer monitoring: service availability, performance metrics.

Database monitoring: connections, query performance.

Container monitoring: pod status, resource usage.

Business metric monitoring: key business processes.

Comprehensive performance dashboard: end‑to‑end view.

Key Construction Points

Layered monitoring: from infrastructure to business.

Prevention first: use monitoring to predict and avoid failures.

Rapid response: establish effective alerting mechanisms.

Continuous improvement: refine monitoring strategies based on real‑world feedback.

🎉 Conclusion

As an operations engineer, mastering PromQL is not just about learning a few query statements; the real value lies in making system state observable, predictable, and controllable.

In today’s digital transformation, ops engineers are no longer simple “fire‑fighters”; they become guarantors of business stability, optimizers of system performance, and preventers of technical risk.
