10 Essential PromQL Queries Every Ops Engineer Must Master
This article presents ten practical PromQL query examples covering CPU, memory, disk, network, application, database, Kubernetes, and business metrics, along with key concepts, alerting thresholds, and best‑practice tips to help operations engineers build a comprehensive monitoring system in cloud‑native environments.
PromQL in Practice: 10 Essential Query Cases Every Ops Engineer Should Know
Preface: In the cloud‑native era, Prometheus has become the de‑facto standard for monitoring. As a seasoned operations engineer, I have seen many teams stumble on PromQL queries and suffer production incidents due to insufficient monitoring. Here are ten real‑world PromQL queries, each distilled from hard‑won experience.
🚀 Why Is PromQL a Must‑Have Skill for Ops Engineers?
With micro‑service architectures, system complexity grows exponentially. Traditional monitoring can no longer meet modern operational needs. PromQL, the query language of Prometheus, offers powerful time‑series processing, flexible label selectors, rich built‑in functions, and real‑time query capabilities.
Mastering PromQL not only improves work efficiency but also enables rapid problem isolation, preventing production incidents from escalating.
📊 Case 1: CPU Usage Monitoring & Alerting
Scenario
CPU usage is one of the most basic and important monitoring metrics. In production we need to monitor each node's CPU usage in real time and alert when usage is too high.
Query
# Query CPU usage per instance
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
# Query cluster‑wide CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Query per‑core usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance, cpu) * 100)
# Alert when CPU usage > 80%
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
Key Points
rate() calculates the change rate of Counter‑type metrics.
node_cpu_seconds_total is a cumulative metric; it must be processed with rate() to obtain real‑time usage.
by (instance) groups results by instance.
mode="idle" represents idle time; subtracting from 100 % yields usage.
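Because mode is a label on node_cpu_seconds_total, the same rate() pattern can also break usage down by mode, which helps separate user‑space load from I/O wait. The following is a minimal sketch, assuming node_exporter's default mode values (user, system, iowait, and so on):

```promql
# Share of CPU time spent in each non-idle mode, averaged across the cores of each instance
avg by (instance, mode) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100

# I/O wait only; a sustained high value usually points at storage rather than compute
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100
```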
Practical Advice
Warning level: CPU > 70 % for 5 minutes.
Severe level: CPU > 85 % for 2 minutes.
Emergency level: CPU > 95 % for 1 minute.
🧠 Case 2: Precise Memory Usage Calculation
Scenario
Memory monitoring is more complex than CPU because Linux memory management makes a simple (total‑free)/total calculation inaccurate: memory held by the page cache and buffers can be reclaimed and should not count as used.
Query
# Memory usage excluding buffers and page cache
(
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes -
node_memory_Buffers_bytes - node_memory_Cached_bytes) /
node_memory_MemTotal_bytes
) * 100
# Available memory percentage
(
node_memory_MemAvailable_bytes /
node_memory_MemTotal_bytes
) * 100
# Swap usage
(
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) /
node_memory_SwapTotal_bytes
) * 100
# Memory pressure alert (usage>90% && Swap>50%)
(
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes * 100 > 90
) and (
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) /
node_memory_SwapTotal_bytes * 100 > 50
)
Key Points
node_memory_MemAvailable_bytes is the most accurate available memory metric.
Cache and buffer memory can be reclaimed under pressure and should not be counted as used.
High swap usage usually indicates memory pressure.
Practical Advice
Pre‑alert: memory usage > 80 %.
Alert: memory usage > 90 % or swap usage > 30 %.
Severe alert: memory usage > 95 % and swap usage > 50 %.
💾 Case 3: Disk Space & I/O Monitoring
Scenario
Disk problems are one of the most common causes of production failures. Insufficient disk space prevents writes; high I/O degrades performance.
Query
# Disk space usage
(
(node_filesystem_size_bytes - node_filesystem_free_bytes) /
node_filesystem_size_bytes
) * 100
# Exclude virtual filesystems
(
(node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"} -
node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"}) /
node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"}
) * 100
# Disk I/O usage
rate(node_disk_io_time_seconds_total[5m]) * 100
# Disk read/write IOPS
rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])
# Disk read/write bandwidth
rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m])
Key Points
Use fstype!~"tmpfs|fuse.lxcfs|squashfs" to filter out virtual filesystems.
node_disk_io_time_seconds_total represents disk busy time.
IOPS and bandwidth are key performance indicators for disks.
Practical Advice
Space alert: usage > 85 %.
Performance alert: I/O usage > 80 % for 5 minutes.
Predictive alert: forecast when space will run out based on historical trends.
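The predictive alert above can be sketched with predict_linear(). This is an illustrative example rather than a tuned rule: the 6‑hour lookback and 24‑hour horizon are assumptions, and node_filesystem_avail_bytes (space available to non‑root users) is used in place of node_filesystem_free_bytes.

```promql
# Filesystems projected to run out of space within 24 hours,
# extrapolating linearly from the last 6 hours of samples
predict_linear(
  node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs"}[6h],
  24 * 3600
) < 0
```

A longer lookback smooths out short write bursts at the cost of reacting more slowly to a genuine change in growth rate.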
🌐 Case 4: Network Traffic & Connection Monitoring
Scenario
Network monitoring is crucial for web applications and micro‑service architectures. Network anomalies are often the first sign of service unavailability.
Query
# Incoming traffic (bytes/s)
rate(node_network_receive_bytes_total{device!~"lo|docker.*|veth.*"}[5m])
# Outgoing traffic (bytes/s)
rate(node_network_transmit_bytes_total{device!~"lo|docker.*|veth.*"}[5m])
# Incoming packet rate
rate(node_network_receive_packets_total{device!~"lo|docker.*|veth.*"}[5m])
# Network error rate
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])
# TCP current connections
node_netstat_Tcp_CurrEstab
# TCP connection establishment rate
rate(node_netstat_Tcp_PassiveOpens[5m]) + rate(node_netstat_Tcp_ActiveOpens[5m])
Key Points
Filter out loopback and container interfaces with device!~"lo|docker.*|veth.*".
High network error rate often points to hardware or driver issues.
Excessive TCP connections can exhaust ports.
Practical Advice
Traffic monitoring: watch for spikes and abnormal patterns.
Connection count monitoring: prevent connection‑pool exhaustion.
Error rate monitoring: keep error rate near zero.
🔄 Case 5: Application Service Availability Monitoring
Scenario
For web applications we need to monitor service availability, response time, and error rate—metrics that directly affect user experience.
Query
# HTTP success rate
sum(rate(http_requests_total{status=~"2.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# HTTP error rate
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# Request rate by status code
sum(rate(http_requests_total[5m])) by (status)
# Average response time
sum(rate(http_request_duration_seconds_sum[5m])) /
sum(rate(http_request_duration_seconds_count[5m]))
# P95 response time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Top‑10 slow endpoints
topk(10, sum(rate(http_request_duration_seconds_sum[5m])) by (endpoint) /
sum(rate(http_request_duration_seconds_count[5m])) by (endpoint))
Key Points
histogram_quantile() calculates percentiles.
topk() returns the top‑K results.
Monitoring request distribution by status helps quickly locate issues.
Practical Advice
Availability SLA: 99.9 % (annual downtime < 8.77 h).
Response time: P95 < 500 ms, P99 < 1 s.
Error rate: < 0.1 %.
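The targets above can be turned into threshold expressions. The sketch below assumes the same metric and label names used in this case (http_request_duration_seconds, http_requests_total, endpoint, status) and counts only 5xx responses as errors, which is one possible reading of the error‑rate target.

```promql
# P99 latency per endpoint; returns only endpoints slower than 1 second
histogram_quantile(
  0.99,
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
) > 1

# Server-side error rate (5xx only) compared against the 0.1 % target
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100 > 0.1
```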
🗄️ Case 6: Database Performance Monitoring
Scenario
Databases are core components; their performance directly impacts the whole system. We need to monitor connection count, query performance, lock wait, etc.
Query
# MySQL connection usage
mysql_global_status_threads_connected /
mysql_global_variables_max_connections * 100
# MySQL QPS
rate(mysql_global_status_queries[5m])
# MySQL slow query rate
rate(mysql_global_status_slow_queries[5m]) /
rate(mysql_global_status_queries[5m]) * 100
# MySQL buffer pool hit rate
(mysql_global_status_innodb_buffer_pool_read_requests -
mysql_global_status_innodb_buffer_pool_reads) /
mysql_global_status_innodb_buffer_pool_read_requests * 100
# MySQL replication lag
mysql_slave_lag_seconds
# PostgreSQL active connections
pg_stat_activity_count{state="active"}
# PostgreSQL cache hit rate
pg_stat_database_blks_hit /
(pg_stat_database_blks_hit + pg_stat_database_blks_read) * 100
Key Points
Database connections should stay below 80 % of the max connections.
Slow query rate should be kept under 1 %.
Buffer pool hit rate should be above 95 %.
Practical Advice
Connection alert: > 80 % of max connections.
QPS monitoring: watch for sudden spikes or drops.
Slow query optimization: regularly analyze slow‑query logs.
📱 Case 7: Container & Kubernetes Monitoring
Scenario
In containerized environments we need to monitor pod resource usage, container status, and cluster health.
Query
# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{pod!=""}[5m])) by (pod) /
sum(container_spec_cpu_quota{pod!=""}/container_spec_cpu_period{pod!=""}) by (pod) * 100
# Pod memory usage
sum(container_memory_usage_bytes{pod!=""}) by (pod) /
sum(container_spec_memory_limit_bytes{pod!=""}) by (pod) * 100
# Number of pods per namespace
count(kube_pod_info) by (namespace)
# Unhealthy pod count (the metric is 1 for a pod's current phase and 0 otherwise, so sum instead of count)
sum(kube_pod_status_phase{phase!~"Running|Succeeded"})
# Node allocatable pod slots
kube_node_status_allocatable{resource="pods"}
# Deployment replica availability
kube_deployment_status_replicas_available /
kube_deployment_spec_replicas
# PVC usage rate
(kubelet_volume_stats_used_bytes /
kubelet_volume_stats_capacity_bytes) * 100
Key Points
Container metrics need to filter out system containers.
Kubernetes state metrics are collected via kube‑state‑metrics.
Monitoring resource quotas and usage is vital for cluster stability.
Practical Advice
Resource monitoring: prevent resource contention.
Pod status monitoring: detect abnormal pods promptly.
Cluster capacity planning: forecast resource needs based on history.
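For detecting abnormal pods promptly, restart counts and pending pods are common starting points. A minimal sketch, assuming kube‑state‑metrics is deployed; the threshold of 3 restarts per hour is an illustrative value, not a recommendation from the source.

```promql
# Containers that restarted more than 3 times in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 3

# Pods currently in the Pending phase (the value is 1 for the phase a pod is in)
sum by (namespace, pod) (kube_pod_status_phase{phase="Pending"}) > 0
```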
⚡ Case 8: Application Performance Metrics (APM) Monitoring
Scenario
Beyond infrastructure, we need to monitor application‑level metrics such as JVM performance, garbage collection, thread pools, etc.
Query
# JVM heap usage
jvm_memory_bytes_used{area="heap"} /
jvm_memory_bytes_max{area="heap"} * 100
# JVM GC frequency
rate(jvm_gc_collection_seconds_count[5m])
# JVM GC average duration
rate(jvm_gc_collection_seconds_sum[5m]) /
rate(jvm_gc_collection_seconds_count[5m])
# Active thread count
jvm_threads_state{state="RUNNABLE"}
# Application start time
process_start_time_seconds
# Classes loaded
jvm_classes_loaded
# Application throughput (TPS)
sum(rate(method_timed_count[5m])) by (application)
Key Points
JVM metrics require the application to expose them through a metrics library.
Long GC pauses degrade response time.
Too many threads can cause context‑switch overhead.
Practical Advice
Heap usage: keep between 70 %–80 %.
GC frequency: stay within reasonable limits.
GC pause: single pause < 100 ms.
🔍 Case 9: Business Metric Monitoring & Anomaly Detection
Scenario
Technical metrics are only part of monitoring; business metrics reflect the true health of the system. We need to define suitable business metrics per scenario.
Query
# User registration success rate
sum(rate(user_registration_total{status="success"}[5m])) /
sum(rate(user_registration_total[5m])) * 100
# Order payment success rate
sum(rate(payment_total{status="success"}[5m])) /
sum(rate(payment_total[5m])) * 100
# API traffic anomaly detection (based on historical data)
(
sum(rate(http_requests_total[5m])) -
avg_over_time(sum(rate(http_requests_total[5m]))[1d:5m])
) / avg_over_time(sum(rate(http_requests_total[5m]))[1d:5m]) * 100 > 50
# User activity monitoring
increase(active_users_total[1h])
# Error log growth rate
rate(log_messages_total{level="error"}[5m])
# Business metric trend prediction (using predict_linear)
predict_linear(revenue_total[1h], 3600)
Key Points
Business metrics must be reported by the application.
predict_linear() can be used for simple trend prediction.
avg_over_time() calculates the average over a time window.
Practical Advice
Core business processes: each critical step needs monitoring.
Anomaly detection: establish baselines from historical data.
Real‑time alerts: business metric anomalies require immediate response.
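One simple way to establish a baseline from historical data is to compare current traffic with the same time one week earlier using the offset modifier. This is a sketch under the assumption that traffic follows a weekly pattern; the 50 % tolerance mirrors the threshold used in the anomaly query above.

```promql
# Ratio of current traffic to the same time one week ago;
# values far from 1 suggest an anomaly
sum(rate(http_requests_total[5m])) /
sum(rate(http_requests_total[5m] offset 1w))

# Flag deviations of more than 50 % in either direction
abs(
  sum(rate(http_requests_total[5m])) /
  sum(rate(http_requests_total[5m] offset 1w)) - 1
) > 0.5
```

Unlike a 1‑day average, a week‑over‑week comparison keeps daily peaks and troughs from triggering the check on their own.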
🎯 Case 10: Comprehensive Dashboard & SLI/SLO Monitoring
Scenario
Integrate various metrics into a comprehensive dashboard for end‑to‑end monitoring, and establish SLI/SLO‑based service quality guarantees.
Query
# Overall system health score
(
# CPU weight 0.3
(100 - avg(100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100))) * 0.3 +
# Memory weight 0.3
(100 - avg((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100)) * 0.3 +
# Network weight 0.2
(100 - avg(rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])) * 100) * 0.2 +
# Application weight 0.2
(avg(sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])) * 100)) * 0.2
)
# SLI: availability over 30 days
avg_over_time(
(
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
)[30d:5m]
) * 100
# SLI: latency (proportion of requests completed within 1s over 30 days)
avg_over_time(
(
sum(rate(http_request_duration_seconds_bucket{le="1.0"}[5m])) /
sum(rate(http_request_duration_seconds_count[5m]))
)[30d:5m]
) * 100
# Error budget consumption rate
(1 -
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))) / (1 - 0.999) * 100
# Alert fatigue monitoring (notifications sent per Alertmanager in the last day)
sum(increase(prometheus_notifications_sent_total[1d])) by (alertmanager)
Key Points
Overall health score requires reasonable weighting of each metric.
SLO settings should be based on business needs and historical data.
Error budget is a crucial tool to balance reliability and development speed.
Practical Advice
Start simple: define core SLIs first.
Set realistic targets: SLOs should be challenging yet achievable.
Review regularly: adjust SLOs as the business evolves.
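The error budget idea can be expressed as a burn rate: the observed error ratio divided by the ratio the SLO allows. The sketch below assumes the 99.9 % target used in this case and a 30‑day SLO window; the 14.4x threshold is a commonly cited fast‑burn value (about 2 % of a 30‑day budget spent in one hour), not a figure from the source.

```promql
# Burn rate over the last hour. A value of 1 spends the budget exactly
# over the SLO window; higher values spend it faster.
(
  sum(rate(http_requests_total{status=~"5.."}[1h])) /
  sum(rate(http_requests_total[1h]))
) / (1 - 0.999)

# Fast-burn condition: both the 1h and the 5m window burn above 14.4x
(
  sum(rate(http_requests_total{status=~"5.."}[1h])) /
  sum(rate(http_requests_total[1h])) > 14.4 * (1 - 0.999)
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[5m])) /
  sum(rate(http_requests_total[5m])) > 14.4 * (1 - 0.999)
)
```

Requiring the short window as well means the condition clears quickly once the errors stop, instead of staying true until the 1‑hour window rolls over.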
📈 Monitoring System Construction Summary
Through the ten practical cases above, we have built a complete monitoring system:
Infrastructure monitoring: CPU, memory, disk, network.
Application layer monitoring: service availability, performance metrics.
Database monitoring: connections, query performance.
Container monitoring: pod status, resource usage.
Business metric monitoring: key business processes.
Comprehensive performance dashboard: end‑to‑end view.
Key Construction Points
Layered monitoring: from infrastructure to business.
Prevention first: use monitoring to predict and avoid failures.
Rapid response: establish effective alerting mechanisms.
Continuous improvement: refine monitoring strategies based on real‑world feedback.
🎉 Conclusion
As an operations engineer, mastering PromQL is not just about learning a few query statements; the real value lies in making system state observable, predictable, and controllable.
In today’s digital transformation, ops engineers are no longer simple “fire‑fighters”; they become guarantors of business stability, optimizers of system performance, and preventers of technical risk.
MaGe Linux Operations
