Mastering Prometheus: Practical Tips for Effective Application Monitoring
This article explains how to design and implement Prometheus metrics for application monitoring, covering the selection of monitoring targets, golden metrics, label conventions, naming rules, histogram bucket choices, and Grafana visualization tricks to help engineers build reliable observability pipelines.
Using Prometheus for Application Monitoring
In this article we introduce how to use Prometheus to monitor applications and share practical metrics recommendations based on our experience and the official documentation.
Identify Monitoring Targets
Before designing metrics, clearly define what needs to be measured based on the problem context, requirements, and the system itself.
From Requirements
Google's four golden signals for monitoring large-scale distributed systems are latency, traffic, errors, and saturation; they serve as useful references for most scenarios.
Latency: time taken to serve a request.
Traffic: volume of requests, indicating capacity needs.
Errors: rate of failed requests.
Saturation: degree to which a critical resource (e.g., memory) is constrained.
These metrics address four monitoring goals:
Reflect user experience and core performance (e.g., latency, job completion time).
Measure system throughput (e.g., request count, network packet size).
Help discover and locate faults (e.g., error count, failure rate).
Show system saturation and load (e.g., memory usage, queue length).
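The four signals above can be instrumented with the official Python client. A minimal sketch, assuming `prometheus_client` is installed; the metric names here are illustrative, not from the article:

```python
from prometheus_client import Counter, Gauge, Histogram

# Illustrative names; in practice, prefix with your own domain.
REQUESTS = Counter("myapp_http_requests_total", "Traffic: requests served", ["path"])
ERRORS = Counter("myapp_http_errors_total", "Errors: failed requests", ["path"])
LATENCY = Histogram("myapp_http_request_duration_seconds", "Latency per request", ["path"])
IN_FLIGHT = Gauge("myapp_http_in_flight_requests", "Saturation: concurrent requests")

def handle(path: str) -> None:
    IN_FLIGHT.inc()
    try:
        with LATENCY.labels(path).time():  # observes the duration on exit
            REQUESTS.labels(path).inc()
            # ... real request handling would go here ...
    except Exception:
        ERRORS.labels(path).inc()
        raise
    finally:
        IN_FLIGHT.dec()

handle("/api/v1/items")
```

Exposing these via `prometheus_client.start_http_server()` is then enough for Prometheus to scrape all four signals from one endpoint.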
Additional custom metrics may be added for specific scenarios, such as measuring the latency and failure count of a frequently called library interface.
From the System Type
Official best practices classify applications into three categories, each with typical measurement objects:
Online‑serving systems: request count, error count, latency, etc.
Offline processing systems: job start time, number of running jobs, items processed, queue length.
Batch jobs: final execution time, stage durations, total runtime, record count.
Sub-systems are also worth monitoring:
Libraries: call count, success/failure count, latency.
Logging: frequency of log entries.
Failures: error count.
Thread pools: queued requests, number of active threads, total threads.
Caches: request count, hits, total latency.
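The library pattern above can be sketched with a wrapper. To keep the example self-contained, plain Python dicts stand in for real `prometheus_client` Counter/Histogram objects, and the function names are made up:

```python
import time
from collections import defaultdict

# Stand-ins for client metrics; in production these would be
# Counter/Histogram objects such as lib_calls_total, lib_failures_total,
# and lib_call_duration_seconds.
calls = defaultdict(int)
failures = defaultdict(int)
latency_sum = defaultdict(float)

def instrumented(func):
    """Record call count, failure count, and cumulative latency per function."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            failures[func.__name__] += 1
            raise
        finally:
            calls[func.__name__] += 1
            latency_sum[func.__name__] += time.perf_counter() - start
    return wrapper

@instrumented
def lookup(key):
    if key is None:
        raise ValueError("missing key")
    return key.upper()

lookup("a")
try:
    lookup(None)  # counted as one call and one failure
except ValueError:
    pass
```

The same wrapper shape works for caches and thread-pool submissions: every interesting call site goes through one instrumented choke point.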
Select Metric Vectors
A metric vector should group series that share the same data type and unit but differ in resource or collection point; keep units uniform within a vector. Examples include measuring request latency across different resources, regions, or HTTP error types.
The official docs recommend using separate metrics for different operations (e.g., read vs. write) rather than combining them.
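As a sketch of that recommendation (metric names are hypothetical), separate read and write metrics keep aggregation meaningful, whereas folding both into one metric invites queries that sum unlike operations:

```
# Preferred: one metric per operation type
myapp_read_duration_seconds_sum   12.7
myapp_write_duration_seconds_sum   3.1

# Avoid: a single metric where summing across the label mixes operations
myapp_operation_duration_seconds_sum{operation="read"}  12.7
myapp_operation_duration_seconds_sum{operation="write"}  3.1
```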
Determine Labels
Common label choices include resource, region, type, etc. Labels should be additive and comparable, and units must be consistent within a label dimension. Avoid mixing raw counts and totals in the same label set; instead aggregate totals with PromQL or use a separate metric.
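Aggregating at query time can be sketched as follows (metric and label names are illustrative): the application exports only the per-label series, and PromQL computes the total:

```
# Exported series (no precomputed "total" label value):
my_metric{label="a"} 1
my_metric{label="b"} 6

# Total computed at query time:
sum(my_metric)
```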
Name Metrics and Labels
Good names are self-descriptive. Metric names should match the pattern [a-zA-Z_:][a-zA-Z0-9_:]*, carry an application or domain prefix, and end with a unit suffix using base units such as seconds or bytes. Examples:
prometheus_notifications_total
process_cpu_seconds_total
ipamd_request_latency
http_request_duration_seconds
node_memory_usage_bytes
Label names should reflect the chosen dimension, for example region: shenzhen/guangzhou/beijing, owner: user1/user2, or stage: extract/transform/load.
Bucket Selection for Histograms
Appropriate buckets improve percentile accuracy; ideally, observations fall roughly evenly across buckets. Start from the client defaults ({0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}) or exponential buckets for long-tailed latency distributions, then adjust boundaries based on observed data: widen them in sparse regions and add finer boundaries where observations cluster, especially around critical percentiles such as the 90th.
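To see why boundary placement matters, here is a simplified sketch of the linear interpolation that PromQL's histogram_quantile() performs (the real implementation also handles the +Inf bucket and other edge cases); the bucket data below is made up:

```python
import bisect

def quantile_from_buckets(q, bounds, cumulative):
    """Estimate the q-quantile from ascending bucket upper bounds and
    cumulative counts, assuming observations are spread evenly in a bucket."""
    rank = q * cumulative[-1]
    i = bisect.bisect_left(cumulative, rank)
    lo = bounds[i - 1] if i > 0 else 0.0
    prev = cumulative[i - 1] if i > 0 else 0
    # Linear interpolation inside the bucket that contains the rank.
    return lo + (bounds[i] - lo) * (rank - prev) / (cumulative[i] - prev)

bounds = [0.05, 0.1, 0.25, 0.5, 1.0]   # upper bounds in seconds
cumulative = [10, 60, 90, 98, 100]     # observations <= each bound
p90 = quantile_from_buckets(0.90, bounds, cumulative)  # ~0.25
p50 = quantile_from_buckets(0.50, bounds, cumulative)  # ~0.09
```

Because the estimate is interpolated within a single bucket, a percentile that falls inside a wide bucket can be off by up to that bucket's width; this is why boundaries should be densest around the percentiles you actually monitor.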
As a concrete example of the label anti-pattern described earlier, the following series mix per-label values with a precomputed total:

my_metric{label="a"} 1
my_metric{label="b"} 6
my_metric{label="total"} 7

Grafana Usage Tips
View All Dimensions
To discover additional grouping dimensions, keep only the metric name in the query expression and leave the legend format empty; the raw metric data will be displayed.
Ruler Linking
In the Settings panel, change the Graph Tooltip to “Shared crosshair” or “Shared Tooltip” to enable synchronized ruler display across multiple metrics, facilitating correlation analysis.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and hope to accompany you through your operations career as we grow together.