Mastering Application Monitoring with Prometheus: Practical Tips and Best Practices
This article explains how to design effective Prometheus metrics, choose appropriate vectors, labels, buckets, and naming conventions, and offers Grafana usage tricks to help engineers monitor online services, batch jobs, and offline processing systems with clear, actionable insights.
This article introduces how to use Prometheus for application monitoring, summarizing practical metrics-design practices drawn from our experience and the official documentation.
Identify Monitoring Targets
Before designing metrics, clearly define what needs to be measured based on the problem context, requirements, and the system itself.
From Requirements
Google’s SRE book defines four golden signals for monitoring large-scale distributed systems: latency, traffic, errors, and saturation. They cover the core of most general monitoring needs:
Latency: the time it takes to serve a request, ideally tracked separately for successful and failed requests.
Traffic: the current demand on the system, such as requests per second, which indicates capacity needs.
Errors: the rate of requests that fail.
Saturation: how close a constraining resource (e.g., memory or I/O) is to its limit.
Together, these signals reflect user experience, measure throughput, help locate faults, and indicate load and saturation.
Additional custom metrics may be needed for specific scenarios, such as measuring the latency and failure count of a frequently called library.
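As a sketch of such a custom metric, the wrapper below tracks the latency and failure count of a frequently called library function using the official Python client (all metric and function names here are illustrative, not from the original article):

```python
# Hypothetical instrumentation of a frequently called library function.
import time
from prometheus_client import Counter, Histogram

LIB_CALL_DURATION = Histogram(
    "mylib_call_duration_seconds", "Latency of mylib calls")
LIB_CALL_FAILURES = Counter(
    "mylib_call_failures_total", "Number of failed mylib calls")

def call_with_metrics(fn, *args, **kwargs):
    """Invoke fn, recording its latency and counting failures."""
    start = time.time()
    try:
        return fn(*args, **kwargs)
    except Exception:
        LIB_CALL_FAILURES.inc()
        raise
    finally:
        # Observe latency for both successful and failed calls.
        LIB_CALL_DURATION.observe(time.time() - start)
```

For simple cases, `LIB_CALL_DURATION.time()` can also be used as a decorator or context manager instead of timing by hand.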
From System Types
The official best-practice guide classifies applications into three categories, each with different typical measurement objects:
Online‑serving systems (e.g., web servers): request count, error count, latency, etc.
Offline processing systems (e.g., Spark batch jobs): job start time, running job count, items processed, queue length, etc.
Batch jobs (e.g., MapReduce): completion time, stage durations, total runtime, record count, etc.
Subsystems such as libraries, logging, failures, thread pools, and caches may also require monitoring.
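For batch jobs, a common pattern is to record completion time and runtime in a dedicated registry and push it on exit, since short-lived jobs cannot be scraped. A minimal sketch, assuming a Pushgateway at a hypothetical address (metric and job names are illustrative):

```python
# Minimal batch-job metrics, pushed via a Pushgateway on completion.
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
JOB_DURATION = Gauge("mybatch_duration_seconds",
                     "Total runtime of the last batch run",
                     registry=registry)
LAST_SUCCESS = Gauge("mybatch_last_success_unixtime",
                     "Unix time of the last successful run",
                     registry=registry)

def run_batch():
    start = time.time()
    # ... process records ...
    JOB_DURATION.set(time.time() - start)
    LAST_SUCCESS.set_to_current_time()

run_batch()
# Push once when the job finishes (assumes a Pushgateway at this address):
# push_to_gateway("pushgateway.example:9091", job="mybatch", registry=registry)
```

Alerting on `time() - mybatch_last_success_unixtime` then catches jobs that stop succeeding, not just jobs that fail loudly.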
Choose Vectors
When grouping series into a vector, include only data of the same kind that differs along a single dimension, such as resource type or collection location, and keep units consistent within the vector.
Typical examples include request latency broken down by resource, latency by region, and error counts by HTTP request type.
Official guidance recommends using separate metrics for different operations (e.g., Read vs Write) rather than combining them.
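Following that guidance, the sketch below keeps read and write latency in separate histograms rather than one combined metric, since their latency profiles differ (metric names are illustrative):

```python
from prometheus_client import Histogram

# Separate metrics per operation: mixing reads and writes in one
# histogram would blur both distributions.
DB_READ_DURATION = Histogram("mydb_read_duration_seconds",
                             "Latency of read queries")
DB_WRITE_DURATION = Histogram("mydb_write_duration_seconds",
                              "Latency of write queries")

DB_READ_DURATION.observe(0.004)
DB_WRITE_DURATION.observe(0.012)
```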
Determine Labels
Common label choices include resource, region, and type. Values within a label dimension should be additive and comparable, and units must be consistent across the dimension.
Avoid mixing raw counts and aggregated totals in the same label—use PromQL aggregation or separate metrics for totals.
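As an illustration (hypothetical series), storing a precomputed total as just another label value makes aggregation double-count:

```
http_requests_total{handler="api"}     100
http_requests_total{handler="static"}   40
http_requests_total{handler="total"}   140   # anti-pattern: "total" is not a real handler

# Instead, keep only the raw series and let PromQL aggregate:
sum(http_requests_total)
```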
Name Metrics and Labels
Good names make the purpose clear. Metric names must match the pattern [a-zA-Z_:][a-zA-Z0-9_:]* and should carry an application or domain prefix plus a unit suffix where applicable (e.g., http_request_duration_seconds, node_memory_usage_bytes).
Label names should reflect the dimension they represent, for example region (shenzhen/guangzhou/beijing), owner (user1/user2), or stage (extract/transform/load).
Select Buckets
Well-chosen buckets improve the accuracy of histogram percentile estimates. Start from the default or exponential buckets, adjust them to the observed data distribution, and add finer buckets around critical percentiles (e.g., the 90th).
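For example, if the latency target sits near 250 ms, a sketch with buckets densified around that region (metric name and bounds are illustrative):

```python
from prometheus_client import Histogram

# Default buckets are sparse near 100-300ms; if the p90 target is
# ~250ms, finer buckets around it sharpen histogram_quantile estimates.
REQUEST_DURATION = Histogram(
    "myapp_request_duration_seconds",
    "Request latency",
    buckets=(0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.5, 1.0, 2.5, 5.0),
)

REQUEST_DURATION.observe(0.22)
```

The accuracy of `histogram_quantile(0.9, rate(myapp_request_duration_seconds_bucket[5m]))` depends directly on how narrow the buckets are around the p90 value, since the estimate interpolates within a bucket.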
Grafana Tips
View All Dimensions
To discover available dimensions, query only the metric name, with no calculations, and leave the legend format empty. The resulting view shows every raw series with its full label set, for example:

my_metric{label="a"} 1
my_metric{label="b"} 6
my_metric{label="total"} 7

Scale Synchronization
In the Settings panel, change the Graph Tooltip to "Shared crosshair" or "Shared Tooltip" so a linked cursor appears across panels, making it easier to correlate two metrics.

Source: https://lxkaka.wang/metrics-best-practice/#grafana-%E4%BD%BF%E7%94%A8%E6%8A%80%E5%B7%A7