Mastering Prometheus Metrics: Best Practices for Effective Monitoring
This article outlines practical guidelines for designing Prometheus metrics, covering how to define monitoring targets, choose appropriate vectors and labels, name metrics and labels correctly, select histogram buckets, and leverage Grafana features to visualize and troubleshoot data effectively.
Determine Monitoring Objects
Before designing specific metrics, you must first identify what you need to measure. The measurement target should be defined based on the problem context, requirements, and the system being monitored.
Start from Requirements
Google’s experience with large‑scale distributed monitoring identifies four “golden signals” that are generally applicable:
Latency: the time it takes to service a request.
Traffic: the demand placed on the system (e.g., requests per second), used to gauge capacity needs.
Errors: the count of failed requests, used to measure the error rate.
Saturation: how “full” the service is, focusing on its most constrained resource (e.g., memory).
These four metrics satisfy four monitoring needs:
Reflect user experience and core performance (e.g., request latency, job completion time).
Reflect system throughput (e.g., request count, network packet size).
Help discover and locate faults (e.g., error count, failure rate).
Reflect system saturation and load (e.g., memory usage, queue length).
Additional custom metrics may be added for specific scenarios.
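In Prometheus exposition format, the four signals might look like the following sample (metric names and values are illustrative, not from the source):

```
# Latency: request duration, typically a histogram
http_request_duration_seconds_sum 53.4
http_request_duration_seconds_count 1027
# Traffic: total requests served
http_requests_total{method="GET"} 1027
# Errors: failed requests, for computing an error rate
http_requests_failed_total{code="500"} 3
# Saturation: the most constrained resource, e.g. memory
process_resident_memory_bytes 4.1e+07
```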
Start from the Monitored System
The official best‑practice guide classifies applications into three categories:
Online‑serving systems: require immediate response, such as web servers.
Offline processing systems: do not require immediate response; jobs may be long‑running, e.g., Spark.
Batch jobs: one‑off tasks that finish and exit, e.g., MapReduce data analysis.
Typical measurement objects for each category are:
Online‑serving systems: request count, error count, request latency, etc.
Offline processing systems: job start time, number of jobs in progress, items emitted, queue length, etc.
Batch jobs: completion time, stage execution times, total duration, processed record count, etc.
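The batch-job measurements listed above can be collected with plain timers; a minimal stdlib sketch (metric names and the three-stage pipeline are hypothetical):

```python
import time

def run_batch_job(records):
    """Run a three-stage batch job and collect the measurements
    listed above: stage durations, total duration, records processed."""
    start = time.monotonic()
    stage_durations = {}
    processed = 0
    for stage in ("extract", "transform", "load"):
        t0 = time.monotonic()
        for _ in records:               # placeholder for real stage work
            processed += stage == "load"  # count records emitted by the final stage
        stage_durations[stage] = time.monotonic() - t0
    return {
        "job_stage_duration_seconds": stage_durations,
        "job_duration_seconds": time.monotonic() - start,
        "job_processed_records_total": processed,
    }

m = run_batch_job([1, 2, 3])
```

Since a batch job exits when it finishes, these values would typically be pushed to a gateway rather than scraped.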
In addition to the main system, you may need to monitor subsystems:
Libraries: call count, success count, error count, latency.
Logging: count each log entry to determine frequency and timing.
Failures: error count.
Thread pools: queued requests, active threads, total threads, latency, tasks in progress.
Caches: request count, hit count, total latency, etc.
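The cache case can be sketched in plain Python (a hypothetical in-process cache, not the official client library); the two counters correspond to what you might export as `cache_requests_total` and `cache_hits_total`:

```python
class InstrumentedCache:
    """A tiny cache that counts requests and hits, mirroring the
    counters you would export as cache_requests_total / cache_hits_total."""

    def __init__(self):
        self._store = {}
        self.requests_total = 0  # incremented on every lookup
        self.hits_total = 0      # incremented on lookups that found a value

    def get(self, key):
        self.requests_total += 1
        if key in self._store:
            self.hits_total += 1
            return self._store[key]
        return None

    def put(self, key, value):
        self._store[key] = value

cache = InstrumentedCache()
cache.put("a", 1)
cache.get("a")   # hit
cache.get("b")   # miss
hit_rate = cache.hits_total / cache.requests_total
```

Exporting the raw counters (rather than a precomputed hit rate) lets the server derive rates over any time window.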
Choose Vectors
Principles for selecting vectors:
The series in a vector measure the same kind of data but differ in dimensions such as resource or collection location.
All series within a vector share the same unit.
Examples:
Request latency for different resource objects.
Request latency for different geographic regions.
Error counts for different HTTP request types.
…
The official documentation also recommends using separate metrics for different operations (e.g., Read/Write, Send/Receive) rather than combining them into a single metric, because they are usually observed separately.
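For example (hypothetical names, shown in exposition format), prefer one metric per operation over a single metric that merges distinct operations:

```
# Preferred: separate metrics per operation
queue_messages_sent_total      319
queue_messages_received_total  1027

# Avoid: merging send/receive into one metric
queue_messages_total{operation="sent"}      319
queue_messages_total{operation="received"}  1027
```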
When measuring requests, labels are typically used to distinguish different actions.
Determine Labels
Common label choices include:
resource
region
type
…
A key principle is that data for the same label dimension must be additive and comparable, meaning units must be consistent (e.g., you cannot mix fan speed with voltage in the same label).
Bad practice: mixing per‑label values and the total in a single metric:

```
my_metric{label="a"} 1
my_metric{label="b"} 6
my_metric{label="total"} 7
```

Instead, aggregate totals on the server side with PromQL (e.g., `sum(my_metric)`), or use a separate metric for the total.
Naming Metrics and Labels
Metric Naming
Metric names must match the pattern [a-zA-Z_:][a-zA-Z0-9_:]*.
Include a prefix that indicates the domain, e.g.,
prometheus_notifications_total,
process_cpu_seconds_total,
ipamd_request_latency.
Include a unit suffix to show the metric’s unit, e.g.,
http_request_duration_seconds,
node_memory_usage_bytes,
http_requests_total (a unit‑less count; counters use the _total suffix).
The name should reflect the meaning of the measured variable.
Prefer base units (seconds, bytes) over derived ones (milliseconds, megabytes).
Label Naming
Name labels based on the chosen dimension, for example:
region: shenzhen, guangzhou, beijing
owner: user1, user2, user3
stage: extract, transform, load
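The naming rules above can be checked mechanically. A small sketch using Python’s standard `re` module; the patterns follow the Prometheus data‑model documentation (note that colons are allowed in metric names but reserved for recording rules, and not allowed in label names):

```python
import re

# Metric names: [a-zA-Z_:][a-zA-Z0-9_:]*
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
# Label names: [a-zA-Z_][a-zA-Z0-9_]*
LABEL_NAME_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

def is_valid_metric_name(name: str) -> bool:
    return METRIC_NAME_RE.match(name) is not None

def is_valid_label_name(name: str) -> bool:
    return LABEL_NAME_RE.match(name) is not None

print(is_valid_metric_name("http_request_duration_seconds"))  # True
print(is_valid_metric_name("2xx_responses"))                  # False: starts with a digit
print(is_valid_label_name("region"))                          # True
```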
Choosing Buckets
Appropriate buckets make histogram percentile calculations more accurate. Ideally, bucket boundaries produce roughly equal counts per bucket.
If the data distribution is unknown, start with default buckets ({0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}) or exponential buckets (1, 2, 4, 8, …) and adjust after observing the data.
Narrow bucket intervals where data is dense; widen them where data is sparse.
For most latency data, which has a long‑tail characteristic, exponential buckets are suitable.
The initial bucket upper bound should cover roughly 10% of the data; it can be larger if the head of the distribution is not critical.
To compute a specific percentile accurately (e.g., 90%), add more buckets around that percentile by reducing the interval.
In practice, we estimate bucket ranges based on observed task durations, deploy, observe, and iteratively adjust until the buckets are appropriate.
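Exponential buckets are easy to generate. This standalone sketch mirrors the behavior of the `exponential_buckets` helpers found in official client libraries (the function here is a hypothetical reimplementation, not the library API):

```python
def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    """Return `count` bucket upper bounds beginning at `start`, each
    `factor` times the previous one. The +Inf bucket is implicit."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("require start > 0, factor > 1, count >= 1")
    buckets = []
    bound = start
    for _ in range(count):
        buckets.append(bound)
        bound *= factor
    return buckets

print(exponential_buckets(0.005, 2, 5))  # [0.005, 0.01, 0.02, 0.04, 0.08]
```

To sharpen a specific percentile, you can concatenate a coarse exponential range with a few hand-placed bounds around the region of interest.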
Grafana Tips
View All Dimensions
To discover additional grouping dimensions, keep only the metric name in the query (no calculations) and leave the Legend format empty. Grafana will then display the raw metric data with all available label dimensions.
Scale Linking
In the dashboard Settings, the Graph Tooltip option defaults to “Default”. Switching it to “Shared crosshair” or “Shared Tooltip” links the cursor position (and tooltip) across panels, making it easier to correlate multiple metrics during troubleshooting.
Original source: https://lxkaka.wang/metrics-best-practice/