
Mastering Prometheus Metrics: Best Practices for Effective Monitoring

This article outlines practical guidelines for designing Prometheus metrics, covering how to define monitoring targets, choose appropriate vectors and labels, name metrics and labels correctly, select histogram buckets, and leverage Grafana features to visualize and troubleshoot data effectively.


Determine Monitoring Objects

Before designing specific metrics, you must first identify what you need to measure. The measurement target should be defined based on the problem context, requirements, and the system being monitored.

Start from Requirements

Google’s experience with large‑scale distributed monitoring identifies four golden signals that are generally applicable:

Latency: the time it takes to service a request.

Traffic: the demand placed on the system (e.g., requests per second), used to gauge capacity needs.

Errors: the count and rate of failed requests.

Saturation: how “full” the service is, focusing on its most constrained resource (e.g., memory).

These four metrics satisfy four monitoring needs:

Reflect user experience and core performance (e.g., request latency, job completion time).

Reflect system throughput (e.g., request count, network packet size).

Help discover and locate faults (e.g., error count, failure rate).

Reflect system saturation and load (e.g., memory usage, queue length). Additional custom metrics may be added for specific scenarios.
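As a concrete illustration, the four signals map naturally onto PromQL queries. The metric names below (http_request_duration_seconds, http_requests_total, and the container memory metrics commonly exposed by cAdvisor) follow widespread conventions but are assumptions about what your services actually export:

```promql
# Latency: 99th-percentile request duration over the last 5 minutes
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning a 5xx status
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: memory usage relative to the configured limit
container_memory_working_set_bytes / container_spec_memory_limit_bytes
```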

Start from the Monitored System

The official best‑practice guide classifies applications into three categories:

Online‑serving systems: require immediate response, such as web servers.

Offline processing systems: do not require immediate response; jobs may be long‑running, e.g., Spark.

Batch jobs: one‑off tasks that finish and exit, e.g., MapReduce data analysis.

Typical measurement objects for each category are:

Online‑serving systems: request count, error count, request latency, etc.

Offline processing systems: job start time, number of jobs in progress, items emitted, queue length, etc.

Batch jobs: completion time, stage execution times, total duration, processed record count, etc.

In addition to the main system, you may need to monitor subsystems:

Libraries: call count, success count, error count, latency.

Logging: count each log entry to determine frequency and timing.

Failures: error count.

Thread pools: queued requests, active threads, total threads, latency, tasks in progress.

Caches: request count, hit count, total latency, etc.

Choose Vectors

Principles for selecting vectors:

The measured quantities are of the same kind, but the resource, collection location, or another dimension differs.

All data within a vector share the same unit.

Examples:

Request latency for different resource objects.

Request latency for different geographic regions.

Error counts for different HTTP request types.

The official documentation also recommends using separate metrics for different operations (e.g., Read/Write, Send/Receive) rather than combining them into a single metric, because they are usually observed separately.

When measuring requests, labels are typically used to distinguish different actions.
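In the Prometheus exposition format, the two patterns look like this (the metric names are hypothetical):

```
# Recommended: one metric per operation, since reads and writes
# are usually observed separately
storage_read_bytes_total 512
storage_write_bytes_total 1024

# A label-based alternative; use it only when the actions really
# are aggregated and compared together
http_requests_total{method="GET"} 100
http_requests_total{method="POST"} 30
```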

Determine Labels

Common label choices include:

resource

region

type

A key principle is that data for the same label dimension must be additive and comparable, meaning units must be consistent (e.g., you cannot mix fan speed with voltage in the same label).

Bad practice: mixing per‑label and total values in one metric.

my_metric{label="a"} 1
my_metric{label="b"} 6
my_metric{label="total"} 7

Instead, aggregate totals on the server side with PromQL or use a separate metric for the total.
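For instance, the total can be derived at query time with PromQL instead of being exported as an extra label value:

```promql
# Overall total, computed on the server side
sum(my_metric)

# Per-label breakdown remains available
sum(my_metric) by (label)
```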

Naming Metrics and Labels

Metric Naming

Metric names must match the regex [a-zA-Z_:][a-zA-Z0-9_:]* (colons are reserved for recording rules).

Include a prefix that indicates the domain, e.g., prometheus_notifications_total, process_cpu_seconds_total, ipamd_request_latency.

Include a unit suffix that shows the metric’s unit, e.g., http_request_duration_seconds, node_memory_usage_bytes, http_requests_total (a unit‑less count).

The name should reflect the meaning of the measured variable.

Prefer base units (seconds, bytes) over derived ones (milliseconds, megabytes).
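Putting these rules together, a few illustrative names (all hypothetical):

```
# Good: domain prefix, base unit, descriptive
myapp_http_request_duration_seconds
myapp_queue_length

# Avoid: derived unit and missing domain prefix
request_duration_millis
memory_mb
```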

Label Naming

Name labels based on the chosen dimension, for example:

region: shenzhen, guangzhou, beijing

owner: user1, user2, user3

stage: extract, transform, load

Choosing Buckets

Appropriate buckets make histogram percentile calculations more accurate. Ideally, bucket boundaries produce roughly equal counts per bucket.

If the data distribution is unknown, start with default buckets ({0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}) or exponential buckets (1, 2, 4, 8, …) and adjust after observing the data.

Narrow bucket intervals where data is dense; widen them where data is sparse.

For most latency data, which has a long‑tail characteristic, exponential buckets are suitable.

The initial bucket upper bound should cover roughly 10% of the data; it can be larger if the head of the distribution is not critical.

To compute a specific percentile accurately (e.g., 90%), add more buckets around that percentile by reducing the interval.

In practice, we estimate bucket ranges based on observed task durations, deploy, observe, and iteratively adjust until the buckets are appropriate.
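As a minimal sketch of this iterate-and-adjust approach, the helper below generates exponential bucket upper bounds (it mirrors the shape of client_golang's ExponentialBuckets helper, but the function name and parameters here are our own):

```python
def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    """Return `count` histogram bucket upper bounds growing geometrically.

    Suits long-tailed latency data: buckets are narrow where samples
    are dense (small latencies) and wide out in the tail.
    """
    bounds = []
    bound = start
    for _ in range(count):
        bounds.append(bound)
        bound *= factor
    return bounds

# Start from a guess based on observed task durations, deploy,
# then widen or subdivide after watching the real distribution.
buckets = exponential_buckets(0.005, 2, 10)
```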

Grafana Tips

View All Dimensions

To discover additional grouping dimensions, keep only the metric name in the query (no calculations) and leave the Legend format empty. Grafana will then display the raw metric data with all available label dimensions.

Scale Linking

In the dashboard Settings, the Graph Tooltip option defaults to “Default”. Switching it to “Shared crosshair” or “Shared Tooltip” links the cursor position across panels, making it easier to correlate multiple metrics during troubleshooting.

Original source: https://lxkaka.wang/metrics-best-practice/

Tags: Monitoring, observability, metrics, Prometheus, Grafana
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
