
Mastering Application Monitoring with Prometheus: Practical Metrics and Best Practices

This guide walks through how to design and implement effective Prometheus metrics for various application types, covering golden metrics, label selection, naming conventions, histogram bucket choices, and Grafana visualization tricks to improve observability and operational insight.

Efficient Ops

Using Prometheus for Application Monitoring

This guide summarizes practical recommendations for designing Prometheus metrics based on common monitoring goals, application categories, and best‑practice naming conventions.

Core Monitoring Goals (Google’s Golden Metrics)

Latency – time to serve a request.

Traffic – request volume or data transferred.

Errors – rate of failed requests.

Saturation – degree to which a critical resource (e.g., memory, CPU) is constrained.

These metrics address four objectives: reflect user experience, measure throughput, aid fault detection, and expose system load.
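As a rough illustration of how the four signals relate to raw request data, the sketch below summarizes one observation window. The Request record, the golden_signals function, and the in_flight and capacity inputs are hypothetical names chosen for this example; a real service would normally export counters and histograms and let Prometheus derive these values at query time.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One served request (hypothetical record for this sketch)."""
    duration_seconds: float
    failed: bool


def golden_signals(requests: List[Request], window_seconds: float,
                   in_flight: int, capacity: int) -> Dict[str, float]:
    """Summarize one observation window into the four golden signals."""
    total = len(requests)
    failures = sum(1 for r in requests if r.failed)
    return {
        # Latency: mean time to serve a request in this window.
        "latency_seconds": (
            sum(r.duration_seconds for r in requests) / total if total else 0.0
        ),
        # Traffic: requests per second over the window.
        "traffic_rps": total / window_seconds,
        # Errors: fraction of requests that failed.
        "error_ratio": failures / total if total else 0.0,
        # Saturation: share of a constrained resource currently in use
        # (here, assumed to be concurrent request slots).
        "saturation_ratio": in_flight / capacity,
    }


window = [
    Request(0.1, False),
    Request(0.3, False),
    Request(0.2, True),
    Request(0.4, False),
]
signals = golden_signals(window, window_seconds=2.0, in_flight=5, capacity=20)
```

The mean is used for latency only to keep the sketch short; percentiles from a histogram are usually more informative for user-facing latency.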

Typical Measurement Objects by Application Type

Online‑serving systems: request count, error count, request latency, etc.

Offline processing systems: job start time, number of running jobs, items emitted, queue length, etc.

Batch jobs: completion timestamp, stage execution times, total duration, processed record count, etc.

Additional subsystems often monitored include:

Libraries – call count, successes, failures, latency.

Logging – number of log entries written.

Failures – error counts.

Thread pools – queued requests, active threads, total threads, task duration.

Caches – request count, hits, total latency.
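To make the cache entry above concrete, here is a minimal sketch of a cache wrapper that tracks request count, hits, and total latency. InstrumentedCache and its attribute names are assumptions made for this example, not part of any Prometheus client library; in practice these three values would be exported as counters.

```python
import time
from typing import Any, Callable, Dict, Hashable


class InstrumentedCache:
    """Read-through cache that counts requests, hits, and time spent."""

    def __init__(self, loader: Callable[[Hashable], Any]) -> None:
        self._loader = loader
        self._data: Dict[Hashable, Any] = {}
        self.requests_total = 0           # every lookup, hit or miss
        self.hits_total = 0               # lookups served from the cache
        self.latency_seconds_total = 0.0  # cumulative lookup time

    def get(self, key: Hashable) -> Any:
        start = time.perf_counter()
        self.requests_total += 1
        try:
            if key in self._data:
                self.hits_total += 1
                return self._data[key]
            value = self._loader(key)  # miss: fall through to the loader
            self._data[key] = value
            return value
        finally:
            # Record latency for hits and misses alike.
            self.latency_seconds_total += time.perf_counter() - start

    def hit_ratio(self) -> float:
        if self.requests_total == 0:
            return 0.0
        return self.hits_total / self.requests_total


cache = InstrumentedCache(loader=lambda key: key * 2)
cache.get(1)  # miss
cache.get(1)  # hit
cache.get(2)  # miss
```

Keeping hits and requests as two monotonically increasing counts, rather than storing a precomputed ratio, lets the hit ratio be computed over any time window at query time.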

Choosing Metric Vectors

A metric vector (one metric name whose series differ only in their label values) should hold measurements of the same kind, split by resource, location, or operation. Every series in a vector must use the same unit. Typical examples:

Request latency per resource.

Request latency per region.

HTTP error counts by status code.
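The last example, HTTP error counts by status code, can be sketched as a small labeled counter. CounterVec here is a hypothetical stand-in written for illustration, not a client-library class, and the metric name myapp_http_errors_total is likewise assumed. The render method prints samples in the same name{label="value"} value shape used elsewhere in this guide, and it skips the escaping that real label values may need.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class CounterVec:
    """One metric name, many series distinguished by label values."""

    def __init__(self, name: str, label_names: Tuple[str, ...]) -> None:
        self.name = name
        self.label_names = label_names
        self._values: Dict[Tuple[str, ...], float] = defaultdict(float)

    def inc(self, *label_values: str, amount: float = 1.0) -> None:
        if len(label_values) != len(self.label_names):
            raise ValueError("expected one value per label name")
        if amount < 0:
            raise ValueError("counters may only increase")
        self._values[label_values] += amount

    def render(self) -> List[str]:
        """Return one text line per series, sorted by label values."""
        lines = []
        for label_values, value in sorted(self._values.items()):
            labels = ",".join(
                f'{name}="{val}"'
                for name, val in zip(self.label_names, label_values)
            )
            lines.append(f"{self.name}{{{labels}}} {value:g}")
        return lines


errors = CounterVec("myapp_http_errors_total", ("code",))
errors.inc("500")
errors.inc("500")
errors.inc("404")
rendered = errors.render()
```

Every series in the vector counts the same thing (errors) in the same unit (occurrences); only the code label differs.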

Label Design

Common label dimensions include resource, region, and type. The values under a label should be additive and comparable: summing or averaging across them must give a meaningful result. For that reason, avoid mixing a total with per‑label values in the same metric.

my_metric{label="a"} 1
my_metric{label="b"} 6
# total should be computed with PromQL, not stored as a separate label value
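A short sketch shows why a stored total causes trouble. The dictionaries below stand in for the samples of my_metric, and sum_over_label is a hypothetical helper that mimics what a PromQL aggregation such as sum(my_metric) does across label values.

```python
from typing import Dict


def sum_over_label(samples: Dict[str, float]) -> float:
    """Add up every series of a metric, as an aggregation across labels would."""
    return sum(samples.values())


# Per-label values only: the aggregate is correct.
per_label = {"a": 1.0, "b": 6.0}
correct_total = sum_over_label(per_label)

# A "total" series stored alongside them gets counted a second time.
with_stored_total = {"a": 1.0, "b": 6.0, "total": 7.0}
double_counted = sum_over_label(with_stored_total)
```

Here correct_total is 7.0 while double_counted is 14.0, so every query would have to remember to exclude the extra series.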

Metric and Label Naming Conventions

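Metric names may contain letters (a‑z, A‑Z), digits, and underscores, and must not start with a digit. Colons are also permitted but are reserved for recording rules, so application metrics should not use them.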

Start with a domain‑specific prefix (e.g., myapp_).

End with a unit suffix that reflects the measurement unit, using base units such as seconds or bytes (e.g., http_request_duration_seconds, node_memory_usage_bytes).

Keep the name semantically aligned with the measured variable.

Example label values:

region: shenzhen, guangzhou, beijing

owner: user1, user2, user3

stage: extract, transform, load
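The naming rules above can be checked mechanically. The sketch below is one possible lint helper, not an official tool: check_metric_name, the prefix argument, and the UNIT_SUFFIXES tuple are assumptions made for this example. The regular expression is the classic Prometheus metric-name character set, [a-zA-Z_:][a-zA-Z0-9_:]*; the colon check reflects the convention that colons are left to recording rules.

```python
import re
from typing import List

# Classic Prometheus metric-name character set.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

# Illustrative suffix list only; extend it to match your own conventions.
UNIT_SUFFIXES = ("_seconds", "_bytes", "_total")


def check_metric_name(name: str, prefix: str) -> List[str]:
    """Return a list of naming problems; an empty list means the name passes."""
    problems = []
    if not METRIC_NAME_RE.fullmatch(name):
        problems.append("contains characters outside [a-zA-Z0-9_:]")
    if ":" in name:
        problems.append("colons are reserved for recording rules")
    if not name.startswith(prefix):
        problems.append(f"missing domain prefix {prefix!r}")
    if not name.endswith(UNIT_SUFFIXES):
        problems.append("missing a base-unit suffix such as _seconds or _bytes")
    return problems


good = check_metric_name("myapp_http_request_duration_seconds", "myapp_")
bad = check_metric_name("request-latency-ms", "myapp_")
```

In this run good is an empty list, while bad reports three problems: an invalid character (the hyphen), the missing prefix, and the missing base-unit suffix.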

Histogram Bucket Selection

Well‑chosen buckets improve percentile accuracy. Recommended practice:

Start with the default bucket set {0.005,0.01,0.025,0.05,0.1,0.25,0.5,1,2.5,5,10} or exponential buckets if the distribution is unknown.

Use narrower intervals where observations are dense and wider intervals where they are sparse.

For latency with a long tail, exponential buckets (e.g., ExponentialBuckets(0.01, 2, 10)) are often appropriate.

The first bucket's upper bound should cover roughly the first 10 % of observations; if detail at the low end of the distribution matters less, the first bound can be set higher.

To estimate a specific percentile accurately (e.g., the 90th), add finer buckets around the value where you expect that percentile to fall.
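The effect of bucket layout on percentile accuracy can be seen in a small experiment. In the sketch below, exponential_buckets mirrors the start, factor, count semantics of the ExponentialBuckets helper mentioned above, and estimate_quantile interpolates linearly inside the bucket that contains the target rank, in the spirit of PromQL's histogram_quantile. Both functions and the sample data are written for illustration and are not the library implementations.

```python
from typing import List, Sequence


def exponential_buckets(start: float, factor: float, count: int) -> List[float]:
    """Return `count` upper bounds: start, start*factor, start*factor**2, ..."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("need start > 0, factor > 1, count >= 1")
    return [start * factor ** i for i in range(count)]


def estimate_quantile(q: float, upper_bounds: Sequence[float],
                      observations: Sequence[float]) -> float:
    """Estimate the q-quantile from bucket counts using linear interpolation."""
    bounds = sorted(upper_bounds)
    rank = q * len(observations)
    lower_bound, lower_count = 0.0, 0
    for upper in bounds:
        # Cumulative count, as in a Prometheus histogram's "le" buckets.
        cumulative = sum(1 for o in observations if o <= upper)
        if cumulative >= rank and cumulative > lower_count:
            fraction = (rank - lower_count) / (cumulative - lower_count)
            return lower_bound + (upper - lower_bound) * fraction
        lower_bound, lower_count = upper, cumulative
    # Target rank lies beyond the last finite bucket: report that bound.
    return bounds[-1]


# Assumed sample: 95 fast requests at 0.2 s and 5 slow ones at 0.9 s,
# so the true 90th percentile is 0.2 s.
latencies = [0.2] * 95 + [0.9] * 5

coarse = estimate_quantile(0.9, [0.5, 1.0], latencies)
fine = estimate_quantile(0.9, [0.1, 0.2, 0.25, 0.5, 1.0], latencies)
tail_buckets = exponential_buckets(0.01, 2, 10)
```

With only two wide buckets, the estimate lands well above the true 0.2 s because the interpolation assumes observations are spread evenly between 0 and 0.5. Adding bounds near 0.2 pulls the estimate close to the true value, which is the reason for placing finer buckets around the percentile of interest.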

Grafana Tips for Exploring Metrics

Discover All Label Dimensions

Query the metric name without aggregation and leave the legend format empty. The returned series list reveals every label dimension.

Grafana label discovery

Shared Crosshair / Tooltip

In the dashboard's Settings page, change Graph Tooltip from “Default” to “Shared crosshair” or “Shared Tooltip”. This synchronizes the cursor and tooltip across all panels on the dashboard, which makes it easier to correlate related metrics at the same point in time.

Grafana tooltip settings
Grafana shared tooltip view