
Mastering Application Monitoring with Prometheus: Practical Metrics and Grafana Tips

This article explains how to design effective Prometheus metrics for different application types; how to choose vectors, labels, and histogram buckets; and shares Grafana tips for exploring label dimensions and linking tooltips across panels.


Using Prometheus for Application Monitoring: Practices

In this article we introduce how to use Prometheus to monitor applications and share practical metric-design guidance drawn from our experience and the official documentation.

Determine Monitoring Targets

Before designing metrics, clearly define what needs to be measured according to the problem background, requirements, and the system itself.

From Requirements

Google’s four golden signals for monitoring large-scale distributed systems are latency, traffic, errors, and saturation; they serve as a useful reference for most monitoring scenarios.

Latency: the time taken to service a request.

Traffic: the current demand on the system, indicating capacity needs.

Errors: the rate of failed requests.

Saturation: how close a constrained resource (e.g., memory) is to its limit.

These metrics address four monitoring needs:

Reflect user experience and core performance (e.g., online latency, job completion time).

Reflect system throughput (e.g., request count, network packet size).

Help discover and locate faults (e.g., error count, failure rate).

Reflect system saturation and load (e.g., memory usage, queue length).

Additional custom metrics can be defined for specific scenarios, such as measuring the latency and failure count of a frequently called library interface.
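The four golden signals above can be sketched as PromQL queries. The metric names here (a counter http_requests_total with a code label, a histogram http_request_duration_seconds, and node_exporter memory metrics) are assumptions for illustration, not names from this article:

```promql
# Latency: 99th-percentile request duration over the last 5 minutes
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: share of requests returning a 5xx status
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: fraction of memory in use (node_exporter metric names)
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```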

From System Types

The official best‑practice guide classifies applications into three categories:

Online‑serving systems (e.g., web servers) – need request latency, error count, etc.

Offline processing systems (e.g., Spark) – monitor job start time, running jobs, items processed, queue length.

Batch jobs (e.g., MapReduce) – track final execution time, stage durations, record counts.

Subsystems such as libraries, logging, failures, thread pools, and caches also have specific metrics (e.g., call count, success/failure count, log entry rate, thread usage, cache hit rate).

Choose Vectors

Select vectors when the data type is similar but the resource type or collection location differs, and ensure uniform units within a vector. Examples include request latency across different resources, regional server latency, or HTTP error counts.

The documentation recommends using separate metrics for different operations (e.g., read vs. write) rather than aggregating them.
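In exposition-format terms, that recommendation looks like the following sketch (the storage_* metric names are made up for illustration):

```
# Preferred: one metric per operation, since reads and writes
# often have very different performance profiles
storage_read_requests_total   1024
storage_write_requests_total   256

# Avoid: folding distinct operations into one metric via a label
# when you would never aggregate them together
storage_requests_total{operation="read"}  1024
storage_requests_total{operation="write"}  256
```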

Determine Labels

Common label choices are resource, region, type, and so on. Labels should be additive and comparable, and units must be consistent within a label dimension.

Avoid mixing total and partial counts in the same label, as shown below:

```
my_metric{label="a"}     1
my_metric{label="b"}     6
my_metric{label="total"} 7
```

Instead, aggregate totals on the server side with PromQL or use a separate metric for totals.
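For example, once the total label value is dropped, the total can be computed at query time:

```promql
# Sum over all label values of my_metric to get the total
sum(my_metric)
```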

Name Metrics and Labels

Good naming makes purpose clear. Metric names must match the pattern [a-zA-Z_:][a-zA-Z0-9_:]* (colons are reserved for recording rules) and should include a domain prefix and a unit suffix (e.g., http_request_duration_seconds, node_memory_usage_bytes). Use base units such as seconds or bytes rather than milliseconds or megabytes.

Label names should reflect the chosen dimension, such as region (shenzhen/guangzhou/beijing), owner (user1/user2), or stage (extract/transform/load).
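Putting the naming rules together, well-formed series might look like the following (the names combine the conventions above and are illustrative only):

```
# domain prefix + measured quantity + base-unit suffix, plus dimension labels
http_request_duration_seconds_sum{region="shenzhen"}  1532.8
etl_records_processed_total{stage="extract"}          98341
node_memory_usage_bytes{owner="user1"}                1073741824
```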

Select Buckets for Histograms

Appropriate buckets improve percentile accuracy. Start with default buckets ({0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}) or exponential buckets, then adjust based on observed distribution. Narrow buckets where data is dense, widen where sparse. For long‑tail latency data, exponential buckets are often suitable.
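As a sketch of the exponential case, the helper below generates geometrically growing bucket boundaries. The function name and signature are modeled on the Go client library's ExponentialBuckets helper and are otherwise an assumption, not part of any particular client API:

```python
def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    """Return `count` histogram bucket upper bounds, starting at `start`,
    with each boundary `factor` times the previous one."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("require start > 0, factor > 1, count >= 1")
    return [start * factor ** i for i in range(count)]

# Long-tail latencies: 5 ms doubling up to ~5 s across 11 buckets
print(exponential_buckets(0.005, 2, 11))
```

Doubling buckets keep roughly constant relative error across the range, which is why they suit long-tailed latency distributions better than evenly spaced boundaries.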

Grafana Usage Tips

View All Dimensions

To discover additional grouping dimensions, query only the metric name without any aggregation and leave the legend format empty. This displays the raw metric data.
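For example (http_requests_total is an illustrative metric name), the bare query returns one series per label combination, so every dimension shows up in the legend, whereas an aggregated query collapses them:

```promql
# Shows every series and therefore every label dimension
http_requests_total

# Aggregation hides the dimensions you may be trying to discover
sum(rate(http_requests_total[5m]))
```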

Scale Linking

In Grafana’s dashboard Settings, change Graph Tooltip from Default to Shared crosshair or Shared Tooltip. This links the cursor position across panels, making it easier to correlate two metrics during troubleshooting.

Tags: Monitoring, Observability, Metrics, Best Practices, Prometheus, Grafana
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career.
