Mastering Application Monitoring with Prometheus: Practical Tips and Best Practices

This article explains how to design effective Prometheus metrics, choose appropriate vectors, labels, buckets, and naming conventions, and offers Grafana usage tricks to help engineers monitor online services, batch jobs, and offline processing systems with clear, actionable insights.

Efficient Ops

This article introduces how to use Prometheus for application monitoring and summarizes metrics practices drawn from our experience and the official documentation.

Identify Monitoring Targets

Before designing metrics, clearly define what needs to be measured based on the problem context, requirements, and the system itself.

From Requirements

Google's four golden signals for monitoring large-scale distributed systems are latency, traffic, errors, and saturation. They cover most general monitoring needs:

Latency: The time it takes to serve a request.

Traffic: The current demand on the system, which indicates capacity needs.

Errors: The rate of requests that fail.

Saturation: The degree to which a critical resource (e.g., memory) limits the service.

These metrics address four monitoring goals: reflecting user experience, measuring throughput, helping locate faults, and indicating saturation/load.

Additional custom metrics may be needed for specific scenarios, such as measuring the latency and failure count of a frequently called library.
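As a rough illustration of the library case above, the following Python sketch wraps a function so that each call updates a call count, a failure count, and a running duration sum. The decorator name instrument, the three dictionaries, and the parse_config function are all hypothetical; a real service would more likely use a Prometheus client library than hand-rolled tallies.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process tallies, keyed by the instrumented function's name.
calls_total = defaultdict(int)
failures_total = defaultdict(int)
duration_seconds_sum = defaultdict(float)


def instrument(name):
    """Record traffic, errors, and latency for the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            calls_total[name] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                failures_total[name] += 1
                raise
            finally:
                # Runs on success and on failure, so every call is timed.
                duration_seconds_sum[name] += time.perf_counter() - start
        return wrapper
    return decorator


@instrument("parse_config")
def parse_config(text):
    """Stand-in for a frequently called library function."""
    if not text:
        raise ValueError("empty config")
    return dict(line.split("=", 1) for line in text.splitlines())


parse_config("a=1\nb=2")
try:
    parse_config("")
except ValueError:
    pass

# One failure out of two calls.
error_ratio = failures_total["parse_config"] / calls_total["parse_config"]
```

Counting calls and failures separately, rather than storing a precomputed ratio, keeps both values additive, so the error ratio can be derived over any time window later.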

From System Types

The official Prometheus instrumentation best practices classify applications into three categories, each with its own typical things to measure:

Online‑serving systems (e.g., web servers): request count, error count, latency, etc.

Offline processing systems (e.g., Spark batch jobs): job start time, running job count, items processed, queue length, etc.

Batch jobs (e.g., MapReduce): completion time, stage durations, total runtime, record count, etc.

Subsystems such as libraries, logging, failures, thread pools, and caches may also require monitoring.

Choose Vectors

Group measurements into one vector when they are the same kind of data and differ only in resource type or collection location, and keep the unit consistent within a vector.

Examples include request latency across different resources, request latency across regions, and error counts across different HTTP requests.

Official guidance recommends using separate metrics for different operations (e.g., Read vs Write) rather than combining them.
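To make the idea of a vector concrete, here is a minimal Python sketch of a labeled counter: one metric name, many label sets, one unit. The class CounterVec and the metric names storage_read_bytes_total and storage_write_bytes_total are invented for illustration and are not a real client library API. Reads and writes get separate metrics, while regions share one vector.

```python
from collections import defaultdict


class CounterVec:
    """Toy stand-in for a labeled counter (hypothetical, not a real client API)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self.values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Every sample in the vector must carry exactly the declared labels.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(labels[n] for n in self.label_names)
        self.values[key] += amount

    def render(self):
        """Render samples in the Prometheus text exposition style."""
        lines = []
        for key, value in sorted(self.values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines)


# Separate metrics per operation; the region label distinguishes collection location.
reads = CounterVec("storage_read_bytes_total", ["region"])
writes = CounterVec("storage_write_bytes_total", ["region"])

reads.inc(512, region="shenzhen")
reads.inc(256, region="beijing")
writes.inc(128, region="shenzhen")

print(reads.render())
print(writes.render())
```

Because every series in reads is measured in bytes and differs only by region, summing or comparing across the vector stays meaningful.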

Determine Labels

Common label choices include resource, region, and type. Labels should be additive and comparable; units must be consistent within a label dimension.

Avoid mixing per-value counts and an aggregated total under the same label. Compute totals with PromQL aggregation, or expose them as a separate metric.
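The short Python sketch below shows why a stored total breaks additivity. The sample values and the helper total are illustrative assumptions; in Prometheus itself the equivalent aggregation would be a query such as sum(my_metric).

```python
# Per-label samples only: no pre-aggregated "total" series is stored.
samples = {("my_metric", "a"): 1, ("my_metric", "b"): 6}


def total(series, metric):
    """Sum every sample of one metric, as an aggregation over the label would."""
    return sum(value for (name, _label), value in series.items() if name == metric)


# Anti-pattern: the total is exposed as just another label value.
mixed = dict(samples)
mixed[("my_metric", "total")] = 7

clean_total = total(samples, "my_metric")  # counts each observation once
mixed_total = total(mixed, "my_metric")    # counts every observation twice
```

With the total mixed in, any aggregation over the label has to remember to exclude it, which is easy to forget in dashboards and alerts.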

Name Metrics and Labels

Good names make a metric's purpose clear. Metric names should match the pattern [a-zA-Z_:][a-zA-Z0-9_:]* and include a domain prefix and, when applicable, a unit suffix (e.g., http_request_duration_seconds, node_memory_usage_bytes).

Label names should reflect the dimension they represent, such as region: shenzhen/guangzhou/beijing, owner: user1/user2, or stage: extract/transform/load.
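A naming check along these lines can be sketched in a few lines of Python. The regular expressions follow the traditional character rules from the Prometheus data model documentation, which also reserves label names starting with a double underscore for internal use. The function names and the short suffix list are assumptions made for this example, not an official list.

```python
import re

# Traditional Prometheus character rules for metric and label names.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

# Illustrative, non-exhaustive suffixes used by this sketch only.
COMMON_SUFFIXES = ("_seconds", "_bytes", "_total", "_ratio")


def is_valid_metric_name(name):
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name):
    # Names beginning with "__" are reserved for internal use.
    return LABEL_NAME_RE.fullmatch(name) is not None and not name.startswith("__")


def has_common_suffix(name):
    """Heuristic hint only: flags names that lack a familiar unit or type suffix."""
    return name.endswith(COMMON_SUFFIXES)


for candidate in ("http_request_duration_seconds", "http-request-duration", "2xx_responses"):
    print(candidate, is_valid_metric_name(candidate), has_common_suffix(candidate))
```

A check like this fits naturally into a unit test or code review script, so naming problems are caught before a metric ships and dashboards start depending on it.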

Select Buckets

Appropriate buckets improve the accuracy of percentiles estimated from a histogram. Start with the default buckets or exponential buckets, then adjust them to the observed data distribution and add finer buckets around the percentiles you care about (e.g., the 90th).
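The following Python sketch shows, under simplifying assumptions, how bucket layout feeds into a percentile estimate. The helper exponential_buckets is modeled loosely on the exponential bucket helpers found in client libraries, and estimate_quantile uses linear interpolation inside the matching bucket, which is the general idea behind histogram-based quantile estimates. Both function names are hypothetical, and the sketch ignores the +Inf bucket by assuming every observation falls at or below the last bound.

```python
def exponential_buckets(start, factor, count):
    """Return `count` upper bounds: start, start*factor, start*factor**2, ..."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("need start > 0, factor > 1, count >= 1")
    return [start * factor ** i for i in range(count)]


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate the q-quantile from cumulative bucket counts.

    Observations are assumed to be spread evenly inside each bucket, so the
    estimate can be off by up to the width of the bucket that holds the rank.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative_counts[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper, count in zip(upper_bounds, cumulative_counts):
        if count >= rank:
            in_bucket = count - lower_count
            return lower_bound + (upper - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper, count
    return upper_bounds[-1]


bounds = exponential_buckets(0.1, 2, 4)  # upper bounds in seconds
counts = [10, 40, 90, 100]               # cumulative observations per bucket
median = estimate_quantile(0.5, bounds, counts)  # roughly 0.24
```

Here the median falls in the bucket between 0.2 and 0.4 seconds, so the estimate can only be as precise as that bucket is narrow. Splitting the bucket that contains an important percentile is what tightens the estimate.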

Grafana Tips

View All Dimensions

To discover available dimensions, query only the metric name without calculations and leave the legend format empty. The resulting view shows raw metric data.

Scale Synchronization

In the dashboard settings, set Graph Tooltip to "Shared crosshair" or "Shared Tooltip" so that the hover position is linked across panels, making it easier to correlate two metrics.

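Example of the label anti-pattern described under Determine Labels, where an aggregated total is exposed as just another label value:

my_metric{label="a"} 1
my_metric{label="b"} 6
my_metric{label="total"} 7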

Source: https://lxkaka.wang/metrics-best-practice/#grafana-%E4%BD%BF%E7%94%A8%E6%8A%80%E5%B7%A7
