
Mastering Application Monitoring with Prometheus: Practical Tips and Best Practices

This guide explains how to design effective Prometheus metrics, choose appropriate monitoring objects, labels, and buckets, and leverage Grafana visualizations to gain deep insight into application performance across online services, offline processing, and batch jobs.


In this article, we describe how to use Prometheus to monitor applications, summarizing metric-design practices drawn from our experience and the official documentation.

Identify Monitoring Targets

Before designing metrics, clearly define what needs to be measured based on the problem context, requirements, and the system itself.

Start from Requirements

Google’s four golden signals of large‑scale distributed monitoring (latency, traffic, errors, and saturation) provide a solid reference for most monitoring scenarios.

Latency: Time taken to serve a request.

Traffic: Current demand on the system (e.g., requests per second), indicating capacity needs.

Errors: Rate of failed requests.

Saturation: Degree to which a critical resource (e.g., memory) is constraining the service.

These metrics address four monitoring goals:

Reflect user experience and core performance (e.g., request latency).

Show system throughput (e.g., request count, network packet size).

Help discover and locate faults (e.g., error count, failure rate).

Indicate system saturation and load (e.g., memory usage, queue length).

Beyond these common needs, additional metrics may be defined for specific scenarios, such as measuring the latency and failure count of a frequently called library interface.
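As an illustrative sketch of the four goals above, instrumentation with the Python prometheus_client library might look like this (all metric and label names here are hypothetical):

```python
# Sketch of metrics covering the four monitoring goals, using the Python
# prometheus_client library. Metric and label names are illustrative.
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

registry = CollectorRegistry()

# Latency: request duration in seconds (base unit).
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["path"],
    registry=registry,
)

# Throughput: total requests; rate() over this counter gives current load.
REQUESTS_TOTAL = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["path", "method"],
    registry=registry,
)

# Faults: failed requests only; the failure rate is derived in PromQL.
REQUEST_ERRORS_TOTAL = Counter(
    "http_request_errors_total",
    "Total failed HTTP requests",
    ["path"],
    registry=registry,
)

# Saturation: how loaded a critical resource is, e.g. a worker queue.
QUEUE_LENGTH = Gauge(
    "worker_queue_length", "Pending jobs in the worker queue", registry=registry
)

# Example request: time it, count it, and record queue depth.
with REQUEST_LATENCY.labels(path="/api").time():
    REQUESTS_TOTAL.labels(path="/api", method="GET").inc()
    QUEUE_LENGTH.set(3)
```

A dedicated CollectorRegistry is used here so the sketch is self-contained; real applications typically register metrics on the default registry exposed by start_http_server.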

Start from the System to Be Monitored

Official best practices categorize applications into three types:

Online‑serving systems (e.g., web servers) that require immediate responses.

Offline processing systems (e.g., Spark batch jobs) where the caller does not wait for a response.

Batch jobs (e.g., one‑off MapReduce tasks) that run to completion and then exit.

Typical measurement objects differ per category:

Online services: request count, error count, request latency.

Offline processing: job start time, number of running jobs, items processed, queue length.

Batch jobs: final execution time, stage durations, total runtime, record count.

Additional subsystems may also be monitored, such as libraries (call count, success/failure, latency), logging (log entry frequency), failures (error count), thread pools (queued requests, active threads), and caches (request count, hits, total latency).
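For a batch job, for instance, runtime, record count, and last success time can be recorded in a throwaway registry and pushed on exit. A sketch with prometheus_client follows; the job and its mybatch_* metric names are hypothetical:

```python
# Sketch of batch-job metrics with prometheus_client. The metric names
# ("mybatch_...") are hypothetical.
import time

from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
DURATION = Gauge(
    "mybatch_duration_seconds", "Total runtime of the batch job", registry=registry
)
RECORDS = Gauge(
    "mybatch_records_processed", "Records processed in the last run", registry=registry
)
LAST_SUCCESS = Gauge(
    "mybatch_last_success_timestamp_seconds",
    "Unix time of the last successful run",
    registry=registry,
)

start = time.time()
# ... the batch job's real work would happen here ...
RECORDS.set(1000)
DURATION.set(time.time() - start)
LAST_SUCCESS.set_to_current_time()

# Since a batch job exits before Prometheus can scrape it, these values
# would normally be sent to a Pushgateway via push_to_gateway(...).
```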

Choose Vectors

When grouping series into a metric vector, the measurements should be of the same kind, differing only in dimensions such as resource type or collection location, and units must stay consistent across the vector.

Determine Labels

Common label choices include resource, region, type, etc. Labels must be additive and comparable; units should be uniform (e.g., do not mix fan speed with voltage).

Avoid aggregating both per‑label and total values in the same metric; instead use PromQL aggregation on the server side or define a separate metric for totals.
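A minimal sketch of the per‑label approach (with prometheus_client; the app_requests_total metric is hypothetical): keep one series per label value and let the server compute totals.

```python
# Per-label counting only; totals are computed server-side with PromQL,
# e.g. sum(app_requests_total). The metric name is illustrative.
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
REQUESTS = Counter(
    "app_requests_total", "Requests by region", ["region"], registry=registry
)

REQUESTS.labels(region="shenzhen").inc(5)
REQUESTS.labels(region="beijing").inc(3)

# Anti-pattern: also incrementing a pseudo-value such as region="all"
# inside the same metric, which would double-count under sum().
```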

Name Metrics and Labels

Good naming makes purpose clear.

Metric Naming

Metric names must match the pattern [a-zA-Z_:][a-zA-Z0-9_:]*; colons are conventionally reserved for recording rules.

Include a domain prefix (e.g., prometheus_notifications_total).

Append a unit suffix (e.g., http_request_duration_seconds, node_memory_usage_bytes).

Prefer base units like seconds or bytes, not milliseconds or megabytes.
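A quick way to check a name against this pattern (a stdlib-only sketch; the helper name is my own):

```python
import re

# Metric-name pattern from the Prometheus data model; colons are
# conventionally reserved for recording rules.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*\Z")

def is_valid_metric_name(name: str) -> bool:
    """Return True if `name` is a syntactically valid metric name."""
    return METRIC_NAME_RE.match(name) is not None
```

For example, is_valid_metric_name("http_request_duration_seconds") is True, while "2xx_responses" is rejected because names cannot start with a digit.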

Label Naming

Base names on the chosen dimension, such as region: shenzhen/guangzhou/beijing, owner: user1/user2, or stage: extract/transform/load.

Bucket Selection

Appropriate buckets improve histogram percentile accuracy. Use the default buckets ({0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}, in seconds) or exponential buckets for latency data, adjusting based on the observed distribution and focusing on the percentile of interest (e.g., the 90th) if needed.
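For exponential buckets, a helper like the Go client's prometheus.ExponentialBuckets can be sketched in a few lines of Python (the function is my own; it is not part of the Python client):

```python
def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    # Mirrors the Go client's prometheus.ExponentialBuckets: `count`
    # upper bounds, beginning at `start`, each `factor` times the previous.
    buckets = []
    bound = start
    for _ in range(count):
        buckets.append(bound)
        bound *= factor
    return buckets

# e.g. exponential_buckets(0.005, 2, 6)
#   -> [0.005, 0.01, 0.02, 0.04, 0.08, 0.16]
```

The resulting list can be passed as the buckets argument when constructing a Histogram, widening or narrowing the range to match the observed latency distribution.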

Grafana Tips

View All Dimensions

To discover additional grouping dimensions, query only the metric name without calculations and leave the legend format empty; Grafana will display the raw metric data.

Ruler Linking

In the dashboard Settings panel, change Graph Tooltip to Shared crosshair or Shared Tooltip to synchronize the ruler across panels, aiding correlation analysis.

Tags: monitoring, observability, DevOps, metrics, Prometheus, Grafana
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
