Mastering Prometheus: Practical Tips for Effective Application Monitoring
This article explains how to design and implement Prometheus metrics for application monitoring, covering the selection of monitoring targets, golden metrics, label conventions, naming rules, histogram bucket choices, and Grafana visualization tricks to help engineers build reliable observability pipelines.
Using Prometheus for Application Monitoring
In this article we introduce how to use Prometheus to monitor applications and share practical metrics recommendations based on our experience and the official documentation.
Identify Monitoring Targets
Before designing metrics, clearly define what needs to be measured based on the problem context, requirements, and the system itself.
From Requirements
Google's four golden signals for monitoring large-scale distributed systems are latency, traffic, errors, and saturation; they serve as useful references for most scenarios.
Latency: time taken to serve a request.
Traffic: volume of requests, indicating capacity needs.
Errors: rate of failed requests.
Saturation: degree to which a critical resource (e.g., memory) is constrained.
These metrics address four monitoring goals:
Reflect user experience and core performance (e.g., latency, job completion time).
Measure system throughput (e.g., request count, network packet size).
Help discover and locate faults (e.g., error count, failure rate).
Show system saturation and load (e.g., memory usage, queue length).
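The four signals above can be instrumented with the official Python client. A minimal sketch, assuming `prometheus_client` is installed; the metric names here are illustrative, not from the article:

```python
from prometheus_client import Counter, Gauge, Histogram

# Illustrative names; in practice, prefix with your own domain.
REQUESTS = Counter("myapp_http_requests_total", "Traffic: requests served", ["path"])
ERRORS = Counter("myapp_http_errors_total", "Errors: failed requests", ["path"])
LATENCY = Histogram("myapp_http_request_duration_seconds", "Latency per request", ["path"])
IN_FLIGHT = Gauge("myapp_http_in_flight_requests", "Saturation: concurrent requests")

def handle(path: str) -> None:
    IN_FLIGHT.inc()
    try:
        with LATENCY.labels(path).time():  # observes the duration on exit
            REQUESTS.labels(path).inc()
            # ... real request handling would go here ...
    except Exception:
        ERRORS.labels(path).inc()
        raise
    finally:
        IN_FLIGHT.dec()

handle("/api/v1/items")
```

Exposing these via `prometheus_client.start_http_server()` is then enough for Prometheus to scrape all four signals from one endpoint.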
Additional custom metrics may be added for specific scenarios, such as measuring the latency and failure count of a frequently called library interface.
From the System Type
Official best practices classify applications into three categories, each with typical measurement objects:
Online‑serving systems: request count, error count, latency, etc.
Offline processing systems: job start time, number of running jobs, items processed, queue length.
Batch jobs: final execution time, stage durations, total runtime, record count.
Sub-systems are also worth monitoring:
Libraries: call count, success/failure count, latency.
Logging: frequency of log entries.
Failures: error count.
Thread pools: queued requests, number of active threads, total threads.
Caches: request count, hits, total latency.
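The library pattern above can be sketched with a wrapper. To keep the example self-contained, plain Python dicts stand in for real `prometheus_client` Counter/Histogram objects, and the function names are made up:

```python
import time
from collections import defaultdict

# Stand-ins for client metrics; in production these would be
# Counter/Histogram objects such as lib_calls_total, lib_failures_total,
# and lib_call_duration_seconds.
calls = defaultdict(int)
failures = defaultdict(int)
latency_sum = defaultdict(float)

def instrumented(func):
    """Record call count, failure count, and cumulative latency per function."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            failures[func.__name__] += 1
            raise
        finally:
            calls[func.__name__] += 1
            latency_sum[func.__name__] += time.perf_counter() - start
    return wrapper

@instrumented
def lookup(key):
    if key is None:
        raise ValueError("missing key")
    return key.upper()

lookup("a")
try:
    lookup(None)  # counted as one call and one failure
except ValueError:
    pass
```

The same wrapper shape works for caches and thread-pool submissions: every interesting call site goes through one instrumented choke point.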
Select Metric Vectors
A metric vector should group series that share the same data type and unit but differ in resource or collection point; keep units uniform within a vector. Examples include measuring request latency across different resources, regions, or HTTP error types.
The official docs recommend using separate metrics for different operations (e.g., read vs. write) rather than combining them.
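As a sketch of that recommendation (metric names are hypothetical), separate read and write metrics keep aggregation meaningful, whereas folding both into one metric invites queries that sum unlike operations:

```
# Preferred: one metric per operation type
myapp_read_duration_seconds_sum   12.7
myapp_write_duration_seconds_sum   3.1

# Avoid: a single metric where summing across the label mixes operations
myapp_operation_duration_seconds_sum{operation="read"}  12.7
myapp_operation_duration_seconds_sum{operation="write"}  3.1
```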
Determine Labels
Common label choices include resource, region, type, etc. Labels should be additive and comparable, and units must be consistent within a label dimension. Avoid mixing raw counts and totals in the same label set; instead aggregate totals with PromQL or use a separate metric.
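Aggregating at query time can be sketched as follows (metric and label names are illustrative): the application exports only the per-label series, and PromQL computes the total:

```
# Exported series (no precomputed "total" label value):
my_metric{label="a"} 1
my_metric{label="b"} 6

# Total computed at query time:
sum(my_metric)
```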
Name Metrics and Labels
Good names are self-descriptive. Metric names should match the pattern [a-zA-Z_:][a-zA-Z0-9_:]*, carry an application or domain prefix, and end with a unit suffix using base units such as seconds or bytes. Examples:
prometheus_notifications_total
process_cpu_seconds_total
ipamd_request_latency
http_request_duration_seconds
node_memory_usage_bytes
Label names should reflect the chosen dimension, for example region: shenzhen/guangzhou/beijing, owner: user1/user2, or stage: extract/transform/load.
Bucket Selection for Histograms
Appropriate buckets improve percentile accuracy; ideally, observations fall roughly evenly across buckets. Start from the client defaults ({0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}) or exponential buckets for long-tailed latency distributions, then adjust boundaries based on observed data: widen them in sparse regions and add finer boundaries where observations cluster, especially around critical percentiles such as the 90th.
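To see why boundary placement matters, here is a simplified sketch of the linear interpolation that PromQL's histogram_quantile() performs (the real implementation also handles the +Inf bucket and other edge cases); the bucket data below is made up:

```python
import bisect

def quantile_from_buckets(q, bounds, cumulative):
    """Estimate the q-quantile from ascending bucket upper bounds and
    cumulative counts, assuming observations are spread evenly in a bucket."""
    rank = q * cumulative[-1]
    i = bisect.bisect_left(cumulative, rank)
    lo = bounds[i - 1] if i > 0 else 0.0
    prev = cumulative[i - 1] if i > 0 else 0
    # Linear interpolation inside the bucket that contains the rank.
    return lo + (bounds[i] - lo) * (rank - prev) / (cumulative[i] - prev)

bounds = [0.05, 0.1, 0.25, 0.5, 1.0]   # upper bounds in seconds
cumulative = [10, 60, 90, 98, 100]     # observations <= each bound
p90 = quantile_from_buckets(0.90, bounds, cumulative)  # ~0.25
p50 = quantile_from_buckets(0.50, bounds, cumulative)  # ~0.09
```

Because the estimate is interpolated within a single bucket, a percentile that falls inside a wide bucket can be off by up to that bucket's width; this is why boundaries should be densest around the percentiles you actually monitor.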
As a concrete example of the label anti-pattern described earlier, the following series mix per-label values with a precomputed total:

my_metric{label="a"} 1
my_metric{label="b"} 6
my_metric{label="total"} 7

Grafana Usage Tips
View All Dimensions
To discover additional grouping dimensions, keep only the metric name in the query expression and leave the legend format empty; the raw metric data will be displayed.
Ruler Linking
In the Settings panel, change the Graph Tooltip to “Shared crosshair” or “Shared Tooltip” to enable synchronized ruler display across multiple metrics, facilitating correlation analysis.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and hope to accompany you through your operations career as we grow together.