Mastering Prometheus Metrics: Practical Best‑Practice Guide for Effective Monitoring
This guide explains how to design and implement Prometheus metrics for application monitoring, covering the selection of monitoring targets, the four golden metrics, system‑specific metric groups, vector and label choices, naming conventions, histogram bucket design, and useful Grafana visualization tips.
Introduction
This article introduces practical ways to use Prometheus for application monitoring, summarizing metrics practices drawn from the official documentation and hands-on experience.
Determining Monitoring Objects
Before defining metrics, identify what needs to be measured based on the problem background and system requirements. Google's four golden signals (called the four golden metrics in this guide), namely latency, traffic, errors, and saturation, serve as a reference for most monitoring scenarios.
Latency: the time it takes to serve a request.
Traffic: system throughput, useful for capacity planning.
Errors: the rate of requests that fail.
Saturation: how close the service is to the resource limit that most affects its health (e.g., memory usage).
These metrics address four monitoring needs: user experience, system throughput, fault detection, and resource saturation.
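As a rough illustration of how the four signals relate to raw data, the sketch below summarizes one observation window of request records. The `Request` record, the `golden_signals` function, and the choice of memory as the saturation resource are all hypothetical names chosen for this example; in a real deployment you would more likely export counters and histograms and compute these values in PromQL.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Request:
    """One handled request (hypothetical record shape for this sketch)."""
    duration_seconds: float
    is_error: bool


def golden_signals(requests: Sequence[Request], window_seconds: float,
                   memory_used_bytes: int, memory_limit_bytes: int) -> Dict[str, float]:
    """Summarize one observation window into the four golden signals."""
    total = len(requests)
    errors = sum(1 for r in requests if r.is_error)
    durations = sorted(r.duration_seconds for r in requests)
    # Nearest-rank 90th percentile, computed with integer math: ceil(0.9 * total).
    rank = (9 * total + 9) // 10
    p90 = durations[rank - 1] if total else 0.0
    return {
        "latency_p90_seconds": p90,                      # user experience
        "traffic_rps": total / window_seconds,           # throughput
        "error_ratio": errors / total if total else 0.0, # fault detection
        "saturation_memory_ratio": memory_used_bytes / memory_limit_bytes,
    }
```

Each key in the returned dictionary maps to one of the four needs listed above, which makes it easy to check that no signal has been left unmeasured.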
System‑Specific Metric Groups
The official Prometheus instrumentation best practices categorize applications into three types, each with typical metrics:
Online‑serving systems (e.g., web servers): request count, error count, request latency.
Offline processing systems (e.g., Spark): job start time, number of running jobs, items emitted, queue length.
Batch jobs (e.g., MapReduce): time of the last successful run, stage durations, total processing time, record count.
Additional sub‑systems may also be monitored, such as libraries (call count, success/failure, latency), logging (log entry frequency), failures (error count), thread pools (queued requests, active threads, total threads), and caches (request count, hit count, total latency).
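To make the cache case concrete, here is a minimal sketch of a cache wrapper that tracks the three metrics mentioned above (request count, hit count, total latency). `InstrumentedCache` and its attribute names are hypothetical; a real service would expose these values through a Prometheus client library rather than plain attributes.

```python
import time
from typing import Callable, Dict, Hashable


class InstrumentedCache:
    """Dict-backed cache that tracks request count, hit count, and total latency."""

    def __init__(self, loader: Callable[[Hashable], object]) -> None:
        self._loader = loader          # called on a miss to produce the value
        self._data: Dict[Hashable, object] = {}
        self.requests_total = 0
        self.hits_total = 0
        self.latency_seconds_total = 0.0

    def get(self, key: Hashable) -> object:
        start = time.perf_counter()
        self.requests_total += 1
        try:
            if key in self._data:
                self.hits_total += 1
            else:
                self._data[key] = self._loader(key)
            return self._data[key]
        finally:
            # Latency is recorded for hits, misses, and loader failures alike.
            self.latency_seconds_total += time.perf_counter() - start

    def hit_ratio(self) -> float:
        """Derived value; with Prometheus this would be computed at query time."""
        if self.requests_total == 0:
            return 0.0
        return self.hits_total / self.requests_total
```

Only the raw counts are stored. The hit ratio and the average latency are derived from them, which mirrors how counters are usually combined at query time.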
Choosing Vectors
Group metrics into vectors when data types are similar but resources differ, ensuring uniform units within a vector. Examples include measuring request latency across different resources, regions, or HTTP error types.
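The idea of a vector can be sketched as a family of counters that share one name, one unit, and one fixed set of label names. `CounterVec` below is a hypothetical stand-in written for this example, not the API of any client library, and the label values are made up.

```python
from typing import Dict, Tuple


class CounterVec:
    """A family of counters sharing one name, one unit, and one label set."""

    def __init__(self, name: str, label_names: Tuple[str, ...]) -> None:
        self.name = name
        self.label_names = label_names
        self._values: Dict[Tuple[str, ...], float] = {}

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        # Every sample must carry exactly the declared label names.
        if set(labels) != set(self.label_names):
            raise ValueError(
                f"expected labels {self.label_names}, got {tuple(labels)}")
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] = self._values.get(key, 0.0) + amount

    def samples(self) -> Dict[Tuple[str, ...], float]:
        """Return a copy of the current value for each label combination."""
        return dict(self._values)


# Usage: one vector for HTTP errors, split by region and status code.
http_errors = CounterVec("http_errors_total", ("region", "code"))
http_errors.inc(region="shenzhen", code="500")
http_errors.inc(2, region="shenzhen", code="500")
http_errors.inc(region="beijing", code="404")
```

Because every series in the vector counts the same thing in the same unit, the values can be summed or compared across regions and codes without conversion.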
Determining Labels
Common label dimensions include resource, region, type, etc. Labels must be additive and comparable; mixing totals with component values in the same label is discouraged. Example of a bad practice:
my_metric{label=a} 1
my_metric{label=b} 6
my_metric{label=total} 7
Instead, aggregate totals server‑side with PromQL or use a separate metric for totals.
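The reason the total label is harmful shows up as soon as the series are aggregated. The sketch below uses the sample values from the example above; `sum_over_label` is a hypothetical helper that mimics what a PromQL `sum(my_metric)` does across all series of the metric.

```python
from typing import Dict


def sum_over_label(samples: Dict[str, float]) -> float:
    """Add up every series of the metric, as a sum() aggregation would."""
    return sum(samples.values())


# Bad: the total is stored as if it were one more label value.
samples_with_total = {"a": 1, "b": 6, "total": 7}
# Good: only the components are stored; the total is derived on demand.
samples_components_only = {"a": 1, "b": 6}

double_counted = sum_over_label(samples_with_total)    # 14, twice the real total
correct_total = sum_over_label(samples_components_only)  # 7
```

With the total mixed in, every aggregation has to remember to exclude that one label value, and any query that forgets silently reports twice the true figure.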
Naming Metrics and Labels
Metric naming guidelines:
Match the pattern [a-zA-Z_:][a-zA-Z0-9_:]* (colons are reserved for recording rules).
Include a domain‑specific prefix (e.g., prometheus_notifications_total, process_cpu_seconds_total, ipamd_request_latency).
Append a unit suffix (e.g., http_request_duration_seconds, node_memory_usage_bytes, http_requests_total for unit‑less counters).
Reflect the measured variable’s meaning.
Prefer base units (seconds, bytes) over derived ones.
Label naming guidelines follow the chosen dimension, such as region=shenzhen, owner=user1, stage=extract.
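The naming rules above lend themselves to a small automated check. The sketch below validates names against the Prometheus name patterns and applies two heuristics for unit suffixes. The function names, the finding codes, and both suffix lists are assumptions made for this example and are far from exhaustive.

```python
import re
from typing import List

# Name patterns from the Prometheus data model.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

# Assumed, non-exhaustive suffix lists used only by this sketch.
ACCEPTED_SUFFIXES = ("_seconds", "_bytes", "_ratio", "_total", "_info")
NON_BASE_SUFFIXES = ("_milliseconds", "_ms", "_microseconds",
                     "_kilobytes", "_megabytes", "_kb", "_mb")


def lint_metric_name(name: str) -> List[str]:
    """Return a list of finding codes; an empty list means no complaints."""
    if not METRIC_NAME_RE.fullmatch(name):
        return ["invalid_name"]
    findings = []
    if ":" in name:
        # Colons are reserved for recording rules.
        findings.append("colon_reserved")
    if name.endswith(NON_BASE_SUFFIXES):
        findings.append("non_base_unit")
    elif not name.endswith(ACCEPTED_SUFFIXES):
        findings.append("missing_unit_suffix")
    return findings


def is_valid_label_name(name: str) -> bool:
    """Label names must match the pattern and must not start with '__'."""
    return bool(LABEL_NAME_RE.fullmatch(name)) and not name.startswith("__")
```

Under these heuristics, http_request_duration_seconds and http_requests_total pass cleanly, while ipamd_request_latency is flagged for lacking a unit suffix, which is a reminder that a good prefix alone does not make a complete name.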
Bucket Selection for Histograms
Appropriate bucket boundaries improve percentile accuracy. Ideally, each bucket contains a similar number of samples, forming a step‑wise distribution. Recommendations:
Start with the default buckets ({0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}) or powers‑of‑two buckets, then refine based on observed data.
Use narrower intervals where data is dense, wider intervals where sparse.
For latency with long‑tail characteristics, exponential buckets work well.
The upper bound of the first bucket should cover roughly the lowest 10% of observations; it can be set higher if precision at the head of the distribution is not important.
To compute specific percentiles (e.g., 90%), add extra buckets around the target percentile.
In practice, adjust buckets iteratively after observing real‑world metrics.
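The effect of bucket placement on percentile accuracy can be sketched with a small simulation. The code below generates exponential buckets and estimates a quantile by linear interpolation inside the bucket that contains the target rank, in the spirit of PromQL's histogram_quantile. All three function names and the sample latencies are assumptions made for this example.

```python
import bisect
import math
from typing import List, Sequence


def exponential_buckets(start: float, factor: float, count: int) -> List[float]:
    """Upper bounds start, start*factor, start*factor**2, ... (count values)."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("need start > 0, factor > 1, count >= 1")
    return [start * factor ** i for i in range(count)]


def cumulative_counts(observations: Sequence[float],
                      upper_bounds: Sequence[float]) -> List[int]:
    """Observations <= each upper bound, plus a final +Inf bucket."""
    ordered = sorted(observations)
    counts = [bisect.bisect_right(ordered, ub) for ub in upper_bounds]
    counts.append(len(ordered))
    return counts


def estimate_quantile(q: float, upper_bounds: Sequence[float],
                      counts: Sequence[int]) -> float:
    """Interpolate linearly inside the bucket that contains rank q * total."""
    total = counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, ub in enumerate(upper_bounds):
        if counts[i] >= rank:
            lower = upper_bounds[i - 1] if i > 0 else 0.0
            below = counts[i - 1] if i > 0 else 0
            in_bucket = counts[i] - below
            if in_bucket == 0:
                return lower
            return lower + (ub - lower) * (rank - below) / in_bucket
    # Rank falls in the +Inf bucket: report the highest finite bound.
    return upper_bounds[-1]


# 80 fast requests and 20 slow ones; the true 90th percentile is 0.9 s.
latencies = [0.12] * 80 + [0.9] * 20
coarse = [0.25, 1.0]
refined = [0.25, 0.8, 1.0]  # one extra bucket near the target percentile
p90_coarse = estimate_quantile(0.9, coarse, cumulative_counts(latencies, coarse))
p90_refined = estimate_quantile(0.9, refined, cumulative_counts(latencies, refined))
# p90_coarse is about 0.625, p90_refined is about 0.9
```

With only two buckets, the estimate is pulled toward the middle of a wide interval. Adding a single boundary near the expected 90th percentile brings the estimate close to the true value, which is the reasoning behind the recommendation to add buckets around the target percentile.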
Grafana Usage Tips
Viewing all dimensions: Query only the metric name without aggregation and leave the legend format empty; Grafana will list all label combinations.
Scale linking: In the dashboard settings, change the Graph Tooltip option to “Shared crosshair” or “Shared Tooltip” to link the hover position across panels, which aids correlation analysis.
With “Shared Tooltip” enabled, hovering over one graph moves the ruler on every other graph to the same timestamp, which simplifies issue diagnosis.
ITPUB
