Designing Effective Metrics: From Requirements to Labels and Buckets
This guide explains how to define, name, and organize monitoring metrics—covering Google’s four golden signals, system‑specific measurement objects, vector selection, label conventions, bucket design, and practical Grafana tips—for reliable observability of diverse services.
Before designing metrics, clearly identify what needs to be measured based on the problem context, requirements, and the system being monitored.
From Requirements
Google’s experience with large‑scale distributed monitoring yields four golden signals that are broadly applicable:
Latency: the time a service request takes.
Traffic: the current demand placed on the system, used to gauge capacity needs.
Errors: the rate of failed requests in the system.
Saturation: the degree to which a critical resource (e.g., memory) limits the service.
These metrics satisfy four monitoring goals:
Reflect user experience and core performance (e.g., online latency, batch job completion time).
Reflect system throughput (e.g., request count, network packet volume).
Help discover and locate faults (e.g., error counts, failure rates).
Show system saturation and load (e.g., memory usage, queue length).
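The four signals can be sketched with a minimal in‑process recorder. This is plain Python standing in for a real client library such as prometheus_client; the class and field names are illustrative, not an established API:

```python
class GoldenSignals:
    """Toy recorder for the four golden signals (illustrative only)."""

    def __init__(self):
        self.latencies = []   # Latency: per-request durations, in seconds
        self.requests = 0     # Traffic: total requests served
        self.errors = 0       # Errors: failed requests
        self.queue_depth = 0  # Saturation: e.g., items queued for a scarce resource

    def observe(self, duration_s, ok=True):
        """Record one request's outcome and duration."""
        self.requests += 1
        self.latencies.append(duration_s)
        if not ok:
            self.errors += 1

    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


signals = GoldenSignals()
signals.observe(0.120)            # a successful request taking 120 ms
signals.observe(0.450, ok=False)  # a failed request taking 450 ms
print(signals.requests, signals.errors, signals.error_rate())  # 2 1 0.5
```

In a real system each of these fields would be a counter, gauge, or histogram exported for scraping rather than an in‑memory attribute.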
From the Monitored System
Different types of applications require different measurement objects. Official best‑practice documentation classifies applications into three categories:
Online‑serving systems: require immediate responses (e.g., web servers).
Offline processing systems: jobs run for a long time without the caller waiting (e.g., Spark).
Batch jobs: one‑off tasks that finish and exit (e.g., MapReduce data analysis).
Typical measurement objects per category are:
Online services: request count, error count, request latency.
Offline processing: job start time, number of active jobs, items emitted, queue length.
Batch jobs: completion timestamp, stage execution times, total duration, records processed.
In addition to the main system, sub‑systems may also be monitored:
Libraries: call count, successes, failures, latency.
Logging: count of log entries to determine frequency and timing.
Failures: error counts.
Thread pools: queued requests, active threads, total threads, latency, tasks in progress.
Caches: request count, hits, total latency.
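As one concrete case, the cache measurement objects above (request count, hits, total latency) can be wired into a toy cache wrapper. This is a stdlib‑only sketch; the class name and `loader` callback are assumptions for illustration:

```python
import time


class InstrumentedCache:
    """Toy cache exposing request count, hit count, and total latency."""

    def __init__(self):
        self.store = {}
        self.requests = 0          # total lookups
        self.hits = 0              # lookups served from the store
        self.total_latency_s = 0.0  # cumulative lookup time, base unit: seconds

    def get(self, key, loader):
        """Return the cached value, loading and storing it on a miss."""
        self.requests += 1
        start = time.perf_counter()
        if key in self.store:
            self.hits += 1
            value = self.store[key]
        else:
            value = self.store[key] = loader(key)
        self.total_latency_s += time.perf_counter() - start
        return value


cache = InstrumentedCache()
cache.get("a", lambda k: k.upper())  # miss: loader runs
cache.get("a", lambda k: k.upper())  # hit: served from the store
print(cache.requests, cache.hits)    # 2 1
```

From these three counters you can derive hit ratio and average lookup latency at query time instead of exporting them directly.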
Choosing Vectors
Guidelines for grouping related series into a vector (a set of related metrics):
The data are of the same type but differ in resource, collection location, or a similar dimension.
All series in the vector share the same unit.
Examples include latency per resource object, latency per region, or error counts per HTTP request type.
Official documentation recommends using separate metrics for distinct operations (e.g., Read vs. Write) rather than combining them, and to differentiate actions with labels.
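The distinction can be sketched as follows: one metric per operation, with labels reserved for comparable dimensions such as region. This uses `collections.Counter` as a stand‑in for real counters, and the metric names are hypothetical:

```python
from collections import Counter

# Separate metrics for distinct operations (read vs. write), rather than one
# combined metric with an "operation" label mixing incomparable data.
storage_read_requests_total = Counter()
storage_write_requests_total = Counter()

# The "region" label partitions comparable, additive data within each metric.
storage_read_requests_total["region=shenzhen"] += 1
storage_read_requests_total["region=beijing"] += 1
storage_write_requests_total["region=shenzhen"] += 1

# Summing across label values is meaningful because the series are additive.
print(sum(storage_read_requests_total.values()),
      sum(storage_write_requests_total.values()))  # 2 1
```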
Determining Labels
Common label choices include:
resource
region
type
A key principle is that data for a given label dimension must be additive and comparable; units must be consistent. Avoid mixing partial and total counts in the same label (e.g., my_metric{label=a} 1, my_metric{label=total} 7). Use server‑side aggregation (PromQL) or separate metrics for totals.
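The additivity rule can be illustrated with the counts from the example above: store only partial counts per label value and derive the total by aggregation (the in‑process analogue of PromQL's `sum(my_metric)`), never as an extra `{label="total"}` series:

```python
# Partial counts only: one entry per real label value, no "total" entry.
my_metric = {"a": 1, "b": 2, "c": 4}

# The total is computed at query time by aggregating over the label,
# the way sum(my_metric) would in PromQL.
total = sum(my_metric.values())
print(total)  # 7
```

If a `total` series were stored alongside the partials, `sum(my_metric)` would double‑count and return 14 instead of 7.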
Naming Metrics and Labels
Metric Naming
Follow the naming pattern [a-zA-Z_:][a-zA-Z0-9_:]* (letters, digits, underscores, and colons, not starting with a digit).
Include a prefix indicating the domain, such as
prometheus_notifications_total,
process_cpu_seconds_total, or
ipamd_request_latency.
Append a unit suffix to indicate the metric’s unit, e.g.,
http_request_duration_seconds,
node_memory_usage_bytes, or
http_requests_total for a unit‑less counter.
Make the name logically reflect the measured variable.
Prefer base units (seconds, bytes) over derived ones (milliseconds, megabytes).
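A name can be checked against the Prometheus metric‑name pattern with a small regular expression; the helper function here is illustrative:

```python
import re

# Prometheus metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the name satisfies the metric-name pattern."""
    return bool(METRIC_NAME_RE.match(name))


print(is_valid_metric_name("http_request_duration_seconds"))  # True
print(is_valid_metric_name("2xx_responses_total"))            # False: starts with a digit
```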
Label Naming
Label names should reflect the chosen dimension, for example:
region: shenzhen/guangzhou/beijing
owner: user1/user2/user3
stage: extract/transform/load
Bucket Selection
Appropriate histogram buckets improve percentile calculations. Ideal buckets produce roughly equal counts per bucket. Guidelines:
Know the approximate data distribution; if unknown, start with the default buckets ({0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}) or exponential buckets ({1, 2, 4, 8, …}) and adjust after observing real data.
Use narrower intervals where data is dense, wider intervals where it is sparse.
For latency data with long‑tail characteristics, exponential buckets are often suitable.
The first bucket's upper bound should cover roughly the lowest 10% of the data; if the head of the distribution is not critical, a larger first upper bound is acceptable.
To compute a specific percentile (e.g., 90%), add finer buckets around the 90% point.
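The steps above can be sketched in a few lines: generate exponential buckets, then assign each observation to the first bucket whose upper bound covers it (Prometheus histograms are cumulative, but per‑bucket counts are easier to read here). The helper names are illustrative:

```python
import bisect


def exponential_buckets(start, factor, count):
    """Upper bounds start, start*factor, ... — e.g. {1, 2, 4, 8, ...}."""
    return [start * factor ** i for i in range(count)]


buckets = exponential_buckets(1, 2, 6)  # [1, 2, 4, 8, 16, 32]
counts = [0] * (len(buckets) + 1)       # last slot is the +Inf overflow bucket

# Each value lands in the first bucket whose upper bound is >= the value
# (bisect_left gives that index for <=-style bucket bounds).
for value in [0.5, 1.5, 3, 3, 7, 20, 40]:
    counts[bisect.bisect_left(buckets, value)] += 1

print(buckets)  # [1, 2, 4, 8, 16, 32]
print(counts)   # [1, 1, 2, 1, 0, 1, 1]
```

If most observations pile into one or two buckets, split those bounds more finely; if many buckets stay empty, widen or drop them.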
In practice, I selected bucket ranges based on observed task durations, deployed them, and refined the buckets after monitoring the results.
Grafana Tips
Viewing All Dimensions
To discover additional grouping dimensions, query only the metric name without any functions and leave the Legend format field empty. Grafana then displays every label set in the raw series.
Ruler Linking
In the dashboard's Settings, adjust the Graph Tooltip option (default is "Default"). Switching it to "Shared crosshair" or "Shared Tooltip" links the crosshair across panels, making it easier to correlate two metrics during troubleshooting.