Mastering Prometheus: From Metrics Basics to High‑Availability Monitoring
This article shares practical experiences of using Prometheus for monitoring complex services, covering metric types, PromQL query techniques, naming conventions, service discovery with file‑based configs, high‑availability sharding, alerting via Alertmanager, and visualisation with Grafana, providing actionable guidance for reliable observability.
Introduction
Recent projects required robust monitoring of API request counts, latency, storage IOPS, node status, and offsets. After reading Google's Site Reliability Engineering book, the team tried Prometheus as a unified monitoring solution.
Agenda
Prometheus basics
Prometheus instrumentation and query tips
High‑availability and service discovery experience
Prometheus Characteristics
Multi‑dimensional data model (metric name + label set)
Powerful query language (PromQL)
Flexible storage (local on-disk storage, or remote back-ends such as OpenTSDB and InfluxDB)
Pull‑based HTTP data collection
Service discovery or static target configuration
Rich visualization via Grafana
Metric Types
Counter: a monotonically increasing value (e.g., request counts)
Gauge: an instantaneous value that can go up or down (e.g., online users)
Histogram: observations counted into configurable buckets, together with a running sum and count; useful for latency or size distributions
Summary: similar to Histogram, but quantiles are calculated on the client and exposed directly
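The difference between the types is easiest to see in what each one stores. The following is a minimal sketch in plain Python, not the Prometheus client library: the class names mirror the metric types, while the bucket bounds and observed values are made up for illustration. Summary is omitted because client-side quantile estimation needs more machinery than fits here.

```python
import math


class Counter:
    """Monotonically increasing value; it never decreases."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Instantaneous value that may go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Counts observations in cumulative buckets and tracks sum and count."""

    def __init__(self, upper_bounds):
        # The +Inf bucket always exists and ends up equal to the total count.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # Buckets are cumulative: an observation increments every bucket
        # whose upper bound is >= the observed value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1


requests = Counter()
latency = Histogram([0.1, 0.5, 1.0])  # hypothetical bucket bounds, in seconds
for seconds in (0.05, 0.3, 0.7, 2.0):  # hypothetical observations
    requests.inc()
    latency.observe(seconds)

print(requests.value, latency.bucket_counts, latency.count)
```

Dividing the histogram's sum by its count gives the average observation, which is the idea behind queries that divide a `_sum` series by a `_count` series.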
Metric Format and Example
Each time‑series consists of a metric name, a set of labels, and a float64 value. Standard format: <metric_name>{label1="value1",...} value
<code>rpc_invoke_cnt_c{code="0",method="Session.GenToken",job="Center"} 5
rpc_invoke_cnt_c{code="0",method="Relation.GetUserInfo",job="Center"} 12
rpc_invoke_cnt_c{code="0",method="Message.SendGroupMsg",job="Center"} 12
rpc_invoke_cnt_c{code="4",method="Message.SendGroupMsg",job="Center"} 3
rpc_invoke_cnt_c{code="0",method="Tracker.Tracker.Get",job="Center"} 70</code>
These samples record RPC invocation counts, distinguished by the labels code, method, and job.
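As a rough illustration of the format, the sketch below renders one sample line from a metric name, a label dictionary, and a value. The helper name `format_sample` is hypothetical; in practice a client library produces this output, and the sketch ignores optional parts such as timestamps and HELP/TYPE lines.

```python
def format_sample(name, labels, value):
    """Render one sample as <metric_name>{label="value",...} value."""

    def escape(text):
        # Label values escape backslash, double quote, and newline.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if not labels:
        return f"{name} {value}"
    body = ",".join(f'{key}="{escape(str(val))}"' for key, val in labels.items())
    return f"{name}{{{body}}} {value}"


line = format_sample(
    "rpc_invoke_cnt_c",
    {"code": "0", "method": "Session.GenToken", "job": "Center"},
    5,
)
print(line)
```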
PromQL
PromQL is Prometheus’ own query language, offering rich expressions, functions, and aggregations.
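Much of day-to-day PromQL on counters revolves around rate(), so it helps to see roughly what it computes. The sketch below is a simplified approximation in Python, not Prometheus' implementation: it sums the increases between consecutive samples, treats a drop as a counter reset, and divides by the elapsed time. Unlike the real rate(), it does not extrapolate to the window boundaries. The sample data is invented.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplified: handles counter resets but does not extrapolate to the
    edges of the window the way PromQL's rate() does.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a process restart),
        # so the new value is counted from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical one-minute window scraped every 15 s, with a reset at t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))
```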
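Per-second call rate of a single method over the last minute:
<code>rate(rpc_invoke_cnt_c{method="Relation.GetUserInfo",job="Center"}[1m])</code>
Per-second rate of calls that returned a non-zero code, grouped by method and code:
<code>sum by (method, code)(rate(rpc_invoke_cnt_c{job="Center",code!="0"}[1m]))</code>
Average call duration over the last minute (histogram sum divided by histogram count):
<code>rate(rpc_invoke_time_h_sum{job="Center"}[1m]) / rate(rpc_invoke_time_h_count{job="Center"}[1m])</code>
Naming Conventions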
Prefer generic metric names and use labels to differentiate components. For example, use rpc_invoke_cnt_c for all services and set the job label to Center, Gateway, or Message. This reduces the number of distinct metric names and simplifies queries.
Service Discovery
Initially each project deployed an independent Prometheus instance. As the number of servers grew, file‑based service discovery was adopted. A JSON file lists targets and associated labels, and Prometheus watches the file for changes.
<code>[
  {
    "targets": ["10.10.10.1:65160", "10.10.10.2:65160"],
    "labels": {"job": "Center", "service": "qtest"}
  },
  {
    "targets": ["10.10.10.3:65110", "10.10.10.4:65110"],
    "labels": {"job": "Gateway", "service": "qtest"}
  }
]</code>
The Prometheus configuration points file_sd_configs at these JSON files.
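Because Prometheus re-reads these files when they change, whatever generates them should avoid leaving a half-written file on disk. The following is a minimal sketch of one way to do that, assuming the generator runs on the Prometheus host: write to a temporary file in the same directory, then rename it into place. The function name `write_target_groups` and the file name `qtest.json` are hypothetical.

```python
import json
import os
import tempfile


def write_target_groups(path, groups):
    """Atomically write a list of (targets, labels) pairs as a file_sd JSON file."""
    payload = [
        {"targets": sorted(targets), "labels": labels}
        for targets, labels in groups
    ]
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays on
    # one filesystem; the .tmp suffix keeps it out of a *.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


# Demonstration in a throwaway directory instead of a real Prometheus path.
with tempfile.TemporaryDirectory() as sd_dir:
    target_file = os.path.join(sd_dir, "qtest.json")
    write_target_groups(
        target_file,
        [
            (["10.10.10.1:65160", "10.10.10.2:65160"], {"job": "Center", "service": "qtest"}),
            (["10.10.10.3:65110", "10.10.10.4:65110"], {"job": "Gateway", "service": "qtest"}),
        ],
    )
    with open(target_file) as handle:
        loaded = json.load(handle)

print(loaded[0]["labels"]["job"], len(loaded))
```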
High Availability
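Sharding and federation can help with HA, but the edge nodes and the global node each remain single points of failure. The example shard configuration below uses the hashmod relabel action with a modulus of 3 to distribute targets across three Prometheus nodes; the node shown keeps only the targets whose hash is 0.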
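The idea behind hash-based sharding can be sketched in a few lines of Python: hash each target address, take the result modulo the number of shards, and let every node keep only its own remainder. The sketch assumes an MD5-based hash for illustration, so the shard numbers it prints are not guaranteed to match what Prometheus computes for the same address. The function names are hypothetical.

```python
import hashlib


def shard_for(address, modulus):
    """Map a target address to a shard index in [0, modulus)."""
    digest = hashlib.md5(address.encode("utf-8")).digest()
    # Interpret the last 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_shard(addresses, modulus, shard):
    """Keep only the targets this shard is responsible for."""
    return [a for a in addresses if shard_for(a, modulus) == shard]


addresses = [
    "10.10.10.1:65160",
    "10.10.10.2:65160",
    "10.10.10.3:65110",
    "10.10.10.4:65110",
]
assignment = {shard: targets_for_shard(addresses, 3, shard) for shard in range(3)}
for shard, targets in assignment.items():
    print(shard, targets)
```

Each address lands on exactly one shard, and the mapping is stable as long as the address and the modulus stay the same. Changing the modulus reshuffles most targets.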
<code>global:
  external_labels:
    slave: 0
scrape_configs:
  - job_name: myjob
    file_sd_configs:
      - files: ['/usr/local/prometheus/qtestgroups/*.json']
    relabel_configs:
      # Hash each target address into one of three buckets (0, 1, 2).
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      # This node scrapes only the targets that fall into bucket 0.
      - source_labels: [__tmp_hash]
        regex: ^0$
        action: keep
</code>
Alerting
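Prometheus evaluates alerting rules and sends firing alerts to Alertmanager, which handles grouping, deduplication, and routing to receivers. A simple rule, written in the Prometheus 1.x rule syntax, that fires when Ceph slow requests stay above ten for one minute looks like: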
<code>ALERT SlowRequest
  IF ceph_slow_requests{service="ceph"} > 10
  FOR 1m
  LABELS { qalarm = "true" }
  ANNOTATIONS {
    summary = "Ceph Slow Requests",
    description = "slow requests count: {{ $value }} - Region:{{ $labels.group }}"
  }
</code>
Visualization
Grafana dashboards display the collected metrics. Template variables (e.g., label values) are substituted into panel queries, so a single dashboard can serve multiple services.
Q&A
Typical questions cover Alertmanager usage, exporter development with the Prometheus Go SDK, and how to delete stale job series via the Prometheus HTTP API.
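For the last of these questions, the sketch below builds a series-deletion request without sending it. It assumes a Prometheus 2.x server started with --web.enable-admin-api, where deletion is a POST to /api/v1/admin/tsdb/delete_series; Prometheus 1.x instead used DELETE on /api/v1/series. The base URL and the helper name `build_delete_request` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_delete_request(base_url, matchers):
    """Build (but do not send) a delete_series request for the given matchers."""
    # Each series selector is passed as a repeated match[] query parameter.
    query = urlencode([("match[]", m) for m in matchers])
    url = f"{base_url.rstrip('/')}/api/v1/admin/tsdb/delete_series?{query}"
    return Request(url, method="POST")


request = build_delete_request("http://localhost:9090", ['{job="Center"}'])
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would actually delete the matching series, so it is worth printing and checking the selector first.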
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.