
Mastering Prometheus: From Metrics Basics to High‑Availability Monitoring

This article shares practical experience with using Prometheus to monitor complex services. It covers metric types, PromQL query techniques, naming conventions, file‑based service discovery, high‑availability sharding, alerting with Alertmanager, and visualisation with Grafana.

360 Zhihui Cloud Developer

Introduction

Recent projects required robust monitoring of API request counts, latency, storage IOPS, node status, and offsets. After reading Google's Site Reliability Engineering book, the team tried Prometheus as a unified monitoring solution.

Agenda

Prometheus basics

Prometheus instrumentation and query tips

High‑availability and service discovery experience

Prometheus Characteristics

Multi‑dimensional data model (metric name + label set)

Powerful query language (PromQL)

Local on‑disk storage, with optional remote back‑ends such as OpenTSDB or InfluxDB

Pull‑based HTTP data collection

Service discovery or static target configuration

Rich visualization via Grafana

Metric Types

Counter: monotonically increasing values that only reset to zero on restart (e.g., request counts)

Gauge: instantaneous values that can go up or down (e.g., online users)

Histogram: counts observations in configurable buckets and also tracks their sum and count; useful for latency or size distributions

Summary: similar to Histogram, but computes quantiles on the client and exposes them directly
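
To make the Histogram type more concrete, here is a minimal Python sketch of how observations accumulate into cumulative buckets alongside a running sum and count. The class name SketchHistogram and the bucket bounds are invented for illustration; this is not the Prometheus client library, only a model of the idea.

```python
import bisect
import math


class SketchHistogram:
    """Toy histogram with cumulative buckets, loosely modelled on Prometheus."""

    def __init__(self, upper_bounds):
        # Always end with a +Inf bucket so every observation lands somewhere.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total = 0.0  # would be exposed as <name>_sum
        self.count = 0    # would be exposed as <name>_count

    def observe(self, value):
        # Find the first bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        # Each bucket reports its own observations plus all smaller buckets.
        running, out = 0, []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            running += n
            out.append((bound, running))
        return out


h = SketchHistogram([0.1, 0.5, 1.0])  # hypothetical latency bounds in seconds
for latency in (0.05, 0.2, 0.2, 0.7, 3.0):
    h.observe(latency)
print(h.cumulative())
```

Because the buckets are plain counters, they can be summed across instances before a quantile is estimated, which is not possible with the precomputed quantiles of a Summary.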

Metric Format and Example

Each time‑series consists of a metric name, a set of labels, and a float64 value. Standard format: <metric_name>{label1="value1",...} value

<code>rpc_invoke_cnt_c{code="0",method="Session.GenToken",job="Center"} 5
rpc_invoke_cnt_c{code="0",method="Relation.GetUserInfo",job="Center"} 12
rpc_invoke_cnt_c{code="0",method="Message.SendGroupMsg",job="Center"} 12
rpc_invoke_cnt_c{code="4",method="Message.SendGroupMsg",job="Center"} 3
rpc_invoke_cnt_c{code="0",method="Tracker.Tracker.Get",job="Center"} 70</code>

This records RPC invocation counts with the labels code, method, and job.
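
As a rough illustration of the format, the following Python sketch splits one such line into its metric name, label set, and value. It is a simplified parser written for this article's examples, not a full implementation of the exposition format: it assumes label values contain no escaped quotes and ignores the optional trailing timestamp. The function name parse_sample is hypothetical.

```python
import re

# Simplified pattern: metric name, optional {labels}, then a numeric value.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one sample line into (metric name, label dict, float value)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))


name, labels, value = parse_sample(
    'rpc_invoke_cnt_c{code="4",method="Message.SendGroupMsg",job="Center"} 3'
)
print(name, labels, value)
```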

PromQL

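PromQL is Prometheus’ own query language, offering rich expressions, functions, and aggregations. The three queries below compute, in order, the per‑second call rate of one method, the rate of failed calls grouped by method and code, and the average call latency (the rate of the histogram's sum divided by the rate of its count).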

<code>rate(rpc_invoke_cnt_c{method="Relation.GetUserInfo",job="Center"}[1m])</code>
<code>sum by (method, code)(rate(rpc_invoke_cnt_c{job="Center",code!="0"}[1m]))</code>
<code>rate(rpc_invoke_time_h_sum{job="Center"}[1m]) / rate(rpc_invoke_time_h_count{job="Center"}[1m])</code>
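
The arithmetic behind these queries can be sketched in a few lines of Python. This is a deliberately simplified model of rate(): it handles counter resets but skips the extrapolation to window boundaries that Prometheus performs, so real results will differ slightly. The sample values and the function name simple_rate are made up for illustration.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplification: Prometheus' rate() also extrapolates to the window edges.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two points
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a process restart):
        # count from zero instead of producing a negative delta.
        increase += cur if cur < prev else cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes 15 s apart inside a one-minute window.
time_sum = [(0, 10.0), (15, 11.5), (30, 13.0), (45, 16.0)]  # rpc_invoke_time_h_sum
time_count = [(0, 100), (15, 130), (30, 160), (45, 220)]    # rpc_invoke_time_h_count

# Average latency = seconds spent per second / calls made per second.
avg_latency = simple_rate(time_sum) / simple_rate(time_count)
print(avg_latency)
```

With these numbers, 6 seconds of total latency are spread over 120 calls, so the average comes out at roughly 0.05 seconds per call.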

Naming Conventions

Prefer generic metric names and use labels to differentiate components. For example, use rpc_invoke_cnt_c for all services and set the job label to Center, Gateway, or Message. This reduces the number of distinct metrics and simplifies queries.

Service Discovery

Initially each project deployed an independent Prometheus instance. As the number of servers grew, file‑based service discovery was adopted. A JSON file lists targets and associated labels, and Prometheus watches the file for changes.

<code>[
  {
    "targets": ["10.10.10.1:65160","10.10.10.2:65160"],
    "labels": {"job":"Center","service":"qtest"}
  },
  {
    "targets": ["10.10.10.3:65110","10.10.10.4:65110"],
    "labels": {"job":"Gateway","service":"qtest"}
  }
]</code>

The Prometheus configuration references these JSON files through file_sd_configs.
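
Since Prometheus re-reads the file whenever it changes, whatever generates it should avoid leaving a half-written file behind. The Python sketch below shows one way to do that, by writing to a temporary file and renaming it into place. The output path and the function name write_targets are assumptions made for this example; a real deployment would write into the directory that file_sd_configs watches.

```python
import json
import os
import tempfile


def write_targets(path, groups):
    """Write a target list atomically so a reader never sees half a file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the temp file out of a '*.json' glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(groups, f, indent=2)
        # mkstemp creates the file as 0600; relax it so another user can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise


groups = [
    {"targets": ["10.10.10.1:65160", "10.10.10.2:65160"],
     "labels": {"job": "Center", "service": "qtest"}},
    {"targets": ["10.10.10.3:65110", "10.10.10.4:65110"],
     "labels": {"job": "Gateway", "service": "qtest"}},
]

# Hypothetical output location used only for this demonstration.
out_path = os.path.join(tempfile.gettempdir(), "qtest_targets.json")
write_targets(out_path, groups)
```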

High Availability

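Sharding and federation can provide HA, but every edge node and the global node is still a single point of failure. The example shard configuration uses the hashmod relabel action with a modulus of 3 to distribute targets across three Prometheus nodes; each node keeps only the targets whose hash equals its own shard index (0 in this example).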

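<code>global:
  external_labels:
    slave: 0              # index of this shard; each node uses its own value
scrape_configs:
  - job_name: myjob
    file_sd_configs:
      - files: ['/usr/local/prometheus/qtestgroups/*.json']
    relabel_configs:
      # Hash each target address into one of three buckets (0, 1 or 2).
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      # Keep only the targets in bucket 0; the other nodes match 1 and 2.
      - source_labels: [__tmp_hash]
        regex: ^0$
        action: keep
</code>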
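
The effect of the hashmod and keep rules can be approximated in Python as follows. The hashing detail (MD5 of the address, low 8 bytes taken as an unsigned integer) reflects my understanding of the Prometheus implementation and should be treated as an assumption: check real shard assignments against a running Prometheus rather than this sketch. The function names are hypothetical.

```python
import hashlib


def shard_of(address, modulus):
    """Approximate hashmod: hash the label value, reduce modulo the shard count."""
    digest = hashlib.md5(address.encode("utf-8")).digest()
    # Assumption: the low 8 bytes of the digest are used as an unsigned integer.
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for(shard, targets, modulus=3):
    """Targets one node would keep, mirroring the keep rule on __tmp_hash."""
    return [t for t in targets if shard_of(t, modulus) == shard]


targets = ["10.10.10.1:65160", "10.10.10.2:65160",
           "10.10.10.3:65110", "10.10.10.4:65110"]
for shard in range(3):
    print(shard, targets_for(shard, targets))
```

The property that matters is that the assignment is deterministic and every target lands on exactly one shard, so the three nodes can share the same target files without coordinating.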

Alerting

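Prometheus evaluates the alerting rules and sends firing alerts to Alertmanager, which handles grouping, deduplication, and routing of notifications. A simple rule that fires when Ceph slow requests stay above ten for one minute looks like this (Prometheus 1.x rule syntax; version 2.0 and later define rules in YAML files):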

<code>ALERT SlowRequest
IF ceph_slow_requests{service="ceph"} > 10
FOR 1m
LABELS { qalarm = "true" }
ANNOTATIONS {
  summary = "Ceph Slow Requests",
  description = "slow requests count: {{ $value }} - Region:{{ $labels.group }}"
}
</code>
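
The FOR clause is the part that most often causes confusion: the alert is first pending and only fires once the condition has held for the whole duration. A small Python sketch of that state machine, using invented sample values and a hypothetical 30-second evaluation interval, might look like this:

```python
def alert_states(samples, threshold, for_seconds):
    """Yield (timestamp, state) for each evaluation of `value > threshold`.

    "pending" means the condition is true but has not yet held for
    for_seconds; "firing" means it has held continuously for that long.
    """
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            state = "firing" if ts - active_since >= for_seconds else "pending"
        else:
            active_since = None  # any false evaluation resets the timer
            state = "inactive"
        yield ts, state


# Hypothetical ceph_slow_requests values, one evaluation every 30 s.
samples = [(0, 4), (30, 12), (60, 15), (90, 11), (120, 8), (150, 20)]
states = list(alert_states(samples, threshold=10, for_seconds=60))
for ts, state in states:
    print(ts, state)
```

In this run the value first exceeds ten at t=30, so the alert is pending until t=90, fires there, and drops back to inactive as soon as one evaluation falls below the threshold.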

Visualization

Grafana dashboards display the collected metrics. Templates allow variables (e.g., label values) to be substituted, enabling a single dashboard to serve multiple services.
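
Grafana performs the variable substitution itself, but the idea can be mimicked in Python to show why one panel query is enough for several services. The sketch assumes a dashboard variable named job (a hypothetical choice) and reuses the metric from the earlier examples.

```python
from string import Template

# One panel query shared by every service; $job stands for the dashboard variable.
PANEL_QUERY = Template('sum by (method)(rate(rpc_invoke_cnt_c{job="$job"}[1m]))')

queries = [PANEL_QUERY.substitute(job=job) for job in ("Center", "Gateway", "Message")]
for q in queries:
    print(q)
```

This only works because the metric name is shared across services and the job label carries the difference, which is the naming convention described earlier.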

Q&A

Typical questions cover Alertmanager usage, exporter development with the Prometheus Go SDK, and how to delete stale job series via the Prometheus HTTP API.
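
For the last question, a hedged sketch of the deletion request in Python is shown below. It assumes Prometheus 2.x with the admin API enabled (the --web.enable-admin-api flag); older versions exposed a different endpoint, so this may not match the setup the talk described. The base URL and the function name are placeholders, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_delete_request(base_url, job):
    """Build (but do not send) a request that deletes all series of one job."""
    selector = f'{{job="{job}"}}'              # e.g. {job="Gateway"}
    query = urlencode({"match[]": selector})   # percent-encodes braces and quotes
    url = f"{base_url}/api/v1/admin/tsdb/delete_series?{query}"
    return Request(url, method="POST")


# Placeholder address; point this at the Prometheus server in question.
req = build_delete_request("http://localhost:9090", "Gateway")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Deletion is destructive, so it is worth running the same selector as a normal query first to confirm which series it matches.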

Prometheus UI example