
Mastering End‑to‑End Monitoring: From Purpose to Prometheus Implementation

This guide explains why monitoring is essential throughout a product lifecycle, outlines monitoring modes and methods, compares health checks, logs, tracing and metric solutions, and provides a detailed Prometheus‑based monitoring architecture with concrete metric definitions, alerting rules, and incident‑response procedures.

dbaplus Community

May 18, 2021

1. Purpose of Monitoring

Monitoring is one of the most critical parts of operations and of the entire product lifecycle. It provides early warning before failures, helps locate problems during incidents, and supplies data for post-mortem analysis.

It serves both technical teams (understanding environment status) and business teams (ensuring continuous service). The core objectives are:

24×7 real‑time monitoring

Timely feedback of system status

Guarantee platform stability

Ensure service safety and reliability

Support continuous business operation

2. Monitoring Modes

From top to bottom, monitoring is divided into three layers:

Business monitoring – predefined business metrics, growth rate, error rate, etc.

Application monitoring – probes (external checks) and introspection (internal metrics).

Operating‑system monitoring – CPU, load, memory, disk usage, etc.

3. Monitoring Methods

The main methods are:

Health checks: verify whether a service is alive.

Log analysis: use logs to locate and solve problems.

Tracing (call-chain monitoring): capture the full request path and latency.

Metric monitoring: collect time-series data for trend analysis and alerting.

Health checks are usually provided by cloud platforms, logs are collected by a log center, tracing has dedicated solutions (e.g., SkyWalking, Elastic APM), and metrics are gathered by exporters and processed by Prometheus.

4. Tool Selection

Health checks: configure directly in the cloud platform.

Logs: the mature open-source ELK stack (Elasticsearch, Logstash, Kibana).

Tracing: SkyWalking, Elastic APM, Pinpoint (Java/PHP only), Zipkin, and CAT. SkyWalking and Elastic APM are non-intrusive and support many languages, which makes them well suited to cloud-native environments.

Metrics: Zabbix is common in traditional environments; in cloud-native environments Prometheus is preferred for its strong community, single-binary deployment, pull model, powerful PromQL, rich ecosystem, and high performance.

5. Prometheus Monitoring System Overview

The overall architecture consists of:

Prometheus Server – scrapes metrics and stores time‑series data.

Exporter – exposes metrics for Prometheus to pull.

Pushgateway – accepts metrics pushed by short-lived jobs and exposes them for Prometheus to scrape.

Alertmanager – handles alert routing, silencing, and grouping.

Ad‑hoc query interface – for manual data inspection.

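Data flow: exporters expose metrics → Prometheus scrapes them (short-lived jobs push to the Pushgateway, which Prometheus scrapes in turn) → samples are stored in the TSDB → Prometheus evaluates alerting rules and sends firing alerts to Alertmanager → Alertmanager groups and routes notifications, while Grafana or another UI visualises the data.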
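
To make the first step of this flow concrete, the sketch below renders a few samples in the Prometheus text exposition format, which is what an exporter serves over HTTP for the server to scrape. The metric name demo_requests_total, its labels, and the helper render_metric are hypothetical and chosen only for illustration; escaping of special characters in label values is left out.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: this sketch does not escape quotes or backslashes in label values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Label values are double-quoted; keys are sorted for stable output.
            body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "demo_requests_total",
    "Total HTTP requests handled.",
    "counter",
    [
        ({"method": "GET", "code": "200"}, 1027),
        ({"method": "POST", "code": "500"}, 3),
    ],
)
print(text)
```

Each scrape returns the current value of every series; Prometheus attaches the timestamp and stores the sample in its TSDB.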

6. Metric Monitoring Objects

Monitoring objects are layered. The guide focuses on four categories:

Host monitoring – CPU, memory, disk, availability, service status, network.

Container environment monitoring – container CPU, memory, events.

Application service monitoring – HTTP endpoints, JVM, thread pools, connection pools, business KPIs.

Third‑party interface monitoring – response time, availability, success rate.

6.1 Host Monitoring

Key aspects (the USE method):

Utilisation (average busy time)

Saturation (queue length)

Errors (error count)

Typical resources:

CPU – use node_cpu_seconds_total and alert when average usage over 5 minutes exceeds 60 %:

100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
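
The arithmetic behind this expression can be sketched in a few lines. The code below is a simplified stand-in, not Prometheus's implementation: simple_rate handles counter resets but skips the boundary extrapolation that PromQL applies, and the two per-core sample series are invented.

```python
def simple_rate(samples):
    """Per-second increase of a counter across a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplified: handles counter resets but does not extrapolate to the
    window boundaries the way PromQL does.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero.
        increase += cur if cur < prev else cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def cpu_usage_percent(idle_samples_per_core):
    """100 minus the average idle fraction across cores, as a percentage."""
    idle_rates = [simple_rate(s) for s in idle_samples_per_core]
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


# Two hypothetical cores sampled over a 5-minute (300 s) window.
core0 = [(0, 1000.0), (150, 1060.0), (300, 1120.0)]  # idle for 120 s
core1 = [(0, 2000.0), (150, 2030.0), (300, 2060.0)]  # idle for 60 s
usage = cpu_usage_percent([core0, core1])
print(f"{usage:.1f}% busy, alert={usage > 60}")
```

Because the idle counter grows by one second per second of idle time, its per-second rate is the idle fraction of the core, which is why subtracting it from 100 gives the busy percentage.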

Load – alert when the 5-minute load average (node_load5) exceeds twice the CPU core count:

node_load5 > on (instance) 2 * count by(instance)(node_cpu_seconds_total{mode="idle"})

Memory – total, free, buffers, cached. Alert when used > 80 %:

100 - sum(node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"}) by (instance) / sum(node_memory_MemTotal_bytes{job="node-exporter"}) by (instance) * 100 > 80
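
node_exporter derives these gauges from /proc/meminfo. As a rough sketch of the same calculation outside Prometheus, the snippet below parses a meminfo-style text and applies the formula above. The helper names and the sample figures are invented for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style lines into a dict of byte values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        amount = int(fields[0])
        # Most entries carry a kB unit; entries without a unit are plain counts.
        if len(fields) > 1 and fields[1] == "kB":
            amount *= 1024
        values[key.strip()] = amount
    return values


def memory_used_percent(mem):
    """Used memory as a percentage, treating free, buffers, and cache as available."""
    reclaimable = mem["MemFree"] + mem["Buffers"] + mem["Cached"]
    return 100 - reclaimable / mem["MemTotal"] * 100


sample = """MemTotal:       16000000 kB
MemFree:         1000000 kB
Buffers:          500000 kB
Cached:          1500000 kB"""

used = memory_used_percent(parse_meminfo(sample))
print(f"used={used}% alert={used > 80}")
```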

Disk – alert when utilisation exceeds 80 % and predict_linear, fitted over the last hour, projects that free space will run out within 4 hours:

(100 - (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100) > 80) and (predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!=""}[1h], 4*3600) < 0)
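
predict_linear fits a least-squares line through the samples in the lookback window and extrapolates it forward. A minimal re-implementation of that idea, assuming a made-up filesystem that loses 1 GiB of free space every 10 minutes, might look like this:

```python
def predict_linear(samples, eval_time, horizon_seconds):
    """Least-squares linear prediction of a gauge.

    samples: list of (timestamp_seconds, value) inside the lookback window.
    Returns the fitted value at eval_time + horizon_seconds.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + horizon_seconds)


GIB = 1024 ** 3
# One hour of hypothetical samples, one every 10 minutes: 20 GiB down to 14 GiB.
window = [(t, (20 - t // 600) * GIB) for t in range(0, 3601, 600)]
predicted = predict_linear(window, eval_time=3600, horizon_seconds=4 * 3600)

used_percent = 86.0  # hypothetical current utilisation
alert = used_percent > 80 and predicted < 0
print(f"predicted free in 4h: {predicted / GIB:.1f} GiB, alert={alert}")
```

Combining the trend with the utilisation threshold avoids paging for a nearly empty disk that is filling quickly for a short burst, and for a full disk that has stopped growing.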

Disk I/O – percentage of time the disks were not busy with I/O, averaged per instance (a low value signals saturation):

100 - (avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100)

Availability – up{job="node-exporter"} == 0 indicates host down.

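Service status – the expression below matches while docker.service is active; alert on == 0 to detect a stopped service: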

node_systemd_unit_state{name="docker.service",state="active"} == 1

Network – inbound and outbound traffic per instance, excluding virtual and loopback interfaces, plus the number of ESTABLISHED TCP connections:

((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100)

((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100)

node_netstat_Tcp_CurrEstab
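
Device filters like these depend on the fact that Prometheus regex matchers are fully anchored. The sketch below uses Python's re.fullmatch as a stand-in (Prometheus uses RE2 syntax, but these simple patterns behave the same way) to show why a trailing * on a literal, as in virbr*, behaves differently from .*. The device names are hypothetical.

```python
import re


def excluded(device, pattern):
    """True if the whole device name matches, mirroring anchored matchers."""
    return re.fullmatch(pattern, device) is not None


# "virbr*" means "virb" followed by zero or more "r", so "virbr0" slips through.
loose = r"tap.*|veth.*|br.*|docker.*|virbr*|lo*"
# "virbr.*" matches any name starting with "virbr"; "lo" matches loopback only.
strict = r"tap.*|veth.*|br.*|docker.*|virbr.*|lo"

devices = ["eth0", "lo", "docker0", "veth1a2b", "virbr0", "br-1f3c"]
for dev in devices:
    print(f"{dev:10} loose={excluded(dev, loose)!s:5} strict={excluded(dev, strict)}")
```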

6.2 Container Monitoring

Containers are the runtime substrate in cloud‑native environments. Important metrics:

CPU – usage ratio (used / limit). Example query:

sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (workload, workload_type,namespace,pod) / sum(kube_pod_container_resource_limits_cpu_cores * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (workload, workload_type,namespace,pod) * 100 > 80

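Memory – working set vs. limit. Selecting only real containers (container!="" and container!="POD") excludes the pod-level cgroup and pause-container series, which would otherwise double-count usage:

sum(container_memory_working_set_bytes{container!="",container!="POD"} * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (namespace,pod) / sum(kube_pod_container_resource_limits_memory_bytes * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (namespace,pod) * 100 > 80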
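
Both container queries follow the same pattern: sum usage and limits per pod, divide, and keep the pods above 80 %. A plain-Python sketch of that pattern follows; the namespaces, pods, and values are hypothetical, and over_limit is an illustrative helper rather than anything Prometheus provides.

```python
def over_limit(usage, limits, threshold_percent=80.0):
    """Return {(namespace, pod): percent} for pods above the threshold.

    usage and limits map (namespace, pod, container) to a numeric value.
    """
    def sum_by_pod(series):
        totals = {}
        for (namespace, pod, _container), value in series.items():
            key = (namespace, pod)
            totals[key] = totals.get(key, 0.0) + value
        return totals

    used, limit = sum_by_pod(usage), sum_by_pod(limits)
    result = {}
    for key, value in used.items():
        # Pods with no limit series have nothing to divide by and drop out;
        # zero limits are skipped here to avoid a division by zero.
        if key in limit and limit[key] > 0:
            percent = value / limit[key] * 100
            if percent > threshold_percent:
                result[key] = percent
    return result


MIB = 1024 ** 2
usage = {
    ("shop", "cart-1", "app"): 900 * MIB,
    ("shop", "cart-1", "sidecar"): 50 * MIB,
    ("shop", "web-1", "app"): 100 * MIB,
}
limits = {
    ("shop", "cart-1", "app"): 1000 * MIB,
    ("shop", "cart-1", "sidecar"): 100 * MIB,
    ("shop", "web-1", "app"): 512 * MIB,
}
hot = over_limit(usage, limits)
print(hot)
```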

Events – Kubernetes pod events (Warning vs Normal). Monitored via kube-eventer.

6.3 Application Service Monitoring

Key metrics include HTTP endpoint health, request latency, QPS, success rate, JVM statistics (GC count, GC time, memory regions, thread count, deadlocks), thread‑pool usage, connection‑pool usage, and business‑specific KPIs (e.g., PV, order volume).

Implementation steps (Spring Boot example):

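Add Maven dependencies (simpleclient_hotspot for the JVM collectors, simpleclient_spring_boot for the Spring Boot 1.x annotations used in the main class):

<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_hotspot</artifactId>
  <version>0.6.0</version>
</dependency>
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_spring_boot</artifactId>
  <version>0.6.0</version>
</dependency>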

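Initialize the JVM exporter once at startup, inside any Spring-managed bean:

@PostConstruct
public void initJvmExporter() {
    // Registers the default JVM collectors (memory pools, GC, threads, class loading).
    io.prometheus.client.hotspot.DefaultExports.initialize();
}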

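Configure the management port and endpoint path in application.properties (Spring Boot 1.x property names; the path must match the prometheus.io/path annotation):

management.port=8081
endpoints.prometheus.path=/prometheus-metrics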

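Enable the Prometheus endpoint in the main class:

import io.prometheus.client.spring.boot.EnablePrometheusEndpoint;
import io.prometheus.client.spring.boot.EnableSpringBootMetricsCollector;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@EnablePrometheusEndpoint
@EnableSpringBootMetricsCollector
public class PrometheusDemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(PrometheusDemoApplication.class, args);
    }
}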

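Add annotations to the Kubernetes Service so that Prometheus can discover it automatically (this works only when the Prometheus scrape configuration contains relabeling rules that honour these annotations):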

prometheus.io/scrape: 'true'
prometheus.io/path: '/prometheus-metrics'
prometheus.io/port: '8081'
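
These annotations are a convention, not a built-in Prometheus feature: relabeling rules in the scrape configuration read them and assemble the target address. The sketch below mimics that assembly in Python to show how the three values combine. The helper scrape_url, its default port and path, and the IP address are assumptions made for illustration.

```python
def scrape_url(target_ip, annotations, default_port=80, default_path="/metrics"):
    """Build the scrape URL from prometheus.io/* annotations.

    Returns None when the target has not opted in with prometheus.io/scrape.
    """
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port", str(default_port))
    path = annotations.get("prometheus.io/path", default_path)
    scheme = annotations.get("prometheus.io/scheme", "http")
    return f"{scheme}://{target_ip}:{port}{path}"


annotations = {
    "prometheus.io/scrape": "true",
    "prometheus.io/path": "/prometheus-metrics",
    "prometheus.io/port": "8081",
}
url = scrape_url("10.0.0.12", annotations)
print(url)
```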

6.4 Third‑Party Interface Monitoring

Monitor response time, availability, and success rate of external APIs using blackbox_exporter. This provides a unified view of internal and external service health and aids rapid root‑cause analysis.
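
The three signals named here (response time, availability, success rate) come down to timing a request and recording whether it succeeded. The sketch below is not blackbox_exporter itself, only an illustration of what a probe measures; the URL is a placeholder, and a fake fetcher replaces real HTTP calls so the example runs offline.

```python
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    success: bool
    duration_seconds: float


def http_get_status(url, timeout=5.0):
    """Default fetcher: return the HTTP status code of a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def probe(url, fetch=http_get_status):
    """Time one request; success means a 2xx status and no exception."""
    start = time.monotonic()
    try:
        ok = 200 <= fetch(url) < 300
    except Exception:
        # Timeouts, connection errors, and HTTP errors all count as failures.
        ok = False
    return ProbeResult(ok, time.monotonic() - start)


def success_rate(results):
    """Fraction of probes that succeeded."""
    return sum(r.success for r in results) / len(results)


# Offline demonstration: a fake fetcher returns a scripted status sequence.
fake_statuses = iter([200, 200, 503, 200])
results = [
    probe("https://api.example.com/health", fetch=lambda url: next(fake_statuses))
    for _ in range(4)
]
print(f"success rate: {success_rate(results):.0%}")
```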

7. Alerting

Define thresholds and severity levels to avoid noise. Prometheus evaluates PromQL expressions periodically and sends alerts to Alertmanager when conditions are met.

Alertmanager groups, silences, and routes alerts to various receivers (email, DingTalk, WeChat, webhook) based on severity.
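
As a simplified model of what Alertmanager does with incoming alerts, the sketch below drops silenced alerts, picks a receiver by severity, and groups the rest by alert name so that one notification covers many hosts. The severity-to-receiver mapping, the alert names, and the helper functions are hypothetical; real routing is configured declaratively in Alertmanager, not written in Python.

```python
from collections import defaultdict

# Hypothetical mapping from severity label to receiver name.
ROUTES = {"critical": "dingtalk", "warning": "email"}
DEFAULT_RECEIVER = "webhook"


def is_silenced(labels, silences):
    """A silence is a dict of label matchers; every matcher must equal the label."""
    return any(all(labels.get(k) == v for k, v in s.items()) for s in silences)


def dispatch(alerts, silences=(), group_by=("alertname",)):
    """Return {receiver: {group_key: [alerts]}} after silencing and grouping."""
    notifications = defaultdict(lambda: defaultdict(list))
    for alert in alerts:
        labels = alert["labels"]
        if is_silenced(labels, silences):
            continue
        receiver = ROUTES.get(labels.get("severity"), DEFAULT_RECEIVER)
        group_key = tuple(labels.get(name, "") for name in group_by)
        notifications[receiver][group_key].append(alert)
    return {receiver: dict(groups) for receiver, groups in notifications.items()}


alerts = [
    {"labels": {"alertname": "HostDown", "instance": "node-1", "severity": "critical"}},
    {"labels": {"alertname": "HostDown", "instance": "node-2", "severity": "critical"}},
    {"labels": {"alertname": "HighCPU", "instance": "node-3", "severity": "warning"}},
    {"labels": {"alertname": "HighCPU", "instance": "node-4", "severity": "warning"}},
]
sent = dispatch(alerts, silences=[{"instance": "node-4"}])
for receiver, groups in sent.items():
    for key, members in groups.items():
        print(receiver, key, [a["labels"]["instance"] for a in members])
```

Grouping is what keeps a rack-wide outage from producing one message per host: the two HostDown alerts arrive as a single notification.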

8. Incident Handling Process

After an alert fires, a predefined on‑call workflow ensures timely response.

8.1 Fault Severity Classification

Level 1 – Critical: business impact > 1 hour, recovery > 24 hours.

Level 2 – Major: business impact > 1 hour, recovery within 24 hours.

Level 3 – Minor: performance degradation, recovery within 12 hours.

Level 4 – Warning: non‑critical issues, can be handled without affecting overall operation.
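
One way to apply this classification consistently is to encode it as a small function. The sketch below is one reading of the four levels, with assumptions: business impact and expected recovery time are taken as inputs for Levels 1 and 2, any performance degradation without longer business impact maps to Level 3 (the 12-hour figure is treated as its recovery target, not as an input), and everything else is Level 4.

```python
from enum import IntEnum


class Severity(IntEnum):
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3
    WARNING = 4


def classify(business_impact_hours, expected_recovery_hours, performance_degraded=False):
    """Map an incident to a severity level (one interpretation of the levels)."""
    if business_impact_hours > 1:
        # Levels 1 and 2 differ only in how long recovery is expected to take.
        if expected_recovery_hours > 24:
            return Severity.CRITICAL
        return Severity.MAJOR
    if performance_degraded:
        return Severity.MINOR
    return Severity.WARNING


for case in [(2, 30, False), (2, 8, False), (0, 4, True), (0, 0, False)]:
    print(case, "->", classify(*case).name)
```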

8.2 Fault Handling Procedure

Fault discovery: record time, reporter, and initial assessment.

Fault analysis: check recent changes, determine impact scope, locate root cause, and assign severity.

Fault escalation: follow escalation matrix based on severity (see attached table).

Resolution: apply remediation, verify recovery, and close the incident.
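
The procedure above is a linear workflow, which can be modelled as a small state machine that records what happened at each stage. The sketch below is illustrative only: the Incident class, its field names, and the sample notes are hypothetical, and a real on-call tool would add owners, timestamps per stage, and the escalation matrix.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

STAGES = ["discovery", "analysis", "escalation", "resolution", "closed"]


@dataclass
class Incident:
    reporter: str
    assessment: str
    discovered_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    stage: str = "discovery"
    history: list = field(default_factory=list)

    def advance(self, note):
        """Record a note for the current stage and move to the next one."""
        index = STAGES.index(self.stage)
        if index == len(STAGES) - 1:
            raise ValueError("incident is already closed")
        self.history.append((self.stage, note))
        self.stage = STAGES[index + 1]
        return self.stage


incident = Incident(reporter="on-call engineer", assessment="checkout latency is high")
incident.advance("recorded time, reporter, and initial assessment")
incident.advance("recent deploy identified as root cause; severity assigned")
incident.advance("escalated according to severity")
incident.advance("rolled back, recovery verified")
print(incident.stage, len(incident.history))
```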

Images illustrating the escalation matrix and process flow are included in the original article.
