Mastering End‑to‑End Monitoring: From Purpose to Prometheus Implementation
This guide explains why monitoring is essential throughout a product lifecycle, outlines monitoring modes and methods, compares health checks, logs, tracing and metric solutions, and provides a detailed Prometheus‑based monitoring architecture with concrete metric definitions, alerting rules, and incident‑response procedures.
1. Purpose of Monitoring
Monitoring is a core part of operations and runs through the entire product lifecycle. It aims to give early warning before failures occur, help locate problems during incidents, and supply data for post‑mortem analysis.
It serves both technical teams (understanding environment status) and business teams (ensuring continuous service). The core objectives are:
24×7 real‑time monitoring
Timely feedback of system status
Guarantee platform stability
Ensure service safety and reliability
Support continuous business operation
2. Monitoring Modes
From top to bottom, monitoring can be divided into three layers:
Business monitoring – predefined business metrics, growth rate, error rate, etc.
Application monitoring – probes (external checks) and introspection (internal metrics).
Operating‑system monitoring – CPU, load, memory, disk usage, etc.
3. Monitoring Methods
The main methods are:
Health checks: verify whether a service is alive.
Log analysis: use logs to locate and solve problems.
Tracing (call‑chain monitoring): capture the full request path and latency.
Metric monitoring: collect time‑series data for trend analysis and alerting.
Health checks are usually provided by cloud platforms, logs are collected by a log center, tracing has dedicated solutions (e.g., SkyWalking, Elastic APM), and metrics are gathered by exporters and processed by Prometheus.
4. Tool Selection
Health check: configure directly in the cloud platform.
Log: the mature open‑source ELK stack.
Tracing: SkyWalking, Elastic APM, Pinpoint (Java/PHP only), Zipkin, CAT. SkyWalking and Elastic APM are non‑intrusive and support many languages; they are well suited for cloud‑native environments.
Metrics: in traditional environments Zabbix is common; in cloud‑native environments Prometheus is preferred because of its strong community, single‑binary deployment, pull model, powerful PromQL, rich ecosystem, and high performance.
5. Prometheus Monitoring System Overview
The overall architecture consists of:
Prometheus Server – scrapes metrics and stores time‑series data.
Exporter – exposes metrics for Prometheus to pull.
Pushgateway – receives metrics pushed by short‑lived jobs and exposes them for Prometheus to scrape.
Alertmanager – handles alert routing, silencing, and grouping.
Ad‑hoc query interface – for manual data inspection.
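Data flow: exporters expose metrics → Prometheus pulls them (short‑lived jobs push to the Pushgateway, which Prometheus then scrapes) → samples are stored in the TSDB → Prometheus evaluates alerting rules and sends firing alerts to Alertmanager → Grafana or another UI visualises the data.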
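To make the "exporter exposes metrics" step more concrete, the following is a minimal sketch of how a single sample could be rendered as a line of the Prometheus text exposition format. The class ExpositionSketch, its methods, and the metric name demo_http_requests_total are hypothetical names chosen for illustration; real exporters normally rely on a client library instead of formatting lines by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of any Prometheus client library.
class ExpositionSketch {

    /** Renders one sample as `name{label="value",...} value`. */
    static String renderSample(String name, Map<String, String> labels, double value) {
        StringBuilder line = new StringBuilder(name);
        if (!labels.isEmpty()) {
            StringJoiner joined = new StringJoiner(",", "{", "}");
            for (Map.Entry<String, String> e : labels.entrySet()) {
                joined.add(e.getKey() + "=\"" + escape(e.getValue()) + "\"");
            }
            line.append(joined);
        }
        return line.append(' ').append(value).toString();
    }

    /** Escapes backslash, double quote and newline inside a label value. */
    static String escape(String v) {
        return v.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("method", "GET");
        labels.put("code", "200");
        // A scrape response is a sequence of such lines, optionally preceded by TYPE/HELP comments.
        System.out.println("# TYPE demo_http_requests_total counter");
        System.out.println(renderSample("demo_http_requests_total", labels, 1027));
    }
}
```

Prometheus would fetch text like this over HTTP on each scrape and store every line as one sample in its time‑series database.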
6. Metric Monitoring Objects
Monitoring objects are layered. The guide focuses on four categories:
Host monitoring – CPU, memory, disk, availability, service status, network.
Container environment monitoring – container CPU, memory, events.
Application service monitoring – HTTP endpoints, JVM, thread pools, connection pools, business KPIs.
Third‑party interface monitoring – response time, availability, success rate.
6.1 Host Monitoring
Key aspects:
Utilisation (average busy time)
Saturation (queue length)
Errors (error count)
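Utilisation is usually derived from a monotonically increasing counter rather than read directly. The sketch below shows the arithmetic for a single CPU core, assuming two readings of an idle‑seconds counter taken a known number of seconds apart. The class and method names are hypothetical, and counter resets are deliberately ignored.

```java
// Hypothetical illustration of the arithmetic; counter resets are not handled.
class UtilisationSketch {

    /** Per-second increase of a counter between two readings. */
    static double perSecondRate(double earlier, double later, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        return (later - earlier) / elapsedSeconds;
    }

    /**
     * Busy percentage of one core: the idle counter grows by at most one
     * second per second, so its rate is the idle fraction.
     */
    static double cpuBusyPercent(double idleEarlier, double idleLater, double elapsedSeconds) {
        return 100.0 - perSecondRate(idleEarlier, idleLater, elapsedSeconds) * 100.0;
    }

    public static void main(String[] args) {
        // Idle counter moved from 1000 s to 1120 s over a 300 s window:
        // the core was idle for 120 of 300 seconds.
        double busy = cpuBusyPercent(1000.0, 1120.0, 300.0);
        System.out.println("busy percent over the window: " + busy);
    }
}
```

On a multi‑core host the same calculation is done per core and then averaged per instance.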
Typical resources:
CPU – use node_cpu_seconds_total and alert when average usage over 5 minutes exceeds 60 %:
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
Load – compare node_load5 with twice the CPU core count:
node_load5 > on (instance) 2 * count by(instance)(node_cpu_seconds_total{mode="idle"})
Memory – total, free, buffers, cached. Alert when used > 80 %:
100 - sum(node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"}) by (instance) / sum(node_memory_MemTotal_bytes{job="node-exporter"}) by (instance) * 100 > 80
Disk – alert when utilisation exceeds 80 % and the trend of the last hour, extrapolated with predict_linear, shows free space running out within 4 hours:
(100 - (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100) > 80) and (predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!=""}[1h], 4*3600) < 0)
Disk I/O – example metric (percentage of time the disks were not busy with I/O):
100 - (avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100)
Availability – up{job="node-exporter"} == 0 indicates host down.
Service status – for example, whether docker.service is active:
node_systemd_unit_state{name="docker.service",state="active"} == 1
Network – inbound traffic, outbound traffic, and TCP ESTABLISHED count:
((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100)
((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100)
node_netstat_Tcp_CurrEstab
6.2 Container Monitoring
Containers are the runtime substrate in cloud‑native environments. Important metrics:
CPU – usage ratio (used / limit). Example query:
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (workload, workload_type,namespace,pod) / sum(kube_pod_container_resource_limits_cpu_cores * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (workload, workload_type,namespace,pod) * 100 > 80
Memory – used vs limit. Example query:
sum(container_memory_working_set_bytes * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (namespace,pod) / sum(kube_pod_container_resource_limits_memory_bytes * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (namespace,pod) * 100 / 2 > 80
Events – Kubernetes pod events (Warning vs Normal). Monitored via kube-eventer.
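The used/limit ratio behind the CPU and memory items above can be sketched outside PromQL as a per‑pod aggregation: sum usage and limits over a pod's containers, then divide. The ContainerSample type, the pod names, and the figures are hypothetical; this is only an illustration of the arithmetic, not of how Prometheus computes it internally.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a per-pod used/limit ratio.
class ContainerRatioSketch {

    /** One container's current usage and configured limit, in the same unit. */
    static final class ContainerSample {
        final String pod;
        final double used;
        final double limit;

        ContainerSample(String pod, double used, double limit) {
            this.pod = pod;
            this.used = used;
            this.limit = limit;
        }
    }

    /** Sums usage and limits over all containers of a pod and returns used/limit as a percentage. */
    static double podUsagePercent(List<ContainerSample> samples, String pod) {
        double used = 0.0;
        double limit = 0.0;
        for (ContainerSample s : samples) {
            if (s.pod.equals(pod)) {
                used += s.used;
                limit += s.limit;
            }
        }
        if (limit == 0.0) {
            return Double.NaN; // no limit configured, so the ratio is undefined
        }
        return used / limit * 100.0;
    }

    public static void main(String[] args) {
        List<ContainerSample> samples = Arrays.asList(
                new ContainerSample("web-0", 0.5, 0.5),    // CPU cores used / limit
                new ContainerSample("web-0", 0.375, 0.5),
                new ContainerSample("web-1", 0.125, 0.5));
        double percent = podUsagePercent(samples, "web-0");
        System.out.println("web-0 uses " + percent + " % of its limit, alert: " + (percent > 80));
    }
}
```

Pods without a configured limit produce no meaningful ratio, which is why the sketch returns NaN for them instead of a number.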
6.3 Application Service Monitoring
Key metrics include HTTP endpoint health, request latency, QPS, success rate, JVM statistics (GC count, GC time, memory regions, thread count, deadlocks), thread‑pool usage, connection‑pool usage, and business‑specific KPIs (e.g., PV, order volume).
Implementation steps (Spring Boot example):
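Add Maven dependencies. simpleclient_hotspot provides the JVM collectors; the @EnablePrometheusEndpoint and @EnableSpringBootMetricsCollector annotations used in the main class come from simpleclient_spring_boot, so that artifact is needed as well:
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_hotspot</artifactId>
    <version>0.6.0</version>
</dependency>
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_spring_boot</artifactId>
    <version>0.6.0</version>
</dependency>
Initialize exporter in code: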
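@PostConstruct
public void initJvmExporter() {
    // Register the default JVM collectors (GC, memory pools, threads, class loading).
    io.prometheus.client.hotspot.DefaultExports.initialize();
}
Configure port and path in application.properties (these are Spring Boot 1.x property names):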
management.port=8081
endpoints.prometheus.path=prometheus-metrics
Enable Prometheus endpoint in the main class:
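import io.prometheus.client.spring.boot.EnablePrometheusEndpoint;
import io.prometheus.client.spring.boot.EnableSpringBootMetricsCollector;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@EnablePrometheusEndpoint          // exposes the Prometheus scrape endpoint
@EnableSpringBootMetricsCollector  // bridges Spring Boot Actuator metrics to Prometheus
public class PrometheusDemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(PrometheusDemoApplication.class, args);
    }
}
Add service annotations for automatic discovery: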
prometheus.io/scrape: 'true'
prometheus.io/path: '/prometheus-metrics'
prometheus.io/port: '8081'
6.4 Third‑Party Interface Monitoring
Monitor response time, availability, and success rate of external APIs using blackbox_exporter. This provides a unified view of internal and external service health and aids rapid root‑cause analysis.
7. Alerting
Define thresholds and severity levels to avoid noise. Prometheus evaluates PromQL expressions periodically and sends alerts to Alertmanager when conditions are met.
Alertmanager groups, silences, and routes alerts to various receivers (email, DingTalk, WeChat, webhook) based on severity.
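The grouping and severity‑based routing described above can be pictured with the simplified sketch below. It is not Alertmanager's implementation: real routing is driven by label matchers and a group_by list in its configuration. The Alert type, the alert names, and the severity‑to‑receiver mapping are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of grouping and routing; not Alertmanager code.
class AlertRoutingSketch {

    static final class Alert {
        final String name;
        final String instance;
        final String severity;

        Alert(String name, String instance, String severity) {
            this.name = name;
            this.instance = instance;
            this.severity = severity;
        }
    }

    /** Groups alerts that share a name so one notification can cover all affected instances. */
    static Map<String, List<Alert>> groupByName(List<Alert> alerts) {
        Map<String, List<Alert>> groups = new LinkedHashMap<>();
        for (Alert a : alerts) {
            groups.computeIfAbsent(a.name, k -> new ArrayList<>()).add(a);
        }
        return groups;
    }

    /** Hypothetical mapping from severity to receiver; a real mapping is a team decision. */
    static String receiverFor(String severity) {
        switch (severity) {
            case "critical":
                return "dingtalk";
            case "warning":
                return "email";
            default:
                return "webhook";
        }
    }

    public static void main(String[] args) {
        List<Alert> firing = Arrays.asList(
                new Alert("HostDown", "node-1", "critical"),
                new Alert("HostDown", "node-2", "critical"),
                new Alert("HighCpu", "node-3", "warning"));
        for (Map.Entry<String, List<Alert>> group : groupByName(firing).entrySet()) {
            String severity = group.getValue().get(0).severity;
            System.out.println(group.getKey() + " x" + group.getValue().size()
                    + " -> " + receiverFor(severity));
        }
    }
}
```

Grouping keeps a single outage that affects many hosts from producing one notification per host, which is one of the main ways to reduce alert noise.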
8. Incident Handling Process
After an alert fires, a predefined on‑call workflow ensures timely response.
8.1 Fault Severity Classification
Level 1 – Critical: business impact > 1 hour, recovery > 24 hours.
Level 2 – Major: business impact > 1 hour, recovery within 24 hours.
Level 3 – Minor: performance degradation, recovery within 12 hours.
Level 4 – Warning: non‑critical issues, can be handled without affecting overall operation.
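The classification above could be encoded so that tooling assigns a level consistently. The sketch below is one possible reading of the list, under stated assumptions: impact and recovery are measured in hours, "within 24 hours" includes exactly 24, and the 12‑hour figure for Level 3 is treated as a recovery target rather than a test. The class and method names are hypothetical.

```java
// Hypothetical encoding of the severity list; boundary handling is an assumption.
class SeveritySketch {

    /**
     * @param impactHours         how long the business is affected
     * @param recoveryHours       expected time to full recovery
     * @param performanceDegraded whether the service is degraded but still usable
     * @return severity level, 1 (critical) to 4 (warning)
     */
    static int classify(double impactHours, double recoveryHours, boolean performanceDegraded) {
        if (impactHours > 1.0) {
            // Levels 1 and 2 share the impact criterion and differ only in recovery time.
            return recoveryHours > 24.0 ? 1 : 2;
        }
        if (performanceDegraded) {
            return 3; // assumption: the 12-hour recovery figure is a target, not a criterion
        }
        return 4; // non-critical issue
    }

    public static void main(String[] args) {
        System.out.println("long outage, slow recovery: level " + classify(2.0, 30.0, false));
        System.out.println("long outage, fast recovery: level " + classify(2.0, 10.0, false));
        System.out.println("degraded performance: level " + classify(0.5, 5.0, true));
        System.out.println("minor issue: level " + classify(0.0, 1.0, false));
    }
}
```

Writing the rules down this way exposes the cases the list leaves open, such as degraded performance with a recovery time beyond 12 hours, so the team can decide them explicitly.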
8.2 Fault Handling Procedure
Fault discovery: record time, reporter, and initial assessment.
Fault analysis: check recent changes, determine impact scope, locate root cause, and assign severity.
Fault escalation: follow escalation matrix based on severity (see attached table).
Resolution: apply remediation, verify recovery, and close the incident.
Images illustrating the escalation matrix and process flow are included in the original article.
dbaplus Community