Mastering Monitoring: From Fundamentals to Prometheus in Cloud‑Native Environments
This comprehensive guide explains the purpose, models, and methods of monitoring across the entire software lifecycle, compares health checks, logging, tracing, and metric collection, and details practical implementations using tools like ELK, SkyWalking, and Prometheus for cloud‑native operations.
Purpose of Monitoring
Monitoring spans the whole application lifecycle—from design and development to deployment and decommissioning—and serves both technical teams and business needs by providing early fault detection, issue diagnosis, and post‑incident analysis.
Technical: Understand environment status, detect and resolve faults.
Business: Ensure continuous service operation.
Key objectives include 24/7 real‑time observation, timely status feedback, platform stability, service reliability, and sustained business operation.
Monitoring Models
Business monitoring: Define and alert on business metrics such as growth rate and error rate.
Application monitoring: Use probes (external checks) and introspection (internal metrics) to report status, components, transactions, and performance.
Operating system monitoring: Track resource usage and errors (e.g., CPU utilization, load).
Monitoring Methods
Health checks – verify service liveness.
Log analysis – locate and resolve issues.
Tracing – capture full request flow and latency.
Metric collection – aggregate time‑series data for trend analysis.
Health checks are often provided by cloud platforms, logs are collected by dedicated log centers, tracing relies on specialized solutions, and metric monitoring uses exporters to expose data for aggregation and alerting.
Note: The following focuses on metric monitoring.
Monitoring Tool Selection
Health Checks
Configure health checks directly in the cloud platform.
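On Kubernetes, for instance, a health check is declared as a probe on the pod spec; a minimal sketch (the path, port, and timings here are placeholders, not values from this article):

```yaml
livenessProbe:            # restart the container when this check fails
  httpGet:
    path: /healthz        # placeholder health endpoint
    port: 8080            # placeholder container port
  initialDelaySeconds: 10 # grace period before the first check
  periodSeconds: 15       # check interval
```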
Log Management
Adopt mature open‑source solutions such as the ELK stack.
Tracing
Common tools include SkyWalking, Zipkin, Pinpoint, Elastic APM, and Cat. SkyWalking and Elastic APM are non‑intrusive and suitable for cloud‑native environments, supporting multiple languages (Java, Node.js, Go, etc.).
Metric Monitoring
In traditional setups, Zabbix is preferred, while cloud‑native environments favor Prometheus for its strong community support, simple deployment, pull‑based data collection, powerful data model, PromQL query language, extensive ecosystem, and high performance. Prometheus is not ideal for scenarios requiring 100% data accuracy, such as per‑request billing.
Prometheus Monitoring System Overview
The overall architecture includes:
Prometheus Server – scrapes metrics and stores time‑series data.
Exporter – exposes metrics for scraping.
Pushgateway – receives metrics pushed by agents.
Alertmanager – handles alert routing and silencing.
Web UI / Grafana / API clients – provide ad‑hoc query and visualization capabilities.
Data flow: Prometheus scrapes or receives data via Pushgateway, stores it in TSDB, evaluates rules, triggers alerts via Alertmanager, and visualizes data with tools like Grafana.
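As a minimal sketch of this flow, a Prometheus configuration wires scrape targets and Alertmanager together (the job name and all addresses below are placeholders):

```yaml
# prometheus.yml — minimal sketch
global:
  scrape_interval: 15s       # how often Prometheus pulls metrics
  evaluation_interval: 15s   # how often rules are evaluated

scrape_configs:
  - job_name: node-exporter            # hypothetical job name
    static_configs:
      - targets: ['192.168.1.10:9100'] # placeholder exporter address

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093'] # placeholder Alertmanager address
```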
Metric Monitoring Targets
Monitoring objects are categorized into host, container, application service, and third‑party interface.
Host: CPU, memory, disk, availability, service status, network.
Container: CPU, memory, events.
Application service: HTTP endpoints, JVM, thread pool, connection pool, business metrics.
Third‑party interface: response time, availability, success rate.
Host Monitoring
Key metrics include usage, saturation, and error counts, collected via node-exporter and queried with PromQL.
CPU
Usage metric: node_cpu_seconds_total. Example alert for >80% usage:
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
Saturation metric: node_load. Example alert for the 5‑minute load exceeding twice the CPU count:
node_load5 > on(instance) 2 * count by (instance)(node_cpu_seconds_total{mode="idle"})
Memory
Usage is calculated from free, buffer, and cache memory. Example alert for >80% usage:
100 - sum(node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"}) by (instance) / sum(node_memory_MemTotal_bytes{job="node-exporter"}) by (instance) * 100 > 80
Disk
Monitor growth trends rather than raw usage. Example prediction of exhaustion within 4 hours (fires when the predicted free space drops below zero):
predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!=""}[1h], 4*3600) < 0
Combine with a usage threshold:
(100 - (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100) > 80) and (predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!="",device!="rootfs"}[1h], 4*3600) < 0)
Network
Incoming traffic:
sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*'}[5m])) by (instance) / 100
Outgoing traffic:
sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*'}[5m])) by (instance) / 100
TCP established connections:
node_netstat_Tcp_CurrEstab
Container Monitoring
Metrics are collected via cAdvisor (integrated in kubelet). Example CPU usage alert:
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (workload, workload_type, namespace, pod) / sum(kube_pod_container_resource_limits_cpu_cores * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (workload, workload_type, namespace, pod) * 100 > 80
Memory usage alert:
sum(container_memory_working_set_bytes * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (namespace,pod) / sum(kube_pod_container_resource_limits_memory_bytes * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (namespace,pod) * 100 > 80
Application Service Monitoring
Monitor HTTP endpoints, JVM metrics (GC, memory, threads), thread pool, connection pool, and business KPIs (e.g., PV, order volume). Use blackbox_exporter for endpoint health, and expose JVM metrics via the simpleclient_hotspot library.
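To show what an exposed business metric looks like on the wire, here is a stdlib‑only sketch that renders a counter in the Prometheus text exposition format. The metric name orders_total is illustrative; in a real service the simpleclient library renders this format for you:

```java
public class ExpositionSketch {
    // Render one counter sample in the Prometheus text exposition format:
    // a HELP line, a TYPE line, then the sample itself.
    static String renderCounter(String name, String help, double value) {
        return "# HELP " + name + " " + help + "\n"
             + "# TYPE " + name + " counter\n"
             + name + " " + value + "\n";
    }

    public static void main(String[] args) {
        // Hypothetical business KPI: total orders processed.
        System.out.print(renderCounter("orders_total", "Total orders processed.", 42.0));
    }
}
```

This is exactly the format Prometheus parses when it scrapes an application's metrics endpoint.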
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_hotspot</artifactId>
    <version>0.6.0</version>
</dependency>

@PostConstruct
public void initJvmExporter() {
    io.prometheus.client.hotspot.DefaultExports.initialize();
}

management.port: 8081
endpoints.prometheus.path: prometheus-metrics

Enable the Prometheus endpoint in Spring Boot:
@SpringBootApplication
@EnablePrometheusEndpoint
@EnableSpringBootMetricsCollector
public class PrometheusDemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(PrometheusDemoApplication.class, args);
    }
}

Add pod annotations for automatic discovery:

prometheus.io/scrape: 'true'
prometheus.io/path: '/prometheus-metrics'
prometheus.io/port: '8081'

Third‑Party Interface Monitoring
Track response time, availability, and success rate using blackbox_exporter.
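A common setup, sketched here with placeholder addresses, probes a third‑party URL through blackbox_exporter's http_2xx module and relabels the target so that probe_success and probe_duration_seconds carry the probed URL as the instance label:

```yaml
# blackbox.yml — probe module definition (sketch)
modules:
  http_2xx:
    prober: http
    timeout: 5s

# prometheus.yml scrape job for the probes (sketch)
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://api.example.com/health']  # placeholder third-party URL
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # pass the URL as the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance           # keep the URL as the instance label
      - target_label: __address__
        replacement: 'blackbox-exporter:9115'  # placeholder exporter address
```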
Alerting and Notification
Define thresholds and severity levels to avoid noisy alerts. Prometheus evaluates alert rules via PromQL and sends alerts to Alertmanager, which groups, silences, and routes them to receivers (email, DingTalk, WeChat, webhook, etc.).
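For example, the host CPU expression from earlier can be packaged as a rule file loaded by Prometheus; in this sketch the file name, for duration, and severity label are assumptions:

```yaml
# rules/host.yml — hypothetical rule file referenced from prometheus.yml via rule_files
groups:
  - name: host-alerts
    rules:
      - alert: HostHighCpuUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
        for: 5m                 # condition must hold for 5 minutes before firing
        labels:
          severity: warning     # assumed severity scheme
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"
```

The for clause and severity labels are the main levers for keeping alerts from becoming noisy.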
Fault Handling Process
Establish clear fault levels (1‑4) and a structured response workflow: detection, investigation, escalation, and resolution, with on‑call mechanisms to ensure timely handling.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career, growing together.