Mastering Monitoring: From Fundamentals to Prometheus in Cloud‑Native Environments

This comprehensive guide explains the purpose, models, and methods of monitoring across the entire software lifecycle, compares health checks, logging, tracing, and metric collection, and details practical implementations using tools like ELK, SkyWalking, and Prometheus for cloud‑native operations.


Purpose of Monitoring

Monitoring spans the whole application lifecycle—from design and development to deployment and decommissioning—and serves both technical teams and business needs by providing early fault detection, issue diagnosis, and post‑incident analysis.

Technical: Understand environment status, detect and resolve faults.

Business: Ensure continuous service operation.

Key objectives include 24/7 real‑time observation, timely status feedback, platform stability, service reliability, and sustained business operation.

Monitoring Models

Business monitoring: Define and alert on business metrics such as growth rate and error rate.

Application monitoring: Use probes (external checks) and introspection (internal metrics) to report status, components, transactions, and performance.

Operating system monitoring: Track resource usage and errors (e.g., CPU utilization, load).

Monitoring Methods

Health checks – verify service liveness.

Log analysis – locate and resolve issues.

Tracing – capture full request flow and latency.

Metric collection – aggregate time‑series data for trend analysis.

Health checks are often provided by cloud platforms, logs are collected by dedicated log centers, tracing relies on specialized solutions, and metric monitoring uses exporters to expose data for aggregation and alerting.

Note: The following focuses on metric monitoring.

Monitoring Tool Selection

Health Checks

Configure health checks directly in the cloud platform.

Log Management

Adopt mature open‑source solutions such as the ELK stack.

Tracing

Common tools include SkyWalking, Zipkin, Pinpoint, Elastic APM, and CAT. SkyWalking and Elastic APM are non‑intrusive and suitable for cloud‑native environments, supporting multiple languages (Java, Node.js, Go, etc.).

Metric Monitoring

In traditional setups Zabbix is preferred, while cloud‑native environments favor Prometheus due to its strong community support, simple deployment, pull‑based data collection, powerful data model, PromQL query language, extensive ecosystem, and high performance. Prometheus is not ideal for scenarios requiring 100% data accuracy.
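To make the pull model concrete, a minimal prometheus.yml might look like the sketch below; the job name, target addresses, and file names are placeholders rather than part of any specific deployment:

global:
  scrape_interval: 15s            # how often targets are pulled
  evaluation_interval: 15s        # how often rules are evaluated

rule_files:
  - alert-rules.yml               # alerting rules (an example appears below)

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'node-exporter'     # host metrics; addresses are placeholders
    static_configs:
      - targets: ['10.0.0.1:9100', '10.0.0.2:9100']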

Prometheus Monitoring System Overview

The overall architecture includes:

Prometheus Server – scrapes metrics and stores time‑series data.

Exporter – exposes metrics for scraping.

Pushgateway – receives metrics pushed by agents.

Alertmanager – handles alert routing and silencing.

Ad‑hoc query clients – run PromQL queries directly against the server (web UI, HTTP API).

Data flow: Prometheus scrapes or receives data via Pushgateway, stores it in TSDB, evaluates rules, triggers alerts via Alertmanager, and visualizes data with tools like Grafana.
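As a sketch of the rule‑evaluation step, an alert-rules.yml along these lines (thresholds, durations, and labels are illustrative; the expression anticipates the host CPU example later in this article) would be loaded via rule_files and routed to Alertmanager when it fires:

groups:
  - name: host-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m                           # must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"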

Metric Monitoring Targets

Monitoring objects are categorized into host, container, application service, and third‑party interface.

Host: CPU, memory, disk, availability, service status, network.

Container: CPU, memory, events.

Application service: HTTP endpoints, JVM, thread pool, connection pool, business metrics.

Third‑party interface: response time, availability, success rate.

Host Monitoring

Key metrics include usage, saturation, and error counts, collected via node-exporter and queried with PromQL.

CPU

Usage metric: node_cpu_seconds_total. Example alert for >80% usage:

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

Saturation metrics: node_load1 / node_load5 / node_load15. Example alert for the 5‑minute load exceeding twice the CPU count (counting the idle series yields one entry per CPU):

node_load5 > on(instance) 2 * count by(instance) (node_cpu_seconds_total{mode="idle"})

Memory

Usage is calculated from free, buffer, and cache memory relative to total memory. Example alert for >80% usage:

100 - (sum(node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"}) by (instance) / sum(node_memory_MemTotal_bytes{job="node-exporter"}) by (instance)) * 100 > 80

Disk

Monitor growth trends rather than raw usage. predict_linear fits the last hour of free‑space data and extrapolates 4 hours ahead; a negative result predicts exhaustion within that window:

predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!=""}[1h], 4*3600)

Combine with usage threshold:

(100 - (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100) > 80) and (predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!="",device!="rootfs"}[1h],4*3600) < 0)

Network

Incoming traffic:

(sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*'}[5m])) by (instance)) / 100

Outgoing traffic:

(sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*'}[5m])) by (instance)) / 100

TCP established connections:

node_netstat_Tcp_CurrEstab

Container Monitoring

Metrics are collected via cAdvisor (integrated into the kubelet). Note that the expressions below also depend on kube-state-metrics series and recording rules (e.g., mixin_pod_workload) shipped with the kubernetes-mixin/kube-prometheus stack. Example CPU usage alert:

sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (workload, workload_type,namespace,pod) / sum(kube_pod_container_resource_limits_cpu_cores * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (workload, workload_type,namespace,pod) * 100 > 80

Memory usage alert:

sum(container_memory_working_set_bytes * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (namespace,pod) / sum(kube_pod_container_resource_limits_memory_bytes * on(namespace,pod) group_left(workload, workload_type) mixin_pod_workload) by (namespace,pod) * 100 > 80
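On the collection side, a minimal sketch of scraping cAdvisor directly from the kubelet, assuming an in‑cluster Prometheus whose service account is permitted to reach the kubelet API:

scrape_configs:
  - job_name: 'kubernetes-cadvisor'
    scheme: https
    metrics_path: /metrics/cadvisor          # cAdvisor metrics exposed by the kubelet
    kubernetes_sd_configs:
      - role: node                           # one target per cluster node
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token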

Application Service Monitoring

Monitor HTTP endpoints, JVM metrics (GC, memory, threads), thread pools, connection pools, and business KPIs (e.g., PV, order volume). Use blackbox_exporter for endpoint health and expose JVM metrics via the simpleclient_hotspot library.

Maven dependency:

<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_hotspot</artifactId>
    <version>0.6.0</version>
</dependency>

Register the JVM collectors at application startup:

@PostConstruct
public void initJvmExporter() {
    io.prometheus.client.hotspot.DefaultExports.initialize();
}

Expose the metrics endpoint in the application configuration:

management.port: 8081
endpoints.prometheus.path: prometheus-metrics

Enable the Prometheus endpoint in Spring Boot (the annotations come from the simpleclient_spring_boot module):

import io.prometheus.client.spring.boot.EnablePrometheusEndpoint;
import io.prometheus.client.spring.boot.EnableSpringBootMetricsCollector;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@EnablePrometheusEndpoint               // exposes the Prometheus scrape endpoint
@EnableSpringBootMetricsCollector       // publishes Spring Boot metrics to it
public class PrometheusDemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(PrometheusDemoApplication.class, args);
    }
}

Add pod annotations so Prometheus can discover and scrape the service automatically:

prometheus.io/scrape: 'true'
prometheus.io/path: '/prometheus-metrics'
prometheus.io/port: '8081'
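These annotations only take effect if the scrape configuration honors them; a commonly used relabel block, adapted from the stock Prometheus Kubernetes example config, looks like this:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods that opt in with prometheus.io/scrape: 'true'
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # take the metrics path from prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # rewrite the scrape address to use the prometheus.io/port value
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__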

Third‑Party Interface Monitoring

Track response time, availability, and success rate using blackbox_exporter.
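A sketch of the probing setup, assuming blackbox_exporter listens at blackbox-exporter:9115 and using a hypothetical third‑party URL as the target:

scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]                     # HTTP probe expecting a 2xx response
    static_configs:
      - targets:
          - https://partner.example.com/api/health   # hypothetical endpoint
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target         # pass the URL as the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance               # keep the probed URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115  # where blackbox_exporter actually listens

probe_success then reflects availability, probe_duration_seconds the response time, and a success rate can be derived with, for example, avg_over_time(probe_success[5m]).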

Alerting and Notification

Define thresholds and severity levels to avoid noisy alerts. Prometheus evaluates alert rules via PromQL and sends alerts to Alertmanager, which groups, silences, and routes them to receivers (email, DingTalk, WeChat, webhook, etc.).
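As a sketch of that routing, an alertmanager.yml along these lines groups alerts and fans them out to receivers; all names and addresses are placeholders:

route:
  receiver: 'default-email'
  group_by: ['alertname', 'instance']   # batch related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h                   # re-send while an alert stays firing
  routes:
    - match:
        severity: critical
      receiver: 'oncall-webhook'

receivers:
  - name: 'default-email'
    email_configs:
      - to: 'ops@example.com'
  - name: 'oncall-webhook'
    webhook_configs:
      - url: 'http://webhook-bridge:8060/send'   # hypothetical DingTalk/WeChat bridge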

Fault Handling Process

Establish clear fault levels (1‑4) and a structured response workflow: detection, investigation, escalation, and resolution, with on‑call mechanisms to ensure timely handling.
