Mastering Monitoring: From Basics to Prometheus in Cloud‑Native Environments
Monitoring is essential throughout an application's lifecycle. This article covers its purpose, modes, methods, and tool selection, then walks through a Prometheus implementation for host, container, service, and third-party monitoring, along with alerting strategies and fault-handling processes that keep operations stable and secure.
Monitoring is the most important part of operations and the entire product lifecycle. It aims to provide early warnings before failures, help locate issues during incidents, and supply data for post‑mortem analysis.
1. Purpose of Monitoring
Monitoring spans the whole application lifecycle—from design, development, deployment to decommissioning—and serves both technical and business stakeholders. Its goals are to provide 24/7 real‑time visibility, timely status feedback, platform stability, service reliability, and continuous business operation.
2. Monitoring Modes
Monitoring can be divided from top to bottom into:
Business monitoring
Application monitoring
Operating system monitoring
Business monitoring focuses on business metrics and alerts. Application monitoring includes probes (external checks) and introspection (internal metrics). OS monitoring tracks component usage such as CPU utilization and load.
3. Monitoring Methods
The main methods are:
Health checks – verify service liveness.
Log analysis – provide rich information for troubleshooting.
Tracing – present full request flow, latency, and service chain.
Metric collection – time‑series data that can be aggregated and visualized.
Health checks are usually provided by cloud platforms; logs are collected by dedicated log centers; tracing uses specialized solutions; metric monitoring relies on exporters that expose targets to Prometheus, which then aggregates, visualizes, and alerts on the data.
Note: The following section focuses on metric monitoring.
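To make the exporter model concrete, here is a minimal sketch of a /metrics endpoint in the Prometheus text exposition format, using only the Python standard library. The metric name (app_requests_total) and its value are illustrative, not taken from any real exporter:

```python
# Minimal sketch of a Prometheus-style /metrics endpoint (stdlib only).
# A real exporter would use a client library such as prometheus_client.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics() -> str:
    # Prometheus text exposition format: HELP/TYPE comments, then samples.
    return (
        "# HELP app_requests_total Total HTTP requests handled.\n"
        "# TYPE app_requests_total counter\n"
        "app_requests_total 42\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Serve on an ephemeral port and scrape ourselves once, as Prometheus would.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
text = urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics").read().decode()
print(text)
server.shutdown()
```

Prometheus would scrape such an endpoint on its configured interval; each scrape yields one sample per time series.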
4. Monitoring Tool Selection
4.1 Health Checks
Cloud platforms usually provide built‑in health‑check capabilities that can be configured directly.
4.2 Logs
The mature open‑source solution for log aggregation is the ELK stack (Elasticsearch, Logstash, Kibana).
4.3 Tracing
Common tracing tools include SkyWalking, Zipkin, Pinpoint, Elastic APM, and Cat. SkyWalking and Elastic APM use bytecode instrumentation, making them non‑intrusive; they support multiple languages (Java, Node.js, Go, etc.) and are well suited to cloud‑native environments.
4.4 Metric Monitoring
In traditional environments Zabbix is a common choice, but for cloud‑native setups Prometheus has become the de‑facto standard because of its strong community support, ease of deployment, pull‑based data collection, powerful PromQL query language, rich ecosystem, and high performance.
Note: Prometheus may lose data under heavy load, so it is not suitable for scenarios requiring 100% data accuracy.
5. Overview of the Prometheus Monitoring System
The architecture consists of the following components:
Prometheus Server – scrapes metrics and stores time‑series data.
Exporter – exposes metrics for Prometheus to scrape.
Pushgateway – accepts metrics pushed by short‑lived or batch jobs that cannot be scraped directly.
Alertmanager – handles alert routing and silencing.
Ad‑hoc query tools (web UI, API clients) – used for interactive data queries.
Prometheus pulls data via exporters or Pushgateway, stores it in a TSDB, evaluates alerting rules with PromQL, and visualizes data through Grafana or similar tools.
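A minimal prometheus.yml tying these components together might look like the following sketch; the job names, targets, and file paths are placeholders rather than values from this article:

```yaml
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often alerting rules are evaluated

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # hypothetical Alertmanager address

rule_files:
  - "rules/*.yml"             # hypothetical path to alerting rule files

scrape_configs:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["192.168.1.10:9100"]     # hypothetical host
  - job_name: "pushgateway"
    honor_labels: true        # keep labels set by the pushing jobs
    static_configs:
      - targets: ["pushgateway:9091"]      # hypothetical Pushgateway address
```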
6. Monitoring Targets
Monitoring objects are typically layered. The main categories are:
Host monitoring – CPU, memory, disk, availability, service status, network.
Container monitoring – CPU, memory, events.
Application service monitoring – HTTP endpoints, JVM metrics, thread pools, connection pools.
Third‑party API monitoring – response time, availability, success rate.
6.1 Host Monitoring
6.1.1 Why Host Monitoring Is Needed
Hosts are the foundation for all services; a host failure can cause widespread outages. Early detection prevents severe incidents.
6.1.2 How to Assess Resource Status
Three key aspects are considered:
Utilization – average busy time as a percentage.
Saturation – length of resource queues.
Errors – count of error events.
6.1.3 Resources to Monitor
CPU
Memory
Disk
Availability
Service status
Network
6.1.4 Monitoring Implementation
In Prometheus, host metrics are collected by node-exporter and queried with PromQL. CPU usage example (alert when >60%):
<code>100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60</code>
CPU saturation example (5‑minute load more than twice the core count):
<code>node_load5 > on (instance) 2 * count by (instance) (node_cpu_seconds_total{mode="idle"})</code>
Memory usage example (alert when >80%):
<code>100 - sum(node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"}) by (instance) / sum(node_memory_MemTotal_bytes{job="node-exporter"}) by (instance) * 100 > 80</code>
Disk usage trend prediction example (free space extrapolated 4 hours ahead from the last hour of data):
<code>predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!=""}[1h], 4*3600)</code>
Disk I/O utilization example (share of time the disk is busy):
<code>avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100</code>
6.2 Container Monitoring
6.2.1 Why Container Monitoring Is Needed
Containers are the runtime units of cloud‑native applications; exceeding CPU or memory limits can cause OOM and service disruption.
6.2.2 Key Metrics
CPU usage
Memory usage
Events (Kubernetes pod events)
6.2.3 Implementation
cAdvisor (integrated into kubelet) provides container metrics. Example CPU usage alert (>80%):
<code>sum(
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate * on(namespace,pod)
group_left(workload, workload_type) mixin_pod_workload
) by (workload, workload_type, namespace, pod)
/
sum(
kube_pod_container_resource_limits_cpu_cores * on(namespace,pod)
group_left(workload, workload_type) mixin_pod_workload
) by (workload, workload_type, namespace, pod) * 100 > 80</code>
Memory usage alert (>80%):
<code>sum(
container_memory_working_set_bytes * on(namespace,pod)
group_left(workload, workload_type) mixin_pod_workload
) by (namespace,pod)
/
sum(
kube_pod_container_resource_limits_memory_bytes * on(namespace,pod)
group_left(workload, workload_type) mixin_pod_workload
) by (namespace,pod) * 100 > 80</code>
Kubernetes events are monitored via kube-eventer to catch warning events such as OOM kills or pod failures.
6.3 Application Service Monitoring
6.3.1 Why It Matters
Application health directly impacts user experience and business continuity. Without proper monitoring, failures may go undetected, performance cannot be measured, and business KPIs remain invisible.
6.3.2 Typical Metrics
HTTP endpoints – availability, request count, latency, error rate.
JVM – GC count/duration, memory pool sizes, thread count, deadlocks.
Thread pools – active threads, queue size, task latency, rejected tasks.
Connection pools – total and active connections.
Business KPIs – e.g., PV, order volume.
6.3.3 How to Monitor
Use Prometheus exporters such as blackbox_exporter for HTTP health checks, and expose JVM metrics via the simpleclient_hotspot library. Example Maven dependency:
<code><dependency>
<groupId>io.prometheus</groupId>
<artifactId>simpleclient_hotspot</artifactId>
<version>0.6.0</version>
</dependency></code>
Initialize the exporter in application code:
<code>@PostConstruct
public void initJvmExporter() {
    io.prometheus.client.hotspot.DefaultExports.initialize();
}</code>
Expose the metrics endpoint (e.g., port 8081, path /prometheus-metrics) and annotate services with Prometheus scrape annotations for automatic discovery.
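As a sketch of the annotation-based discovery mentioned above, a Kubernetes Service might be annotated like this. Note that the prometheus.io/* annotations are a widely used convention, not built-in behavior: they only take effect if the Prometheus kubernetes_sd scrape job relabels on them. The service name here is hypothetical:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-app                          # hypothetical service name
  annotations:
    prometheus.io/scrape: "true"          # opt this service into scraping
    prometheus.io/port: "8081"            # metrics port from the example above
    prometheus.io/path: "/prometheus-metrics"
spec:
  selector:
    app: demo-app
  ports:
    - port: 8081
```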
6.4 Third‑Party API Monitoring
6.4.1 Why It Is Critical
External services affect your own business; monitoring their latency, availability, and success rate helps quickly locate upstream issues.
6.4.2 Metrics
Response time
Availability
Success rate
6.4.3 Implementation
Use blackbox_exporter to probe external endpoints. The collected data can be visualized alongside internal metrics for a unified view.
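A minimal blackbox_exporter setup might look like the following sketch; the module name, exporter address, and probed URL are illustrative. The relabeling moves each listed target into the probe's target parameter so the exporter, not the external endpoint, is actually scraped:

```yaml
# blackbox.yml (blackbox_exporter module definition)
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET

# prometheus.yml scrape job using the module above
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health   # hypothetical third-party endpoint
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target         # probed URL becomes ?target=...
      - source_labels: [__param_target]
        target_label: instance               # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115  # hypothetical exporter address
```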
7. Alert Notification
Define threshold values and severity levels to avoid noisy alerts. Prometheus evaluates alerting rules written in PromQL and forwards triggered alerts to Alertmanager, which can group, mute, or route them to various receivers (email, DingTalk, WeChat, webhook, etc.).
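As an illustration, an alerting rule file wiring the CPU expression from section 6.1.4 into Alertmanager delivery might look like this sketch (group name, alert name, and severity label are hypothetical):

```yaml
groups:
  - name: node-alerts
    rules:
      - alert: HighCPUUsage
        # CPU usage expression from section 6.1.4
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 5m                      # must hold for 5 minutes, which damps flapping
        labels:
          severity: warning          # routing key for Alertmanager
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage has stayed above 60% for 5 minutes."
```

Alertmanager then groups alerts by label, applies silences, and routes them to the configured receivers.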
8. Fault‑Handling Process
8.1 Fault Severity Levels
Faults are classified into four levels based on impact and recovery time, ranging from critical (level 1) to minor (level 4).
8.2 Fault Handling Procedure
Detect the fault, record details, and perform an initial assessment.
Investigate the root cause, determine scope, and assign a severity.
Follow the appropriate escalation path based on severity.
8.3 Fault Escalation Timeline
Specific reporting deadlines are defined for each severity level (e.g., immediate reporting for level 1, within 30 minutes for level 2, etc.).
8.4 Fault Handling Flowchart
Following a structured process ensures timely resolution and minimizes business impact.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.