Mastering Monitoring: From Basics to Prometheus in Cloud‑Native Environments
Monitoring is essential throughout an application's lifecycle. This article covers its purpose, modes, methods, and tool selection, then walks through a Prometheus implementation for host, container, service, and third-party monitoring, along with alerting strategies and fault-handling processes that keep operations stable and secure.
Monitoring is the most important part of operations and the entire product lifecycle. It aims to provide early warnings before failures, help locate issues during incidents, and supply data for post‑mortem analysis.
1. Purpose of Monitoring
Monitoring spans the whole application lifecycle—from design, development, deployment to decommissioning—and serves both technical and business stakeholders. Its goals are to provide 24/7 real‑time visibility, timely status feedback, platform stability, service reliability, and continuous business operation.
2. Monitoring Modes
Monitoring can be divided from top to bottom into:
Business monitoring
Application monitoring
Operating system monitoring
Business monitoring focuses on business metrics and alerts. Application monitoring includes probes (external checks) and introspection (internal metrics). OS monitoring tracks component usage such as CPU utilization and load.
3. Monitoring Methods
The main methods are:
Health checks – verify service liveness.
Log analysis – provide rich information for troubleshooting.
Tracing – present full request flow, latency, and service chain.
Metric collection – time‑series data that can be aggregated and visualized.
Health checks are usually provided by cloud platforms; logs are collected by dedicated log centers; tracing uses specialized solutions; metric monitoring relies on exporters that expose targets to Prometheus, which then aggregates, visualizes, and alerts on the data.
Note: The following section focuses on metric monitoring.
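To make the exporter model concrete, here is a minimal sketch of a /metrics endpoint in the Prometheus text exposition format, using only the Python standard library. The metric name (app_requests_total) and its value are illustrative, not taken from any real exporter:

```python
# Minimal sketch of a Prometheus-style /metrics endpoint (stdlib only).
# A real exporter would use a client library such as prometheus_client.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics() -> str:
    # Prometheus text exposition format: HELP/TYPE comments, then samples.
    return (
        "# HELP app_requests_total Total HTTP requests handled.\n"
        "# TYPE app_requests_total counter\n"
        "app_requests_total 42\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Serve on an ephemeral port and scrape ourselves once, as Prometheus would.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
text = urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics").read().decode()
print(text)
server.shutdown()
```

Prometheus would scrape such an endpoint on its configured interval; each scrape yields one sample per time series.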
4. Monitoring Tool Selection
4.1 Health Checks
Cloud platforms usually provide built‑in health‑check capabilities that can be configured directly.
4.2 Logs
The mature open‑source solution for log aggregation is the ELK stack (Elasticsearch, Logstash, Kibana).
4.3 Tracing
Common tracing tools include SkyWalking, Zipkin, Pinpoint, Elastic APM, and Cat. SkyWalking and Elastic APM use bytecode instrumentation, making them non‑intrusive; they support multiple languages (Java, Node.js, Go, etc.) and are well suited to cloud‑native environments.
4.4 Metric Monitoring
In traditional environments Zabbix is a common choice, but for cloud‑native setups Prometheus has become the de‑facto standard because of its strong community support, ease of deployment, pull‑based data collection, powerful PromQL query language, rich ecosystem, and high performance.
Note: Prometheus may lose data under heavy load, so it is not suitable for scenarios requiring 100% data accuracy.
5. Overview of the Prometheus Monitoring System
The architecture consists of the following components:
Prometheus Server – scrapes metrics and stores time‑series data.
Exporter – exposes metrics for Prometheus to scrape.
Pushgateway – accepts metrics pushed by short‑lived or batch jobs that cannot be scraped directly.
Alertmanager – handles alert routing and silencing.
Ad‑hoc query tools (web UI, API clients) – used for interactive data queries.
Prometheus pulls data via exporters or Pushgateway, stores it in a TSDB, evaluates alerting rules with PromQL, and visualizes data through Grafana or similar tools.
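A minimal prometheus.yml tying these components together might look like the following sketch; the job names, targets, and file paths are placeholders rather than values from this article:

```yaml
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often alerting rules are evaluated

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # hypothetical Alertmanager address

rule_files:
  - "rules/*.yml"             # hypothetical path to alerting rule files

scrape_configs:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["192.168.1.10:9100"]     # hypothetical host
  - job_name: "pushgateway"
    honor_labels: true        # keep labels set by the pushing jobs
    static_configs:
      - targets: ["pushgateway:9091"]      # hypothetical Pushgateway address
```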
6. Monitoring Targets
Monitoring objects are typically layered. The main categories are:
Host monitoring – CPU, memory, disk, availability, service status, network.
Container monitoring – CPU, memory, events.
Application service monitoring – HTTP endpoints, JVM metrics, thread pools, connection pools.
Third‑party API monitoring – response time, availability, success rate.
6.1 Host Monitoring
6.1.1 Why Host Monitoring Is Needed
Hosts are the foundation for all services; a host failure can cause widespread outages. Early detection prevents severe incidents.
6.1.2 How to Assess Resource Status
Three key aspects are considered:
Utilization – average busy time as a percentage.
Saturation – length of resource queues.
Errors – count of error events.
6.1.3 Resources to Monitor
CPU
Memory
Disk
Availability
Service status
Network
6.1.4 Monitoring Implementation
In Prometheus, host metrics are collected by node-exporter and queried with PromQL. CPU usage example (alert when >60%):
<code>100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60</code>
CPU saturation example (5‑minute load more than twice the core count):
<code>node_load5 > on (instance) 2 * count by (instance) (node_cpu_seconds_total{mode="idle"})</code>
Memory usage example (alert when >80%):
<code>100 - sum(node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"}) by (instance) / sum(node_memory_MemTotal_bytes{job="node-exporter"}) by (instance) * 100 > 80</code>
Disk usage trend prediction example (free space extrapolated 4 hours ahead from the last hour of data):
<code>predict_linear(node_filesystem_free_bytes{job="node-exporter",mountpoint!=""}[1h], 4*3600)</code>
Disk I/O utilization example (share of time the disk is busy):
<code>avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100</code>
6.2 Container Monitoring
6.2.1 Why Container Monitoring Is Needed
Containers are the runtime units of cloud‑native applications; exceeding CPU or memory limits can cause OOM and service disruption.
6.2.2 Key Metrics
CPU usage
Memory usage
Events (Kubernetes pod events)
6.2.3 Implementation
cAdvisor (integrated into kubelet) provides container metrics. Example CPU usage alert (>80%):
<code>sum(
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate * on(namespace,pod)
group_left(workload, workload_type) mixin_pod_workload
) by (workload, workload_type, namespace, pod)
/
sum(
kube_pod_container_resource_limits_cpu_cores * on(namespace,pod)
group_left(workload, workload_type) mixin_pod_workload
) by (workload, workload_type, namespace, pod) * 100 > 80</code>
Memory usage alert (>80%):
<code>sum(
container_memory_working_set_bytes * on(namespace,pod)
group_left(workload, workload_type) mixin_pod_workload
) by (namespace,pod)
/
sum(
kube_pod_container_resource_limits_memory_bytes * on(namespace,pod)
group_left(workload, workload_type) mixin_pod_workload
) by (namespace,pod) * 100 > 80</code>
Kubernetes events are monitored via kube-eventer to catch warning events such as OOM kills or pod failures.
6.3 Application Service Monitoring
6.3.1 Why It Matters
Application health directly impacts user experience and business continuity. Without proper monitoring, failures may go undetected, performance cannot be measured, and business KPIs remain invisible.
6.3.2 Typical Metrics
HTTP endpoints – availability, request count, latency, error rate.
JVM – GC count/duration, memory pool sizes, thread count, deadlocks.
Thread pools – active threads, queue size, task latency, rejected tasks.
Connection pools – total and active connections.
Business KPIs – e.g., PV, order volume.
6.3.3 How to Monitor
Use Prometheus exporters such as blackbox_exporter for HTTP health checks, and expose JVM metrics via the simpleclient_hotspot library. Example Maven dependency:
<code><dependency>
<groupId>io.prometheus</groupId>
<artifactId>simpleclient_hotspot</artifactId>
<version>0.6.0</version>
</dependency></code>
Initialize the exporter in application code:
<code>@PostConstruct
public void initJvmExporter() {
    io.prometheus.client.hotspot.DefaultExports.initialize();
}</code>
Expose the metrics endpoint (e.g., port 8081, path /prometheus-metrics) and annotate services with Prometheus scrape annotations for automatic discovery.
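As a sketch of the annotation-based discovery mentioned above, a Kubernetes Service might be annotated like this. Note that the prometheus.io/* annotations are a widely used convention, not built-in behavior: they only take effect if the Prometheus kubernetes_sd scrape job relabels on them. The service name here is hypothetical:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-app                          # hypothetical service name
  annotations:
    prometheus.io/scrape: "true"          # opt this service into scraping
    prometheus.io/port: "8081"            # metrics port from the example above
    prometheus.io/path: "/prometheus-metrics"
spec:
  selector:
    app: demo-app
  ports:
    - port: 8081
```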
6.4 Third‑Party API Monitoring
6.4.1 Why It Is Critical
External services affect your own business; monitoring their latency, availability, and success rate helps quickly locate upstream issues.
6.4.2 Metrics
Response time
Availability
Success rate
6.4.3 Implementation
Use blackbox_exporter to probe external endpoints. The collected data can be visualized alongside internal metrics for a unified view.
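A minimal blackbox_exporter setup might look like the following sketch; the module name, exporter address, and probed URL are illustrative. The relabeling moves each listed target into the probe's target parameter so the exporter, not the external endpoint, is actually scraped:

```yaml
# blackbox.yml (blackbox_exporter module definition)
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET

# prometheus.yml scrape job using the module above
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health   # hypothetical third-party endpoint
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target         # probed URL becomes ?target=...
      - source_labels: [__param_target]
        target_label: instance               # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115  # hypothetical exporter address
```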
7. Alert Notification
Define threshold values and severity levels to avoid noisy alerts. Prometheus evaluates alerting rules written in PromQL and forwards triggered alerts to Alertmanager, which can group, mute, or route them to various receivers (email, DingTalk, WeChat, webhook, etc.).
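As an illustration, an alerting rule file wiring the CPU expression from section 6.1.4 into Alertmanager delivery might look like this sketch (group name, alert name, and severity label are hypothetical):

```yaml
groups:
  - name: node-alerts
    rules:
      - alert: HighCPUUsage
        # CPU usage expression from section 6.1.4
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 5m                      # must hold for 5 minutes, which damps flapping
        labels:
          severity: warning          # routing key for Alertmanager
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage has stayed above 60% for 5 minutes."
```

Alertmanager then groups alerts by label, applies silences, and routes them to the configured receivers.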
8. Fault‑Handling Process
8.1 Fault Severity Levels
Faults are classified into four levels based on impact and recovery time, ranging from critical (level 1) to minor (level 4).
8.2 Fault Handling Procedure
Detect the fault, record details, and perform an initial assessment.
Investigate the root cause, determine scope, and assign a severity.
Follow the appropriate escalation path based on severity.
8.3 Fault Escalation Timeline
Specific reporting deadlines are defined for each severity level (e.g., immediate reporting for level 1, within 30 minutes for level 2, etc.).
8.4 Fault Handling Flowchart
Following a structured process ensures timely resolution and minimizes business impact.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.