
Mastering Prometheus: Build a Cloud‑Native Monitoring System from Scratch

This article explains how to design a Prometheus‑based cloud‑native monitoring solution, covering target selection, metric collection, server configuration, Grafana visualization, and alert management with practical examples and code snippets.

Efficient Ops

1. Monitoring Targets

Prometheus can monitor infrastructure, middleware, databases, containers, and SaaS services. Typical metrics and collection frequencies for each layer (IaaS, PaaS, SaaS) are outlined below.

IaaS layer: physical/virtual machines – server status, CPU, memory, disk I/O, network traffic, bandwidth, etc. (second‑ or minute‑level).

PaaS layer: databases – cluster status, connections, slow queries, locks, memory usage; middleware – status, connections, sessions; containers – runtime, pod status, resource usage (minute‑ or second‑level).

SaaS layer: application services – availability, request count, response time, HTTP status codes (second‑ or minute‑level).

2. Data Collection

Deploy the appropriate exporter or probe on each node. For example, node_exporter running on a server exposes metrics on port 9100, which Prometheus scrapes over HTTP.

Example command: curl localhost:9100/metrics
<code># TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 5.76669737e+06
# TYPE node_disk_info gauge
node_disk_info{device="dm-0",major="252",minor="0"} 1
# ... additional metric lines ...
</code>

The MySQL exporter provides metrics such as mysql_global_variables_thread_cache_size and mysql_global_variables_thread_stack.

<code># TYPE mysql_global_variables_thread_cache_size gauge
mysql_global_variables_thread_cache_size 9
# TYPE mysql_global_variables_thread_stack gauge
mysql_global_variables_thread_stack 262144
# ... additional metric lines ...
</code>
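
To collect these metrics, Prometheus needs a scrape job that points at the exporter, in the same style as the node job in section 3. A minimal sketch, assuming mysqld_exporter listens on its default port 9104; the job name and the address 192.168.0.2 are illustrative assumptions, not values from the original setup:

<code>- job_name: 'mysql'                    # hypothetical job name
  metrics_path: /metrics
  scheme: http
  scrape_interval: 30s
  static_configs:
    - targets: ['192.168.0.2:9104']    # hypothetical host; 9104 is mysqld_exporter's default port
</code>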

3. Prometheus Server Configuration

Example job for node metrics:

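<code>- job_name: 'node'
  metrics_path: /metrics      # HTTP path the exporter serves metrics on
  scheme: http
  scrape_interval: 30s        # how often to scrape each target
  scrape_timeout: 20s         # must not exceed scrape_interval
  file_sd_configs:            # discover targets from files instead of hard-coding them
    - files: ['/prom/targets/node.yml']
      refresh_interval: 30s   # how often the target file is re-read
</code>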

In the target file, labels adds key‑value pairs to every target in the group for later querying, and targets lists the actual endpoints (e.g., 192.168.0.1:9100).
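
As an illustration, the target file referenced by the job (/prom/targets/node.yml) could look like the sketch below. The second address and the env label are assumptions; the project label is chosen to match the project='xx' selector used in the alert rule in section 5.

<code># /prom/targets/node.yml (file_sd format: a list of target groups)
- targets:
    - '192.168.0.1:9100'
    - '192.168.0.2:9100'   # hypothetical second node
  labels:
    project: 'xx'          # attached to every series scraped from these targets
    env: 'prod'            # hypothetical label
</code>

Because file-based discovery re-reads the file when it changes and at every refresh_interval, targets can be added or removed without restarting Prometheus.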

4. Visualization with Grafana

Grafana connects to Prometheus as a data source and offers dashboards for resource overview, Kubernetes pod status, namespace statistics, and time‑range queries.
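
The data source can be added in the Grafana UI or declared in a provisioning file. A minimal sketch, assuming Grafana's file-based provisioning and a Prometheus server reachable at http://localhost:9090 (both the file path and the URL are assumptions):

<code># e.g. /etc/grafana/provisioning/datasources/prometheus.yml (assumed path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend proxies the queries
    url: http://localhost:9090    # assumed Prometheus address
    isDefault: true
</code>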

Figure: Grafana resource overview
Figure: Grafana pod overview

Grafana also manages user permissions through the concepts of organization (org), team, role, and user.

5. Alert Management

Alert rules define a condition, a duration, a severity, and annotations. The following example alerts on high memory usage:

<code>- alert: HighMemoryUsage
  expr: 100 - (node_memory_MemAvailable_bytes{project='xx'} / node_memory_MemTotal_bytes{project='xx'}) * 100 > 98
  for: 5m
  labels:
    severity: critical
    type: server
  annotations:
    summary: "{{ $labels.instance }} memory usage high!"
    description: "Memory usage exceeds 98% (current: {{ printf \"%.2f\" $value }}%)"
</code>
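
A disk‑usage alert can follow the same pattern. The sketch below is an assumption rather than the original rule: it uses node_exporter's filesystem metrics, and the 90% threshold, the fstype filter, and the severity value are illustrative.

<code>- alert: HighDiskUsage
  expr: 100 - (node_filesystem_avail_bytes{project='xx',fstype!~'tmpfs|overlay'} / node_filesystem_size_bytes{project='xx',fstype!~'tmpfs|overlay'}) * 100 > 90
  for: 5m
  labels:
    severity: warning
    type: server
  annotations:
    summary: "{{ $labels.instance }} {{ $labels.mountpoint }} disk usage high!"
    description: "Disk usage exceeds 90% (current: {{ printf \"%.2f\" $value }}%)"
</code>

Filesystem metrics carry a mountpoint label, so the summary can name the affected mount as well as the host.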

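An alert moves through three states: Inactive (the condition is not met), Pending (the condition is met but has not yet held for the duration set by for), and Firing (the condition has held for the full duration). Firing alerts can be sent via email, DingTalk, WeChat, SMS, or webhook.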
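
Delivery is handled by Alertmanager, which routes firing alerts to receivers. A minimal sketch of an alertmanager.yml, assuming a webhook endpoint (channels such as DingTalk are commonly bridged through a webhook); the receiver names, URLs, and timing values are hypothetical:

<code>route:
  receiver: 'default-webhook'        # fallback receiver
  group_by: ['alertname', 'project']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"        # matches the severity label set in the alert rule
      receiver: 'oncall-webhook'
receivers:
  - name: 'default-webhook'
    webhook_configs:
      - url: 'http://alert-gateway.example.internal/hook'          # hypothetical endpoint
  - name: 'oncall-webhook'
    webhook_configs:
      - url: 'http://alert-gateway.example.internal/hook/oncall'   # hypothetical endpoint
</code>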

Written by

Efficient Ops
