Mastering Prometheus: Build a Cloud‑Native Monitoring System from Scratch
This article explains how to design a Prometheus‑based cloud‑native monitoring solution, covering target selection, metric collection, server configuration, Grafana visualization, and alert management with practical examples and code snippets.
1. Monitoring Targets
Prometheus can monitor infrastructure, middleware, databases, containers, and SaaS services. Typical metrics and collection frequencies for each layer (IaaS, PaaS, SaaS) are outlined below.
IaaS layer: physical/virtual machines – server status, CPU, memory, disk I/O, network traffic, bandwidth, etc. (second‑ or minute‑level).
PaaS layer: databases – cluster status, connections, slow queries, locks, memory usage; middleware – status, connections, sessions; containers – runtime, pod status, resource usage (minute‑ or second‑level).
SaaS layer: application services – availability, request count, response time, HTTP status codes (second‑ or minute‑level).
2. Data Collection
Deploy appropriate exporters or probes on each node. Example: node_exporter on servers exposes metrics on port 9100, which Prometheus scrapes via HTTP.
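To make the scrape step concrete, here is a minimal Python sketch of reading the text format an exporter returns. The names parse_metrics and SAMPLE_TEXT are hypothetical and used only for illustration. In practice the text would come from an HTTP GET against the exporter (for example port 9100 for node_exporter); the sketch parses a pasted sample so it runs without a live node. It is a simplification: it skips comment lines, does not handle optional timestamps, and does not unescape label values.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair inside the braces: name="value"
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse sample lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # simplified parser: ignore lines it does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Sample text in the format node_exporter exposes (illustrative values).
SAMPLE_TEXT = """\
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 5.76669737e+06
# TYPE node_disk_info gauge
node_disk_info{device="dm-0",major="252",minor="0"} 1
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

Prometheus performs this parsing itself on every scrape; the sketch is only meant to show that each line carries a metric name, a set of labels, and one numeric value.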
Example command: curl localhost:9100/metrics
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 5.76669737e+06
# TYPE node_disk_info gauge
node_disk_info{device="dm-0",major="252",minor="0"} 1
# ... additional metric lines ...
The MySQL exporter provides metrics such as mysql_global_variables_thread_cache_size and mysql_global_variables_thread_stack.
# TYPE mysql_global_variables_thread_cache_size gauge
mysql_global_variables_thread_cache_size 9
# TYPE mysql_global_variables_thread_stack gauge
mysql_global_variables_thread_stack 262144
# ... additional metric lines ...
3. Prometheus Server Configuration
Example job for node metrics:
- job_name: 'node'
  metrics_path: /metrics
  scheme: http
  scrape_interval: 30s
  scrape_timeout: 20s
  file_sd_configs:
    - files: ['/prom/targets/node.yml']
      refresh_interval: 30s
The job reads its targets from the file listed under file_sd_configs. In that file, targets list the actual endpoints (e.g., 192.168.0.1:9100), and labels attach key‑value pairs to them for later querying.
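The article does not show the contents of the target file, so the following Python sketch illustrates one plausible way to generate it. The function name render_file_sd is hypothetical, and the project label value is borrowed from the alert expression later in the article. The output follows the targets/labels layout that file-based service discovery reads; writing it to the path referenced by the job is left to the caller.

```python
import json


def render_file_sd(groups):
    """Render target groups as YAML text in the file_sd layout."""
    lines = []
    for group in groups:
        lines.append("- targets:")
        for target in group["targets"]:
            # json.dumps yields a double-quoted string, which is also valid YAML
            lines.append(f"    - {json.dumps(target)}")
        labels = group.get("labels", {})
        if labels:
            lines.append("  labels:")
            for key, value in sorted(labels.items()):
                lines.append(f"    {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


groups = [
    {
        "targets": ["192.168.0.1:9100"],  # endpoint mentioned in the article
        "labels": {"project": "xx"},      # assumed label, matches the alert example
    },
]

print(render_file_sd(groups), end="")
```

Because the file is re-read on the configured refresh interval, regenerating it is enough to add or remove targets without restarting Prometheus.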
4. Visualization with Grafana
Grafana connects to Prometheus as a data source and offers dashboards for resource overview, Kubernetes pod status, namespace statistics, and time‑range queries.
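A time‑range query in a dashboard panel typically ends up as a range query against the Prometheus HTTP API. The Python sketch below only builds such a request URL, so it runs without a server. The helper name query_range_url, the address http://localhost:9090, and the timestamps are assumptions made for illustration; the /api/v1/query_range endpoint and its query, start, end, and step parameters are part of the Prometheus API.

```python
from urllib.parse import urlencode


def query_range_url(base_url, promql, start, end, step):
    """Build the URL for a range query against the Prometheus HTTP API."""
    params = urlencode({
        "query": promql,
        "start": start,  # Unix timestamp in seconds
        "end": end,      # Unix timestamp in seconds
        "step": step,    # resolution between returned points, e.g. "30s"
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


url = query_range_url(
    "http://localhost:9090",  # assumed Prometheus address
    'node_memory_MemAvailable_bytes{project="xx"}',
    start=1700000000,
    end=1700003600,
    step="30s",
)
print(url)
```

Fetching this URL from a running server returns JSON with one series per label set, which is the kind of data a graph panel plots.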
Grafana also manages user permissions through concepts of org, team, role, and user.
5. Alert Management
Alert rules define a condition, a duration, severity labels, and annotations. An example rule for high memory usage follows; a disk usage rule follows the same pattern.
- alert: HighMemoryUsage
  expr: 100 - (node_memory_MemAvailable_bytes{project='xx'} / node_memory_MemTotal_bytes{project='xx'}) * 100 > 98
  for: 5m
  labels:
    severity: critical
    type: server
  annotations:
    summary: "{{ $labels.instance }} memory usage high!"
    description: "Memory usage exceeds 98% (current: {{ printf \"%.2f\" $value }}%)"
An alert moves through three states: Inactive, Pending (the condition is true but has not yet held for the configured duration), and Firing. Alerts can be sent via email, DingTalk, WeChat, SMS, or webhook.
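The interplay between the alert expression, the 5m duration, and the three alert states can be sketched in Python as follows. This is a simplified model, not Prometheus's implementation: the function names and the evaluation timestamps are hypothetical, and real rule evaluation also tracks each label set separately.

```python
def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """Same arithmetic as the alert expression: 100 - available / total * 100."""
    return 100 - (mem_available_bytes / mem_total_bytes) * 100


def alert_states(evaluations, threshold=98.0, for_seconds=300):
    """Return the alert state after each (timestamp, usage_percent) evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for timestamp, usage in evaluations:
        if usage > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            # the alert fires only once the condition has held for the full duration
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # condition cleared, so the timer resets
            states.append("inactive")
    return states


# 1 GiB available out of 64 GiB total
print(memory_usage_percent(1 * 2**30, 64 * 2**30))  # 98.4375

evaluations = [
    (0, 97.0),    # below threshold
    (60, 98.5),   # condition becomes true
    (180, 99.0),  # true for 120 s
    (360, 99.2),  # true for 300 s
    (420, 95.0),  # condition clears
]
print(alert_states(evaluations))
# ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

The pending phase is what keeps a short memory spike from triggering a notification: only a condition that stays true for the whole duration reaches the firing state.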
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
