Mastering Prometheus: Build a Cloud‑Native Monitoring System from Scratch

This article explains how to design a Prometheus‑based cloud‑native monitoring solution, covering target selection, metric collection, server configuration, Grafana visualization, and alert management with practical examples and code snippets.

Efficient Ops

Aug 6, 2023

1. Monitoring Targets

Prometheus can monitor infrastructure, middleware, databases, containers, and SaaS services. Typical metrics for each layer (IaaS, PaaS, SaaS) and their collection granularity are outlined below.

IaaS layer: physical/virtual machines – server status, CPU, memory, disk I/O, network traffic, bandwidth, etc. (second‑ or minute‑level).

PaaS layer: databases – cluster status, connections, slow queries, locks, memory usage; middleware – status, connections, sessions; containers – runtime, pod status, resource usage (second‑ or minute‑level).

SaaS layer: application services – availability, request count, response time, HTTP status codes (second‑ or minute‑level).

2. Data Collection

Deploy appropriate exporters or probes on each node. Example: node_exporter on servers exposes metrics on port 9100, which Prometheus scrapes via HTTP.

Example command: curl localhost:9100/metrics

# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 5.76669737e+06
# TYPE node_disk_info gauge
node_disk_info{device="dm-0",major="252",minor="0"} 1
# ... additional metric lines ...

The MySQL exporter exposes metrics in the same format, such as mysql_global_variables_thread_cache_size and mysql_global_variables_thread_stack:

# TYPE mysql_global_variables_thread_cache_size gauge
mysql_global_variables_thread_cache_size 9
# TYPE mysql_global_variables_thread_stack gauge
mysql_global_variables_thread_stack 262144
# ... additional metric lines ...

3. Prometheus Server Configuration

Example job for node metrics:

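- job_name: 'node'
  metrics_path: /metrics       # HTTP path on which the exporter serves metrics
  scheme: http
  scrape_interval: 30s         # how often each target is scraped
  scrape_timeout: 20s          # must not exceed scrape_interval
  file_sd_configs:
    - files: ['/prom/targets/node.yml']
      refresh_interval: 30s    # how often the targets file is re-read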

The file referenced by file_sd_configs lists the actual endpoints under targets (e.g., 192.168.0.1:9100) and attaches labels to them, key‑value pairs that can be used later to filter queries and alerts.
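
A targets file for the job above might look like the following sketch. The second address and the label values are illustrative assumptions; project: 'xx' simply mirrors the placeholder label used in the alert expression in section 5.

```yaml
# /prom/targets/node.yml (the file named in file_sd_configs above)
- targets:
    - '192.168.0.1:9100'
    - '192.168.0.2:9100'   # hypothetical second node
  labels:
    project: 'xx'          # placeholder value
    env: 'prod'            # hypothetical label
```

Prometheus attaches these labels to every series scraped from the listed targets, so queries can select them with matchers such as {project='xx'}.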

4. Visualization with Grafana

Grafana connects to Prometheus as a data source and offers dashboards for resource overview, Kubernetes pod status, namespace statistics, and time‑range queries.

Grafana also manages user permissions through concepts of org, team, role, and user.
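
The Prometheus data source can be added through the Grafana UI or declared in a provisioning file. The following is a minimal sketch, assuming Prometheus listens on its default port 9090 on the same host as Grafana; the file path and URL are assumptions to adapt to your environment.

```yaml
# e.g. /etc/grafana/provisioning/datasources/prometheus.yml (path is an assumption)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                # Grafana's backend issues the queries
    url: http://localhost:9090   # assumed Prometheus address
    isDefault: true
```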

5. Alert Management

Alert rules define a condition (expr), a duration the condition must hold (for), severity labels, and annotations. The example below alerts on high memory usage.

- alert: HighMemoryUsage
  expr: 100 - (node_memory_MemAvailable_bytes{project='xx'} / node_memory_MemTotal_bytes{project='xx'}) * 100 > 98
  for: 5m
  labels:
    severity: critical
    type: server
  annotations:
    summary: "{{ $labels.instance }} memory usage high!"
    description: "Memory usage exceeds 98% (current: {{ printf \"%.2f\" $value }}%)"
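
A disk usage rule can follow the same pattern. The sketch below is not taken from the original article: the group name, the 90% threshold, the 10m duration, and the fstype filter are all assumptions. It is shown as a complete rule file, with the groups and rules wrapper that Prometheus expects around individual rules.

```yaml
groups:
  - name: node-alerts            # hypothetical group name
    rules:
      - alert: HighDiskUsage
        # Percentage of filesystem space used, skipping pseudo filesystems (assumed filter)
        expr: 100 - (node_filesystem_avail_bytes{project='xx',fstype!~'tmpfs|overlay'} / node_filesystem_size_bytes{project='xx',fstype!~'tmpfs|overlay'}) * 100 > 90
        for: 10m                 # assumed duration
        labels:
          severity: warning
          type: server
        annotations:
          summary: "{{ $labels.instance }} disk usage high on {{ $labels.mountpoint }}"
          description: "Disk usage exceeds 90% (current: {{ printf \"%.2f\" $value }}%)"
```

Here the mountpoint label is meaningful because the node_filesystem_* series carry one per mounted filesystem.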

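An alert moves through three states: Inactive (condition not met), Pending (condition met, but not yet for the full "for" duration), and Firing. Firing alerts are forwarded to Alertmanager, which can send notifications via email, DingTalk, WeChat, SMS, or webhook.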
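
Notification routing lives in the Alertmanager configuration. The following is a minimal sketch, assuming one email receiver as the default and a webhook receiver for critical alerts; every host name, address, URL, and receiver name is a placeholder, and a real SMTP server will usually also need credentials.

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'     # placeholder SMTP server
  smtp_from: 'alertmanager@example.com'      # placeholder sender

route:
  receiver: ops-email                        # default receiver
  group_by: ['alertname', 'project']
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - matchers:
        - 'severity="critical"'              # matches the severity label set in the rule
      receiver: ops-webhook

receivers:
  - name: ops-email
    email_configs:
      - to: 'ops@example.com'                # placeholder recipient
  - name: ops-webhook
    webhook_configs:
      - url: 'http://127.0.0.1:8060/alert'   # hypothetical endpoint, e.g. a chat bridge
```

Channels without a native Alertmanager integration are commonly reached through such a webhook bridge.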
