Building a Basic Monitoring System from Zero: How to View CPU, Memory, Disk, and Network
This article walks you through setting up a complete monitoring stack with Prometheus, node_exporter, Grafana and Alertmanager, explains how to interpret the four core dimensions—CPU, memory, disk and network—using a structured troubleshooting workflow, and provides real‑world case studies, scripts and best‑practice recommendations.
Overview
The article explains why a monitoring system is the baseline for reliable operations and introduces a four‑dimensional (CPU, memory, disk, network) analysis framework based on the USE method (Utilization, Saturation, Errors) described by Brendan Gregg.
1. Prerequisites and Environment
Required components and versions:
Ubuntu 24.04 LTS (2 CPU / 4 GB RAM minimum)
Prometheus 3.2.1 (2 CPU / 4 GB SSD recommended)
node_exporter 1.9.0 (tiny footprint, runs as a non‑root user)
Grafana 11.5.2 (1 CPU / 2 GB RAM)
Alertmanager 0.28.1 (512 MB RAM)
2. Installation
Deploy node_exporter on every host:
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter 1.9.0
After=network-online.target
Wants=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.cpu \
  --collector.meminfo \
  --collector.diskstats \
  --collector.netdev \
  --collector.filesystem \
  --collector.loadavg \
  --collector.netstat \
  --collector.vmstat \
  --web.listen-address=:9100
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Enable and verify:
systemctl daemon-reload
systemctl enable --now node_exporter
curl -s http://localhost:9100/metrics | head -20
Deploy Prometheus:
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["192.168.1.10:9100", "192.168.1.11:9100", "192.168.1.12:9100"]
        labels:
          env: "production"
          dc: "dc-01"
Systemd unit:
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus 3.x
After=network-online.target
Wants=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.listen-address=:9090 \
  --web.enable-lifecycle
Restart=always
[Install]
WantedBy=multi-user.target
Deploy Alertmanager with routing and inhibition rules (see article for full YAML).
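The curl check in the node_exporter step only shows the first 20 lines of output. As a hedged follow-up, the Python sketch below parses text in the Prometheus exposition format and reports which expected metric names are absent. The function names and the EXPECTED list are illustrative assumptions, not part of the original article, and SAMPLE stands in for real /metrics output.

```python
import re

# Assumed examples: one well-known metric per collector of interest. Adjust as needed.
EXPECTED = [
    "node_cpu_seconds_total",
    "node_memory_MemAvailable_bytes",
    "node_filesystem_avail_bytes",
    "node_network_receive_bytes_total",
    "node_load1",
]

# A metric name is the leading identifier on a sample line.
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected=EXPECTED):
    """List expected metric names that do not appear in the scraped text."""
    present = metric_names(exposition_text)
    return [name for name in expected if name not in present]


# Hypothetical sample standing in for the output of the curl command.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_memory_MemAvailable_bytes 3.1e+09
"""

if __name__ == "__main__":
    print(missing_metrics(SAMPLE))
```

Against a live host, the text could be fetched with urllib.request.urlopen("http://localhost:9100/metrics").read().decode() and passed to missing_metrics.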
3. Diagnosis Workflow
A fixed “four‑dimensional coordinate system” guides the investigation:
Fault detection → Four‑dimensional localization → Metric verification → Root‑cause confirmation → Fix verification (observe for 15 min)
The first‑round check uses the Grafana “Node Exporter Full” dashboard to spot which dimension is abnormal. If Grafana is unavailable, a one‑liner gathers a quick snapshot:
echo "=== CPU ===" && uptime && \
echo "=== MEM ===" && free -h && \
echo "=== DISK ===" && df -h && \
echo "=== NET ===" && ss -s
Second‑round drill‑down commands per dimension:
CPU:
mpstat -P ALL 1 5
ps aux --sort=-%cpu | head -15
ps aux | awk '$8 ~ /D/ {print}'
PromQL: rate(node_cpu_seconds_total{mode!="idle"}[5m])
Memory:
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|Slab"
ps aux --sort=-%mem | head -15
slabtop -o | head -20
dmesg -T | grep -i "oom\|killed"
Disk:
df -hi
iostat -xdm 1 5
find / -xdev -type f -size +500M -exec ls -lh {} \;
lsof +D /var/log
Network:
ss -s
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
ip -s link show eth0
ethtool -S eth0 | grep -i "err\|drop\|miss"
The article also provides a “root‑cause matrix” that maps observed symptoms to likely causes and suggested remediation.
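To make the CPU PromQL expression above less opaque: node_cpu_seconds_total is a set of ever-increasing counters, and rate() turns the difference between samples into a per-second value. The Python sketch below applies the same idea to two samples of the aggregate cpu line from /proc/stat and returns the non-idle share of CPU time. It illustrates the arithmetic only, not how Prometheus implements rate(); the function names and sample values are assumptions.

```python
MODES = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Guest time is already counted inside user/nice, so only the
    # first eight counters are used to avoid double counting.
    return dict(zip(MODES, (int(value) for value in fields[1:9])))


def non_idle_fraction(before, after):
    """Share of CPU time spent in any mode other than idle between two samples."""
    deltas = {mode: after[mode] - before[mode] for mode in before}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no time elapsed, or counters were reset
    return 1.0 - deltas["idle"] / total


if __name__ == "__main__":
    # Hypothetical samples taken a few seconds apart.
    first = parse_cpu_line("cpu 100 0 50 800 50 0 0 0 0 0")
    second = parse_cpu_line("cpu 200 0 100 1600 100 0 0 0 0 0")
    print(f"non-idle: {non_idle_fraction(first, second):.1%}")
```

As with mode!="idle" in the PromQL expression, iowait counts as non-idle here, so a host that is mostly waiting on disk will still show elevated values.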
4. Real‑World Cases
Case 1 – CPU spikes to 100 % while QPS stays flat: Alertmanager fired HighCpuUsage and Grafana showed user‑mode CPU above 95 %. ps aux --sort=-%cpu | head -5 revealed a Python worker consuming 400 % CPU. The root cause was an O(n³) nested loop in report generation; killing the process and fixing the loop brought the load average down to 0.8.
Case 2 – OOM killer terminates MySQL at 3 AM: Memory usage rose from 70 % to 99 % before the process disappeared. dmesg -T | grep -A5 "oom-kill" showed that MySQL had been killed. A nightly data‑sync script was loading an entire table into memory; rewriting it to stream the data and setting oom_score_adj=-1000 for MySQL prevented further kills.
Case 3 – /var partition reports 100 % usage but du shows only 12 GB: lsof +L1 | grep /var exposed an old rsyslog log file (8.2 GB) that was still held open after logrotate had removed it. Restarting rsyslog released the space, and adding copytruncate to the logrotate config prevented a recurrence.
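The gap between df and du in Case 3 comes from files that no longer have a directory entry but are still open, so their blocks stay allocated. As a rough alternative to reading lsof +L1 output by hand, the Python sketch below walks /proc/PID/fd for one process and lists such files with their sizes. It assumes a Linux /proc filesystem and sufficient privileges to inspect the target process; the function name is hypothetical.

```python
import os

DELETED_SUFFIX = " (deleted)"


def deleted_open_files(pid):
    """Return (path, size_bytes) for files the process holds open after deletion."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for fd in os.listdir(fd_dir):
        link = os.path.join(fd_dir, fd)
        try:
            target = os.readlink(link)
            if target.endswith(DELETED_SUFFIX):
                # stat() on the fd entry resolves to the still-open inode,
                # so the size is available even though the name is gone.
                size = os.stat(link).st_size
                found.append((target[: -len(DELETED_SUFFIX)], size))
        except OSError:
            continue  # the descriptor was closed while we were scanning
    return found


if __name__ == "__main__":
    for path, size in deleted_open_files(os.getpid()):
        print(f"{size:>12} {path}")
```

Summing the sizes across all processes on a host gives an estimate of how much space a restart of the offending services would release.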
Case 4 – Intermittent intra‑network timeouts despite low bandwidth: The network panel showed a high TCP retransmit rate. ethtool -S eth0 revealed thousands of CRC errors, and the interface had negotiated 100 Mb/s half‑duplex. Re‑cabling restored 1 Gb/s full‑duplex and eliminated the timeouts.
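A link that silently falls back to a lower speed or half‑duplex, as in Case 4, can be caught before it causes timeouts. The Python sketch below reads the speed and duplex files that Linux exposes under /sys/class/net for each interface and flags a link that is below an expected speed or not full‑duplex. The function names and the 1000 Mb/s default are assumptions for illustration; virtual interfaces often do not report these values, which the sketch treats as unknown rather than degraded.

```python
from pathlib import Path


def link_status(iface, sysfs_root="/sys/class/net"):
    """Read negotiated speed (Mb/s) and duplex for an interface from sysfs."""
    base = Path(sysfs_root) / iface

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            return None  # missing file, or the kernel refuses while the link is down

    speed_raw = read("speed")
    try:
        speed = int(speed_raw) if speed_raw is not None else None
    except ValueError:
        speed = None
    return {"speed_mbps": speed, "duplex": read("duplex")}


def is_degraded(status, expected_mbps=1000):
    """True if the link is slower than expected or not full-duplex."""
    speed, duplex = status["speed_mbps"], status["duplex"]
    if speed is None or speed < 0 or duplex in (None, "unknown"):
        return False  # unknown state: do not flag, investigate separately
    return speed < expected_mbps or duplex != "full"


if __name__ == "__main__":
    status = link_status("eth0")
    print(status, "degraded" if is_degraded(status) else "ok or unknown")
```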
5. Best Practices and Pitfalls
Separate alert severity (warning vs. critical) and always configure a "for" clause so that brief spikes do not trigger notifications.
Keep scrape_interval at 15 s or longer; scraping more often mainly adds storage and query load.
Design the label taxonomy early (env, dc, team, service); changing it later means rewriting dashboards and alert rules.
Monitor the monitor: track up{job="node"}, scrape_duration_seconds and prometheus_tsdb_head_series, and alert on target down or high scrape latency.
Do not enable every node_exporter collector; limit them to the ones you need to reduce noise.
Use a "for" clause and hysteresis in disk alerts to avoid flapping.
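The combination of a "for" duration and hysteresis recommended above can be hard to picture. The Python sketch below models the logic as a small state machine: the alert fires only after the value has stayed at or above a high threshold for a minimum time, and clears only once the value drops below a lower threshold. The class name, the 90/85 thresholds and the 300 s duration are illustrative assumptions; this is a model of the behaviour, not how Prometheus or Alertmanager implement it.

```python
class HysteresisAlert:
    """Fire at or above `high` for `for_seconds`; clear only below `low`."""

    def __init__(self, high=90.0, low=85.0, for_seconds=300):
        self.high = high
        self.low = low
        self.for_seconds = for_seconds
        self.firing = False
        self.pending_since = None  # time the value first crossed `high`

    def update(self, value, now):
        """Feed one sample (value, timestamp in seconds); return the firing state."""
        if self.firing:
            # Hysteresis: values between low and high keep the alert firing.
            if value < self.low:
                self.firing = False
                self.pending_since = None
            return self.firing
        if value >= self.high:
            if self.pending_since is None:
                self.pending_since = now
            if now - self.pending_since >= self.for_seconds:
                self.firing = True
        else:
            self.pending_since = None  # a dip resets the pending timer
        return self.firing


if __name__ == "__main__":
    alert = HysteresisAlert()
    samples = [(91, 0), (92, 150), (91, 300), (88, 360), (84, 420)]
    for usage, ts in samples:
        print(ts, usage, alert.update(usage, ts))
```

With these samples the alert starts firing at t=300, stays on at 88 % because that is still above the lower threshold, and clears at 84 %.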
6. Self‑Monitoring of the Stack
Key health metrics:
up{job="node"} # target reachability
scrape_duration_seconds{job="node"} # scrape latency
prometheus_tsdb_head_series # series cardinality
prometheus_rule_evaluation_failures_total # rule errors
prometheus_notifications_dropped_total # alert loss
Sample alert rules (meta‑monitoring) are provided in the article, e.g. TargetDown, PrometheusHighMemory, SlowScrape, and CardinalityExplosion.
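The up metric listed above can also be checked from a script, for example from a cron job that runs independently of Alertmanager. The Python sketch below takes the JSON returned by a Prometheus instant query for up{job="node"} and lists the instances whose value is 0. The function name and the embedded sample response are assumptions for illustration; the response layout follows the Prometheus HTTP API for instant vectors.

```python
import json


def down_targets(response):
    """Return instance labels whose `up` sample is 0 in an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("query failed: %s" % response.get("error", "unknown error"))
    down = []
    for sample in response["data"]["result"]:
        _timestamp, value = sample["value"]  # the value arrives as a string
        if float(value) == 0.0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return sorted(down)


# Hypothetical response for the three hosts from the scrape config.
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"job": "node", "instance": "192.168.1.10:9100"}, "value": [1700000000.0, "1"]},
      {"metric": {"job": "node", "instance": "192.168.1.11:9100"}, "value": [1700000000.0, "0"]},
      {"metric": {"job": "node", "instance": "192.168.1.12:9100"}, "value": [1700000000.0, "1"]}
    ]
  }
}
""")

if __name__ == "__main__":
    print(down_targets(SAMPLE))
```

A live response can be obtained from the /api/v1/query endpoint on the Prometheus port (9090 in the unit file above) with the query parameter set to up{job="node"}.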
7. Summary and Further Reading
The guide reinforces that monitoring is not an after‑thought but a systematic four‑dimensional troubleshooting tool, stresses the importance of well‑designed alerts, label hygiene, and closing the loop by monitoring the monitoring stack itself. References include the official Prometheus docs, Google SRE book, Brendan Gregg’s USE method, and community alert rule collections.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
