Mastering Monitoring: From Concepts to Prometheus in Operations

This article explains monitoring fundamentals, distinguishes black‑box and white‑box approaches, outlines key metrics and their aggregation, and provides a comprehensive guide to Prometheus architecture, data model, query language, and practical examples for CPU, memory, and disk usage monitoring.

MaGe Linux Operations

Jan 25, 2024

1. Monitoring Concepts & Misconceptions

Monitoring is the core tool for managing infrastructure and business; it must be built and deployed together with applications. Without monitoring you cannot understand system state, diagnose failures, or obtain performance, cost, and status information.

Misconceptions:

Common mistakes are monitoring that is mechanical, inaccurate, static, or too infrequent, and monitoring that lacks automation or self-service.

2. Black‑Box & White‑Box Monitoring

1. Black‑Box Monitoring

Applications or hosts are observed from the outside, so visibility is limited to externally observable behavior. Checks evaluate whether the observed system responds as expected.

Examples:

1) Does the host respond to PING requests?

2) Is a specific TCP port open?

3) Does the application return correct data and status codes for specific HTTP requests?

4) Is a particular application process running on its host?
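As a minimal sketch of the second check above, the following Python function tests whether a TCP port accepts connections. The function name `tcp_port_open` and the one-second default timeout are illustrative choices, not part of any monitoring tool.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, which is all a
        # black-box port check needs; the connection is closed immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A check like this says nothing about why a port is closed, which is the limitation of black-box monitoring the text describes.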

2. White‑Box Monitoring

Systems expose internal state and performance data, providing powerful introspection that reveals health of internal components otherwise hard to determine.

1) Export logs : Export logs (e.g., HTTP server access logs) to monitor request rate, latency, and error percentage.

2) Structured event output : Similar to logging, but data is sent directly to processing systems for analysis and aggregation.

3) In-memory aggregation : Data is aggregated in the process's memory and then exposed on an endpoint or read by command-line tools (e.g., /metrics with Prometheus, the HAProxy stats page, varnishstat).
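To make in-memory aggregation concrete, here is a small Python sketch that counts requests in process memory and renders them as text in the style of a Prometheus /metrics page. The class `RequestStats` and the metric name `http_requests_total` are hypothetical; a real service would normally use an official client library rather than hand-rolled rendering.

```python
from collections import Counter


class RequestStats:
    """Hypothetical in-process aggregator for HTTP request counts."""

    def __init__(self) -> None:
        self._requests = Counter()  # (method, status) -> count

    def observe(self, method: str, status: int) -> None:
        self._requests[(method, str(status))] += 1

    def render(self) -> str:
        """Render the counters as text, one line per label combination."""
        lines = ["# TYPE http_requests_total counter"]
        for (method, status), count in sorted(self._requests.items()):
            lines.append(
                f'http_requests_total{{method="{method}",status="{status}"}} {count}'
            )
        return "\n".join(lines) + "\n"
```

The application only keeps running totals; the monitoring system reads them whenever it wants and derives rates itself.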

3. Metrics

Metrics collection can follow two approaches: pull (the monitoring system scrapes metrics from the monitored services) or push (the monitored services send metrics to the monitoring system).

Push VS Pull

What to measure:

Google's SRE book proposes four golden signals to monitor:

Latency: time required for a service request

Traffic: number of requests being sent

Errors: failure rate

Saturation: work queued and not yet processed

Brendan Gregg's USE method focuses on each resource (CPU, disk, network interface, etc.) and suggests monitoring:

Utilization: percentage of resource busy time

Saturation: work the resource cannot handle, usually queued

Errors: number of errors occurring

Tom Wilkie's RED method emphasizes service-level metrics rather than low-level system metrics, which makes it useful for gauging customer experience: a rising error rate, for example, is likely to affect users directly.

Rate: requests per second

Errors: number of failed requests per second

Duration: time spent on those requests
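The three service-level measurements above (request rate, failed requests per second, and time spent per request) can be sketched in a few lines of Python. The `Request` record, the choice of status >= 500 as "failed", and the use of a mean for time spent are assumptions made for illustration; production systems usually track latency percentiles instead of a mean.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time spent serving the request, in seconds


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarize one observation window as rate, errors, and time spent."""
    failed = [r for r in requests if r.status >= 500]
    return {
        "rate_per_s": len(requests) / window_s,
        "errors_per_s": len(failed) / window_s,
        "mean_duration_s": mean(r.duration_s for r in requests) if requests else 0.0,
    }
```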

4. Prometheus

1. Introduction & Architecture

Prometheus is an open‑source monitoring and alerting toolkit that stores collected metrics as time‑series data with timestamps and optional key‑value pair labels. It is widely used to monitor Kubernetes clusters.

2. Suitable & Unsuitable Scenarios

Suitable scenarios : Prometheus records any purely numeric time series. It fits both machine-centric monitoring and highly dynamic service-oriented architectures, and it excels at multi-dimensional data collection and querying. It is designed for reliability: each server is standalone, with no dependency on network storage or other remote services, so it stays usable for diagnosis during an outage.

Unsuitable scenarios : Cases requiring 100% accuracy, such as per-request billing, because the collected data is unlikely to be detailed and complete enough.

3. Data Model

Because monitoring data volume is huge, Prometheus uses time‑series storage (timestamp + value).

Prometheus local storage :

Prometheus’s local storage is called Prometheus TSDB. Its core design includes blocks and a Write‑Ahead Log (WAL). A block contains chunks, index, meta.json, and tombstones.

TSDB partitions stored data into blocks by time range. Block ranges grow in multiples of a configured base range: smaller blocks are periodically compacted into larger ones, which reduces storage and memory usage and keeps the index efficient.

Each block has a globally unique name generated via ULID, allowing easy sorting by creation time.

WAL (Write‑Ahead Logging) records data before it is persisted, ensuring durability. Prometheus uses WAL to prevent loss of in‑memory data before it is flushed to disk.
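The write-ahead idea can be illustrated with a toy Python store that appends every write to a log file, forces it to disk, and only then updates memory, so that state can be rebuilt after a crash. This is a teaching sketch only: the class `TinyWAL` and its JSON-lines format are invented here and bear no relation to the on-disk format Prometheus actually uses.

```python
import json
import os


class TinyWAL:
    """Toy key-value store that logs every write before applying it."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.memory: dict[str, float] = {}
        self._replay()
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self) -> None:
        # On startup, rebuild in-memory state from records logged earlier.
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.memory[record["series"]] = record["value"]

    def append(self, series: str, value: float) -> None:
        # 1) Write the record to the log and force it to disk...
        self._log.write(json.dumps({"series": series, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        # 2) ...and only then update the in-memory state.
        self.memory[series] = value

    def close(self) -> None:
        self._log.close()
```

Opening a second `TinyWAL` on the same path after a restart replays the log and recovers whatever had been appended.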

Prometheus data model :

Prometheus stores data as time‑series with labels, timestamps, and values.

Notation:

<metric_name>{<label_1>="<value_1>", ..., <label_N>="<value_N>"} <datapoint_numeric_value>

4. Metric Types

Counter: a cumulative value that only increases (or resets to zero on restart), e.g., the number of received packets.

Gauge: snapshot of a measurement that can go up or down (e.g., temperature, disk space, memory usage).

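Histogram: samples observations (e.g., request durations or response sizes) and counts them in configurable buckets, which makes it possible to compute percentiles and the share of requests within a given range.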
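The differences between the three types are easiest to see side by side. The Python classes below are a simplified sketch written for this article, not the API of any Prometheus client library; the histogram uses cumulative "less than or equal" buckets with an implicit final +Inf bucket.

```python
import bisect


class Counter:
    """Only goes up; a process restart would reset it to zero."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """A snapshot that can be set to any value, up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations into buckets defined by upper bounds."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)  # last is +Inf
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value.
        self.bucket_counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self) -> list[int]:
        """Per-bucket totals where each bucket includes all smaller ones."""
        total, out = 0, []
        for n in self.bucket_counts:
            total += n
            out.append(total)
        return out
```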

5. Metric Summarization & Aggregation

Single metrics have limited value; combining and visualizing multiple metrics often requires mathematical transformations such as count, sum, average, median, percentile, standard deviation, rate, etc.

Metric aggregation provides a unified view across multiple sources.
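As a worked example of the transformations listed above, the sketch below summarizes one metric reported by several sources using Python's standard library. The function names and the nearest-rank percentile definition are choices made for this illustration; monitoring systems may compute percentiles differently.

```python
import math
import statistics


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(values: list[float]) -> dict[str, float]:
    """Aggregate one metric collected from several sources."""
    return {
        "count": len(values),
        "sum": sum(values),
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "p90": percentile(values, 90),
        "stddev": statistics.pstdev(values),
    }
```

For hypothetical CPU percentages of 10, 20, 30, 40, and 100 from five hosts, the average is 40 while the median is 30, which shows how a single outlier pulls the average away from the typical host.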

6. Node Exporter Deployment

Prometheus uses exporter tools to expose metrics from hosts and applications; many exporter types exist.

7. cAdvisor Monitoring Docker Containers

cAdvisor (Container Advisor) by Google collects, aggregates, analyzes, and exports data from running containers, covering memory limits, GPU metrics, and more.

cAdvisor runs as a container, gathers data from the Docker daemon and Linux cgroups, and provides automatic discovery.

It also offers a useful web UI for visualizing host and container status.

8. Scrape Target Lifecycle

Service discovery → configuration → relabeling (relabel_configs) → scraping → metric relabeling (metric_relabel_configs).
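Of the lifecycle stages above, relabeling is the least self-explanatory, so here is a simplified Python model of it. It covers only the keep, drop, and replace actions, joins source labels with ";", and uses Python regex and replacement syntax (`\1` rather than `$1`); the function `relabel` and the rule dictionaries are illustrative and do not reproduce every detail of Prometheus's behavior.

```python
import re


def relabel(labels, rules):
    """Apply rules in order; return the new labels, or None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are concatenated before matching.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        action = rule.get("action", "replace")
        if action == "keep":
            if match is None:
                return None
        elif action == "drop":
            if match is not None:
                return None
        elif action == "replace":
            if match is not None:
                labels[rule["target_label"]] = match.expand(rule.get("replacement", r"\1"))
        else:
            raise ValueError(f"unsupported action: {action}")
    return labels
```

A rule set might, for example, keep only targets whose env label is prod or staging and copy the host part of `__address__` into an instance label.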

9. PromQL Query Language

(1) Selector : A selector is a set of label matchers (including the metric name) that identifies time series.

Label matchers are enclosed in braces; omitting them selects every series of the metric. A selector returns an instant vector, or a range vector when a time range is appended.

# example
prometheus_build_info{version="2.17.0"}

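(2) Label matcher : Restricts a query to series with specific label values, using the operators = (equal), != (not equal), =~ (regex match), and !~ (regex mismatch). Regular expressions are fully anchored, so they must match the whole label value.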
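The semantics of the four matcher operators can be modeled in a few lines of Python. This is a sketch for intuition: the function `matches` is hypothetical, and it uses Python's `re` module, whose regex dialect differs in places from the RE2 syntax Prometheus uses.

```python
import re


def matches(labels: dict[str, str], name: str, op: str, pattern: str) -> bool:
    """Evaluate one label matcher against a series' label set."""
    value = labels.get(name, "")  # a missing label behaves like an empty string
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        # fullmatch mirrors the requirement that the regex match the whole value
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown operator: {op}")
```

Note that `=` compares literally: with a mountpoint of "/run", the pattern "/|/run" matches under `=~` but not under `=`.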

(3) Range, offset, subquery : Range vectors use [] to specify a time range; offset queries past data; subqueries allow nested queries.

(4) PromQL operators : Vector matching supports one‑to‑one, many‑to‑one, one‑to‑many joins.

(5) PromQL functions : label_join() and label_replace() manipulate labels; predict_linear() forecasts future values using linear regression; rate(), irate(), sort(), sort_desc() provide rate calculations and ordering.
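To show what distinguishes rate() from irate(), the Python sketch below computes both from raw counter samples, treating any decrease as a counter reset. It is deliberately simplified: the names `simple_rate` and `simple_irate` are invented here, and the real rate() also extrapolates to the edges of the query window, which this version does not.

```python
Sample = tuple[float, float]  # (unix timestamp in seconds, counter value)


def increase(samples: list[Sample]) -> float:
    """Total counter increase across samples, treating any drop as a reset to zero."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples: list[Sample]) -> float:
    """Average per-second increase over the whole window."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


def simple_irate(samples: list[Sample]) -> float:
    """Per-second increase between the last two samples only."""
    return simple_rate(samples[-2:])
```

Because the instantaneous variant looks only at the last two samples, it reacts quickly to spikes, while the windowed average is smoother and usually a better fit for alerting.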

10. Calculate CPU Usage

# example: CPU usage (%) per instance, computed as 100 minus the idle share
100 - avg by (instance) (irate(node_cpu_seconds_total{job="node", mode="idle"}[5m])) * 100

11. Calculate CPU Load (Saturation)

CPU saturation can be observed via the load average, which reflects the average run-queue length over a time window and is interpreted relative to the number of CPU cores. Load lower than the core count is normal; sustained higher load indicates saturation.

Use node_load* metrics (e.g., node_load1 for 1‑minute load).

# number of CPUs per instance
count by (instance)(node_cpu_seconds_total{mode="idle"})
# flag instances whose 1-minute load exceeds twice their CPU count
node_load1 > on (instance) 2 * count by (instance)(node_cpu_seconds_total{mode="idle"})

12. Calculate Memory Usage

Node Exporter provides memory metrics prefixed with node_memory. Use total, free, buffers, and cached values to compute usage:

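# total memory
node_memory_MemTotal_bytes
# free memory
node_memory_MemFree_bytes
# buffers
node_memory_Buffers_bytes
# cached memory
node_memory_Cached_bytes
# usage (%): (total - free - buffers - cached) / total * 100
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100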
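A quick worked example of the usage formula, written as a small Python function. The host figures are made up for illustration.

```python
def memory_used_percent(total: int, free: int, buffers: int, cached: int) -> float:
    """Share of memory in use, treating buffers and page cache as reclaimable."""
    used = total - free - buffers - cached
    return used / total * 100


GIB = 1024 ** 3
# Hypothetical host: 16 GiB total, 2 GiB free, 1 GiB buffers, 5 GiB cached.
print(memory_used_percent(16 * GIB, 2 * GIB, 1 * GIB, 5 * GIB))  # prints 50.0
```

Subtracting buffers and cache matters: counting them as used would report 87.5% for the same host, even though the kernel can reclaim that memory on demand.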

13. Calculate Memory Saturation

Monitor memory saturation by watching paging activity, using metrics that Node Exporter reads from /proc/vmstat:

node_vmstat_pgpgin: cumulative amount of data paged in from disk to memory, in KiB

node_vmstat_pgpgout: cumulative amount of data paged out from memory to disk, in KiB

# bytes paged in and out per second, per instance
1024 * sum by (instance) (rate(node_vmstat_pgpgin[1m]) + rate(node_vmstat_pgpgout[1m]))

Set up visualizations or alerts to detect abnormal host behavior.

14. Disk Usage

Measure disk usage (not utilization) using metrics like node_filesystem_size_bytes and node_filesystem_free_bytes per mount point.

# root filesystem usage (%)
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100

For specific mount points such as /data:

(node_filesystem_size_bytes{mountpoint="/data"} - node_filesystem_free_bytes{mountpoint="/data"}) / node_filesystem_size_bytes{mountpoint="/data"} * 100

Use regular expressions to match multiple mount points:

(node_filesystem_size_bytes{mountpoint=~"/|/run"} - node_filesystem_free_bytes{mountpoint=~"/|/run"}) / node_filesystem_size_bytes{mountpoint=~"/|/run"} * 100

Predict when disk space will run out with predict_linear:

# returns a result if the root filesystem is predicted to run out of space within 4 hours
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0

# the same prediction for every filesystem with job="node"
predict_linear(node_filesystem_free_bytes{job="node"}[1h], 4*3600) < 0
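The queries above rely on a straight-line fit through recent samples. The Python sketch below shows the underlying arithmetic with an ordinary least-squares regression. It is an approximation for intuition: the name `predict_linear_sketch` is invented here, and it extrapolates from the last sample's timestamp, whereas the PromQL function works relative to the query evaluation time.

```python
def predict_linear_sketch(samples: list[tuple[float, float]], seconds_ahead: float) -> float:
    """Fit value = slope * t + intercept by least squares, then extrapolate."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / sum(
        (t - mean_t) ** 2 for t, _ in samples
    )
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept
```

With one hour of samples for a hypothetical disk that starts at 10 GB free and loses 1 MB per second, the predicted free space four hours later is negative, which is exactly the condition `< 0` tests for.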

Link: https://blog.51cto.com/u_15576159/9380709
