Mastering Monitoring: Black‑Box vs White‑Box, Metrics, and Prometheus in Practice
This guide explains monitoring fundamentals, clears common misconceptions, compares black‑box and white‑box approaches, outlines key metrics such as latency, traffic, errors and saturation, and provides a deep dive into Prometheus architecture, data model, query language, and practical examples for CPU, memory, and disk monitoring.
1. Monitoring Concepts & Common Misconceptions
Monitoring is the core tool for managing infrastructure and services; it must be built and deployed together with applications. Without monitoring you cannot understand system state, diagnose failures, or prevent performance, cost, and availability issues.
2. Black‑Box vs White‑Box Monitoring
Black‑Box Monitoring
Observes applications or hosts from the outside, providing limited insight. Typical checks include:
Ping response from a host
Whether a specific TCP port is open
Correct HTTP status and data returned by an application
Presence of a specific process on the host
White‑Box Monitoring
Exposes internal state and performance data of the measured object, allowing deep insight through logs, structured events, or in‑memory aggregation (e.g., Prometheus metrics, HAProxy stats, varnishstat).
Log export: Parse HTTP server access logs to monitor request rate, latency, error percentage.
Structured event output: Send data directly to a processing system instead of writing to disk.
In‑memory aggregation: Metrics stored in memory can be read by command‑line tools or exposed via endpoints.
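The in‑memory aggregation approach can be sketched in Python: a process keeps metric values in memory and renders them on demand in a text format loosely following the Prometheus exposition format. The metric names and labels below are invented purely for illustration.

```python
# Minimal sketch of in-memory metric aggregation with a text-format
# export, loosely following the Prometheus exposition format.
# Metric names and labels here are illustrative, not from a real app.

from typing import Dict, Tuple

class MetricStore:
    def __init__(self) -> None:
        # key: (metric_name, sorted label pairs) -> current value
        self._values: Dict[Tuple[str, Tuple[Tuple[str, str], ...]], float] = {}

    def inc(self, name: str, labels: Dict[str, str], amount: float = 1.0) -> None:
        key = (name, tuple(sorted(labels.items())))
        self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        # Produce one 'name{labels} value' line per series.
        lines = []
        for (name, labels), value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        return "\n".join(lines)

store = MetricStore()
store.inc("http_requests_total", {"method": "GET", "code": "200"})
store.inc("http_requests_total", {"method": "GET", "code": "200"})
store.inc("http_requests_total", {"method": "POST", "code": "500"})
print(store.render())
```

A real application would serve `render()`'s output on an HTTP endpoint for the monitoring system to scrape.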
3. Key Metrics
Metrics collection can follow a pull model (the monitoring system scrapes data from services) or a push model (services send data to the monitoring system).
Google’s four golden signals (from the Site Reliability Engineering book):
Latency – time required to serve a request
Traffic – demand placed on the system, e.g., requests per second
Errors – rate of failed requests
Saturation – queued work that cannot be processed
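Given a batch of request records, the first three signals can be computed directly; a minimal sketch, assuming each record is a (latency, status code) pair (the data is invented for illustration):

```python
# Computing latency, traffic, and errors from request records.
# Each record is (latency_seconds, status_code); the data and the
# one-minute window are invented purely for illustration.

records = [(0.12, 200), (0.30, 200), (0.05, 500), (0.45, 200), (0.20, 503)]
window_seconds = 60  # assume these requests arrived within one minute

traffic = len(records) / window_seconds            # requests per second
errors = sum(1 for _, code in records if code >= 500) / len(records)
latencies = sorted(lat for lat, _ in records)
p50 = latencies[len(latencies) // 2]               # median latency
# Saturation is different: it needs a queue depth or utilization probe,
# not per-request data.

print(f"traffic={traffic:.3f} req/s, error_ratio={errors:.2f}, p50={p50}s")
```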
Brendan Gregg’s USE method adds resource‑level metrics — utilization, saturation, and errors — for CPU, disk, network, and other resources. The RED method (Tom Wilkie) focuses on service‑level indicators: rate, errors, and duration (latency).
4. Prometheus Overview
Introduction & Architecture
Prometheus is an open‑source monitoring and alerting toolkit that stores metrics as time‑series data with optional key‑value labels. It is widely used to monitor Kubernetes clusters.
Suitable & Unsuitable Scenarios
Suitable: recording any purely numeric time series; machine‑centric monitoring; highly dynamic service‑oriented architectures; multi‑dimensional data queries; reliability‑focused setups, since each Prometheus server is standalone and keeps working during network outages.
Unsuitable: Scenarios requiring 100 % accuracy, such as per‑request billing.
Data Model
Metrics are stored as time‑series consisting of a metric name, optional label set, timestamp, and value.
<metric_name>{<label_1>="<value_1>", ..., <label_N>="<value_N>"} <datapoint_numeric_value>
Metric Types
Counter: Monotonically increasing value (e.g., total packets received).
Gauge: Value that can go up or down (e.g., temperature, disk space).
Histogram: Aggregates observations into buckets (e.g., request latency distribution).
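The three types above can be sketched as toy Python classes; the bucket boundaries and sample latencies are arbitrary example values.

```python
# Toy versions of the three metric types described above.
# Bucket boundaries and observed values are arbitrary examples.

class Counter:
    """Monotonically increasing value; never decremented."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """Value that can go up or down."""
    def __init__(self):
        self.value = 0.0
    def set(self, v):
        self.value = v

class Histogram:
    """Counts observations into buckets (Prometheus exposes
    these cumulatively; this sketch keeps them per-bucket)."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last = +Inf
        self.total = 0.0
        self.count = 0
    def observe(self, v):
        self.total += v
        self.count += 1
        for i, bound in enumerate(self.bounds):
            if v <= bound:
                self.bucket_counts[i] += 1
                return
        self.bucket_counts[-1] += 1  # overflow (+Inf) bucket

h = Histogram([0.1, 0.5, 1.0])
for latency in (0.05, 0.3, 0.7, 2.0):
    h.observe(latency)
print(h.bucket_counts)  # one observation lands in each bucket here
```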
Aggregation & Summarization
Single metrics are often less useful alone; aggregations such as count, sum, average, percentiles, standard deviation, or rate provide actionable insight.
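The rate aggregation can be sketched for a counter: given two samples, the per‑second rate is the value delta over the time delta, with a guard for counter resets (PromQL’s rate() does this across many samples, plus extrapolation). The sample values below are invented.

```python
# Per-second rate from two counter samples, with handling for a
# counter reset (value dropping after a process restart).
# Sample values are invented for illustration.

def counter_rate(t1, v1, t2, v2):
    """Return the per-second increase between samples (t, value)."""
    if t2 <= t1:
        raise ValueError("samples must be in time order")
    delta = v2 - v1
    if delta < 0:          # counter reset: assume it restarted from 0
        delta = v2
    return delta / (t2 - t1)

print(counter_rate(0, 100, 60, 160))   # steady growth: 1.0/s
print(counter_rate(0, 100, 60, 30))    # after a reset: 30/60 = 0.5/s
```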
Exporters
Prometheus uses exporters to expose metrics from hosts and applications. Notable examples include Node Exporter for system metrics and cAdvisor for Docker container metrics.
cAdvisor
cAdvisor (Container Advisor) collects, aggregates, and exports resource usage and performance metrics for running containers — including memory limits and GPU usage — and provides a web UI for visualization.
Target Discovery Lifecycle
Service discovery → configuration → relabeling → scraping → metrics relabeling.
PromQL Query Language
Selectors and label matchers define which time‑series to query. Example selector:
prometheus_build_info{version="2.17.0"}
Label matchers use the operators =, !=, =~, and !~. Range vectors are created with square brackets (e.g., [5m]), and the offset modifier shifts the time window into the past.
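The four matcher operators can be sketched over an in‑memory list of series; the series data below is invented, and like PromQL, the regex operators match against the whole label value.

```python
import re

# Evaluating the four label-matcher operators (=, !=, =~, !~) against
# an in-memory list of series; the series themselves are made up.

series = [
    {"__name__": "prometheus_build_info", "version": "2.17.0"},
    {"__name__": "prometheus_build_info", "version": "2.16.0"},
    {"__name__": "node_load1", "instance": "host-1"},
]

def matches(labels, name, op, value):
    actual = labels.get(name, "")  # missing labels behave as ""
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":  # anchored regex match, as in PromQL
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(op)

def select(matchers):
    return [s for s in series
            if all(matches(s, n, op, v) for n, op, v in matchers)]

result = select([("__name__", "=", "prometheus_build_info"),
                 ("version", "=~", r"2\.1[67]\.0")])
print(len(result))  # both build_info series match
```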
Common functions include rate(), irate(), predict_linear(), label_join(), and label_replace().
CPU Usage Example
avg(irate(node_cpu_seconds_total{job="node"}[5m])) by (instance) * 100
CPU Saturation (Load) Example
Load averages can be compared against CPU count to detect saturation.
node_load1 > on (instance) 2 * count by (instance) (node_cpu_seconds_total{mode="idle"})
Memory Usage Example
Memory metrics from Node Exporter include node_memory_MemTotal_bytes, node_memory_MemFree_bytes, node_memory_Buffers_bytes, and node_memory_Cached_bytes. Usage can be calculated as:
(total - free - buffers - cached) / total * 100
Memory Saturation Example
Swap activity metrics node_vmstat_pswpin and node_vmstat_pswpout can be combined to assess memory pressure:
1024 * sum by (instance) (rate(node_vmstat_pswpin[1m]) + rate(node_vmstat_pswpout[1m]))
Disk Usage Example
Disk space usage can be computed as:
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
Specific mount points can be targeted with label matchers, and predict_linear() can forecast when a filesystem will run out of space.
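predict_linear() fits a least‑squares line through the samples in a range and extrapolates it forward; the idea can be sketched in Python (the sample data below is invented):

```python
# Least-squares linear extrapolation of free disk space, mimicking
# the idea behind PromQL's predict_linear(). Sample data is invented.

def predict_linear(samples, horizon):
    """samples: list of (t_seconds, value). Returns the predicted
    value at (last sample time + horizon) from a least-squares fit."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon) + intercept

# Free bytes shrinking by ~1 MB per minute; predict 4 hours ahead.
samples = [(0, 100e6), (60, 99e6), (120, 98e6), (180, 97e6)]
prediction = predict_linear(samples, 4 * 3600)
print(prediction)  # negative: the filesystem fills up within 4 hours
```

A prediction at or below zero within the horizon is the usual trigger condition for a disk‑full alert.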
Efficient Ops
This public account is maintained by Xiaotianguo and friends, who regularly publish original technical articles. We focus on the transformation of operations work and hope to accompany you throughout your operations career, growing together.