
Mastering Monitoring: Black‑Box vs White‑Box, Metrics, and Prometheus in Practice

This guide explains monitoring fundamentals, clears up common misconceptions, compares black‑box and white‑box approaches, outlines key metrics such as latency, traffic, errors, and saturation, and provides a deep dive into Prometheus architecture, its data model, the PromQL query language, and practical examples for CPU, memory, and disk monitoring.

Efficient Ops

1. Monitoring Concepts & Common Misconceptions

Monitoring is the core tool for managing infrastructure and services; it must be built and deployed together with applications. Without monitoring you cannot understand system state, diagnose failures, or prevent performance, cost, and availability issues.

2. Black‑Box vs White‑Box Monitoring

Black‑Box Monitoring

Observes applications or hosts from the outside, providing limited insight. Typical checks include:

Ping response from a host

Whether a specific TCP port is open

Correct HTTP status and data returned by an application

Presence of a specific process on the host
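A black‑box TCP port check can be as small as a single handshake attempt. A minimal Python sketch (the function name is illustrative):

```python
import socket

def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Black-box check: can we complete a TCP handshake to host:port?"""
    try:
        # create_connection resolves the address and attempts connect().
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or unreachable: the check fails.
        return False
```

The same pattern generalizes to the other checks above: an HTTP probe additionally inspects the status code and response body, and an ICMP ping probe measures round‑trip time.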

White‑Box Monitoring

Exposes the internal state and performance data of the measured object, allowing deep insight through logs, structured events, or in‑memory aggregation (e.g., Prometheus metrics, HAProxy's stats page, varnishstat).

Log export: Parse HTTP server access logs to monitor request rate, latency, error percentage.

Structured event output: Send data directly to a processing system instead of writing to disk.

In‑memory aggregation: Metrics stored in memory can be read by command‑line tools or exposed via endpoints.
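In‑memory aggregation exposed via an endpoint is exactly what Prometheus scrapes. A dependency‑free sketch of the text exposition format (metric name and port are illustrative; real services typically use a client library such as prometheus_client):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory aggregation: a counter the application increments on each request.
http_requests_total = 0

def render_metrics() -> str:
    # Prometheus text exposition format: HELP and TYPE lines, then samples.
    return (
        "# HELP http_requests_total Total HTTP requests served.\n"
        "# TYPE http_requests_total counter\n"
        f'http_requests_total{{path="/"}} {http_requests_total}\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metric values on every GET request."""
    def do_GET(self):
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 9100), MetricsHandler).serve_forever()
```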

3. Key Metrics

Metrics collection follows either a pull model, in which the monitoring system scrapes metrics from services (as Prometheus does), or a push model, in which services send metrics to the monitoring system.

Google’s four recommended metrics (the "four golden signals"):

Latency – the time it takes to serve a request

Traffic – the demand placed on the system, e.g., requests per second

Errors – the rate of requests that fail

Saturation – how "full" the service is, e.g., work queued that cannot yet be processed

Brendan Gregg's USE Method adds resource‑level metrics, utilization, saturation, and error count, for each resource such as CPU, disk, and network. The RED Method (Tom Wilkie) focuses on service‑level indicators: Rate, Errors, and Duration.
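As an illustration, the RED indicators for an HTTP service instrumented with the widely used `http_requests_total` counter and `http_request_duration_seconds` histogram (assumed metric names, 5‑minute window) might be queried as:

```promql
# Rate: requests per second, per service
sum by (job) (rate(http_requests_total[5m]))

# Errors: fraction of requests returning a 5xx status
sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
  / sum by (job) (rate(http_requests_total[5m]))

# Duration: 95th-percentile latency from the histogram buckets
histogram_quantile(0.95,
  sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```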

4. Prometheus Overview

Introduction & Architecture

Prometheus is an open‑source monitoring and alerting toolkit that stores metrics as time‑series data with optional key‑value labels. It is widely used to monitor Kubernetes clusters.

Suitable & Unsuitable Scenarios

Suitable: recording any purely numeric time series; machine‑centric monitoring; highly dynamic service‑oriented architectures; multi‑dimensional data collection and querying. Each Prometheus server is standalone and designed for reliability, so it keeps working when other parts of the infrastructure fail.

Unsuitable: Scenarios requiring 100 % accuracy, such as per‑request billing.

Data Model

Metrics are stored as time‑series consisting of a metric name, optional label set, timestamp, and value.

<metric_name>{<label_name>="<label_value>", ...} <sample_value>

For example: node_cpu_seconds_total{cpu="0",mode="idle"} 362787.33

Metric Types

Counter: Monotonically increasing value (e.g., total packets received).

Gauge: Value that can go up or down (e.g., temperature, disk space).

Histogram: Aggregates observations into buckets (e.g., request latency distribution).
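Because a counter only ever increases, it is almost always queried through rate() rather than read directly, while a gauge is read as‑is. A small sketch using Node Exporter metric names:

```promql
# Counter: per-second receive throughput derived from a monotonic total
rate(node_network_receive_bytes_total[5m])

# Gauge: current free memory, read directly
node_memory_MemFree_bytes
```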

Aggregation & Summarization

Single metrics are often less useful alone; aggregations such as count, sum, average, percentiles, standard deviation, or rate provide actionable insight.
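To see why single values mislead, compare the mean with a tail percentile over a skewed sample (the latency numbers are invented for illustration):

```python
import math

def p_quantile(samples, q):
    """Nearest-rank quantile of a list of observations (0 < q <= 1)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# One slow outlier dominates the mean but is invisible in the median.
latencies_ms = [12, 15, 11, 230, 14, 13, 16, 12, 18, 900]
avg = sum(latencies_ms) / len(latencies_ms)   # 124.1 ms
p50 = p_quantile(latencies_ms, 0.5)           # 14 ms
p95 = p_quantile(latencies_ms, 0.95)          # 900 ms
```

The mean suggests a slow service, the median suggests a fast one, and only the percentile exposes the tail, which is why dashboards typically show several aggregations side by side.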

Exporters

Prometheus uses exporters to expose metrics from hosts and applications. Notable examples include Node Exporter for system metrics and cAdvisor for Docker container metrics.

cAdvisor

cAdvisor (Container Advisor) collects, aggregates, and exports container metrics, covering memory limits, GPU usage, and provides a web UI for visualization.

Target Discovery Lifecycle

Service discovery → configuration → relabeling → scraping → metrics relabeling.
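This lifecycle is driven by the Prometheus configuration. A minimal static sketch (target addresses are hypothetical):

```yaml
scrape_configs:
  - job_name: node              # attached to every sample as job="node"
    scrape_interval: 15s
    static_configs:
      - targets: ["192.168.1.10:9100", "192.168.1.11:9100"]
    relabel_configs:            # applied after discovery, before scraping
      - source_labels: [__address__]
        target_label: instance
```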

PromQL Query Language

Selectors and label matchers define which time‑series to query. Example selector:

prometheus_build_info{version="2.17.0"}

Label matchers use operators =, !=, =~, !~. Range vectors are created with square brackets, and offsets shift the time window.
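A few examples of each construct, using Node Exporter metric names:

```promql
# Exact and negative label matches
node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}

# Regular-expression match
node_network_receive_bytes_total{device=~"eth.*"}

# Range vector: the last 5 minutes of samples
node_cpu_seconds_total[5m]

# Offset: the same range, shifted one hour into the past
rate(node_cpu_seconds_total[5m] offset 1h)
```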

Common functions include rate(), irate(), predict_linear(), label_join(), and label_replace().

CPU Usage Example

CPU usage per instance can be derived from the idle‑mode counter: average the per‑core idle rate, then subtract from 100 %.

100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100)

CPU Saturation (Load) Example

Load averages can be compared against CPU count to detect saturation.

node_load1 > on (instance) 2 * count by (instance)(node_cpu_seconds_total{mode="idle"})

Memory Usage Example

Memory metrics from Node Exporter include node_memory_MemTotal_bytes, node_memory_MemFree_bytes, node_memory_Buffers_bytes, and node_memory_Cached_bytes. Usage can be calculated as:

(total - free - buffers - cached) / total * 100
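Written out against the Node Exporter metric names above:

```promql
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes
  - node_memory_Buffers_bytes - node_memory_Cached_bytes)
  / node_memory_MemTotal_bytes * 100
```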

Memory Saturation Example

Swap activity metrics node_vmstat_pswpin and node_vmstat_pswpout (pages swapped in and out) can be combined to assess memory pressure:

1024 * sum by (instance) (rate(node_vmstat_pswpin[1m]) + rate(node_vmstat_pswpout[1m]))

Disk Usage Example

Disk space usage can be computed as:

(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100

Specific mount points can be targeted, and predict_linear can forecast when a filesystem will run out of space.
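For example, a sketch that flags root filesystems predicted to fill within four hours, extrapolating from the last hour's trend:

```promql
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
```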

Tags: monitoring, Cloud Native, operations, observability, metrics, Prometheus
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career.
