
Master Monitoring: Collect Metrics for New Systems Using White‑Box Techniques & the Four Golden SRE Indicators

This article explains how to approach monitoring for a newly introduced system by focusing on white‑box metric collection, distinguishing basic and business metrics, outlining common collection methods, and detailing Google SRE's four golden indicators—error, latency, traffic, and saturation—to guide effective observability.

ITPUB

Jan 31, 2019


Earlier articles in this monitoring series covered Kafka, Zookeeper, ElasticSearch, Hadoop and e‑commerce platforms. Real‑world services, however, are built from many open‑source or custom middleware components, and their complexity grows with every new feature. When a brand‑new system appears, the first question is where to start monitoring it.

White‑Box vs Black‑Box Monitoring

White‑box monitoring observes a system's internal state (CPU, memory, logs, JMX, etc.), while black‑box monitoring checks its externally visible behavior (port probes, HTTP checks, end‑to‑end functional tests). This article focuses on white‑box metric collection; black‑box techniques are covered in a separate article on the Kafka Monitor source code.

Types of Monitoring Metrics

Metrics are divided into basic monitoring (system‑level data such as CPU, memory, disk, ports, processes) and business monitoring (application‑specific indicators that reflect real service health). Basic metrics alone rarely indicate service problems; they must be combined with business metrics for meaningful insight.

Common Collection Methods

Logs: Capture service activity via Nginx access logs, Rsyslog, Logstash, Filebeat, Flume, etc.

JMX: Java services expose metrics via JMX; tools like jmxtrans or jmxcmd can scrape them.

REST: Services such as Hadoop or ElasticSearch provide REST APIs for metric extraction.

OpenMetrics: Prometheus‑compatible exporters expose metrics over HTTP in the Prometheus text format, which OpenMetrics standardizes.

Command‑line: Some services offer local commands that output metrics.

Push (active reporting): Services can push metrics to a monitoring system using custom sinks or plugins.

Instrumentation: Code‑level hooks that emit business metrics directly from the application; they give the most precise data but require development effort.

Other methods: Direct queries like Zookeeper four‑letter commands or MySQL SHOW STATUS.

If no ready‑made collector exists, a custom script may be needed.
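As one illustration of such a custom script, the sketch below reads ZooKeeper's `mntr` four‑letter command and turns its numeric fields into Prometheus‑style text samples. The function names `fetch_mntr` and `mntr_to_prometheus` and the `zookeeper` metric prefix are hypothetical choices for this example, not part of any existing collector. It assumes the server answers `mntr` with tab‑separated key/value lines; newer ZooKeeper releases may require four‑letter commands to be explicitly allowed in the server configuration.

```python
import re
import socket


def fetch_mntr(host, port=2181, timeout=3.0):
    """Send the `mntr` four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def mntr_to_prometheus(mntr_output, prefix="zookeeper"):
    """Convert tab-separated key/value lines into text-format samples.

    Non-numeric fields (for example the version string or server state)
    are skipped, because a sample value must be a number.
    """
    samples = []
    for line in mntr_output.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # not a key/value line
        value = value.strip()
        try:
            float(value)
        except ValueError:
            continue  # skip non-numeric values
        name = re.sub(r"[^a-zA-Z0-9_]", "_", key.strip())
        if name.startswith("zk_"):
            name = name[len("zk_"):]  # avoid "zookeeper_zk_..." names
        samples.append(f"{prefix}_{name} {value}")
    return samples
```

A cron job or a small HTTP handler could call `mntr_to_prometheus(fetch_mntr("localhost"))` and hand the resulting lines to whatever monitoring system is in use. The same parse‑and‑emit shape applies to other command‑style sources such as MySQL SHOW STATUS.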

Google SRE’s Four Golden Indicators

The core goals of monitoring are to understand service health, detect failures, and aid root‑cause analysis. Google SRE's four golden indicators, which serve those goals, are:

Error (and error rate)

Track failed requests and their share of total requests. Focus on failures of core functions, loss of essential components, the health of the master node, and the number of healthy nodes.
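A minimal sketch of the two checks described above follows. It assumes HTTP status codes are available (for example from an access log) and that the cluster needs a strict majority of nodes to stay available, as quorum‑based systems do. The names `error_rate` and `has_quorum`, and the default rule that only 5xx responses count as errors, are assumptions made for this example.

```python
def error_rate(status_codes, is_error=lambda code: code >= 500):
    """Return the fraction of requests that count as errors (0.0 to 1.0)."""
    total = len(status_codes)
    if total == 0:
        return 0.0  # no traffic: report zero rather than divide by zero
    errors = sum(1 for code in status_codes if is_error(code))
    return errors / total


def has_quorum(healthy_nodes, total_nodes):
    """True when a strict majority of nodes is healthy."""
    return healthy_nodes > total_nodes // 2


# Example: one server error out of four requests.
rate = error_rate([200, 200, 500, 404])  # 0.25
```

Which responses count as errors is a per-service decision. Passing a different `is_error` predicate, such as one that also counts 4xx responses on a core endpoint, adapts the check without changing the calculation.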

Latency (service response time)

Measure both internal I/O and network latency (basic) and end‑to‑end request latency for key functions (business). Rising latency causes requests to back up, which can snowball into a cascading failure.
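Latency is usually tracked as percentiles rather than an average, because a few slow requests can hide behind a healthy mean. The sketch below uses the nearest‑rank method; the function name `percentile` and the sample values are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 400, 12, 16, 13, 14]

p50 = percentile(latencies_ms, 50)            # 13
p99 = percentile(latencies_ms, 99)            # 400
mean = sum(latencies_ms) / len(latencies_ms)  # 52.0
```

Here the mean (52 ms) describes no real request, while the median shows typical behavior and the 99th percentile exposes the outlier.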

Traffic (request volume)

Monitor network/disk I/O at the system level and QPS/PV/UV at the application level; sudden spikes or drops may signal attacks or failures.
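One simple way to spot such spikes or drops is to compare the current request rate with a baseline. The sketch below assumes request timestamps in epoch seconds. The names `qps_by_second` and `traffic_anomaly` and the threefold threshold are hypothetical; a real threshold should come from the service's own traffic history.

```python
from collections import Counter


def qps_by_second(timestamps):
    """Count requests per whole second from epoch timestamps."""
    return dict(Counter(int(ts) for ts in timestamps))


def traffic_anomaly(current_qps, baseline_qps, ratio=3.0):
    """Return "spike", "drop", or None by comparing against a baseline.

    A spike means current traffic is at least `ratio` times the baseline;
    a drop means it has fallen to at most 1/`ratio` of the baseline.
    """
    if baseline_qps <= 0:
        return "spike" if current_qps > 0 else None
    if current_qps >= baseline_qps * ratio:
        return "spike"
    if current_qps <= baseline_qps / ratio:
        return "drop"
    return None


counts = qps_by_second([100.1, 100.5, 100.9, 101.2])  # {100: 3, 101: 1}
```

A fixed ratio is deliberately crude. Services with strong daily cycles would likely compare against the same hour on previous days rather than a single baseline number.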

Saturation (resource utilization)

Assess how close a service is to its capacity limits—CPU, memory, disk, network, message‑queue length, or specific functional unit usage (e.g., HDFS blocks, Kafka partitions).
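Saturation can be expressed as a utilization ratio together with an estimate of how long the remaining capacity will last. The sketch below assumes linear growth between two measurements, which is a simplification; the function names `utilization` and `seconds_until_full` are illustrative only.

```python
def utilization(used, capacity):
    """Fraction of capacity currently in use (0.0 to 1.0 and above)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


def seconds_until_full(used_then, used_now, interval_seconds, capacity):
    """Linear extrapolation of time left before capacity is reached.

    Returns None when usage is flat or shrinking, since no exhaustion
    time can be projected from those two points.
    """
    delta = used_now - used_then
    if delta <= 0:
        return None
    return (capacity - used_now) * interval_seconds / delta


# Hypothetical disk: 700 GB used an hour ago, 800 GB now, 1000 GB total.
ratio = utilization(800, 1000)                      # 0.8
eta = seconds_until_full(700, 800, 3600, 1000)      # 7200.0 (two hours)
```

The same two functions apply to any bounded resource, whether the unit is bytes on disk, messages in a queue, or partitions on a broker.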

Each indicator should be observed through both basic and business metrics to obtain a complete picture.

Summary

The article enumerates typical ways to collect monitoring data, emphasizes the need to combine system‑level and business‑level metrics, and explains how the four golden SRE indicators guide the design of effective observability for any complex service.
