Master the Four Golden Signals: A Practical Guide to System Monitoring

Understanding system health is essential for running reliable services. This guide explains how to use monitoring tools to collect, visualize, and alert on the four golden signals (latency, traffic, errors, and saturation) across servers, applications, and external dependencies, so teams can detect and resolve issues efficiently.

Efficient Ops

Oct 29, 2024

Introduction

Knowing system status is crucial for ensuring application and service reliability. Monitoring systems that collect metrics, visualize data, and alert operators provide the best insight into deployment health and performance.

Four Golden Signals

The Google SRE book defines four key signals for user‑facing systems: latency, traffic, errors, and saturation.

Latency

Definition: Time required for a service to process a request.

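Importance: Rising latency is often the first sign of performance degradation or a bottleneck. In microservices, a slow dependency delays every caller upstream, so latency targets underpin fast-fail and fast-feedback behavior.

Monitoring: Track response-time percentiles such as tp99 (the 99th percentile) rather than relying on averages, which hide slow outliers.

Latency measures the time to complete an operation. Track successful and failed requests separately: a failure that returns quickly can pull the overall figure down, while a slow failure may be worse than a fast one.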
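The tp99 idea can be sketched in a few lines. This is a minimal illustration, assuming request samples are already in memory as (duration in milliseconds, success flag) pairs. The function names and the sample data are hypothetical, and the nearest-rank method is only one of several percentile definitions.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_by_outcome(samples, pct=99):
    """samples: iterable of (duration_ms, succeeded) pairs. Reports each outcome separately."""
    ok = [d for d, succeeded in samples if succeeded]
    failed = [d for d, succeeded in samples if not succeeded]
    return {
        "success": percentile(ok, pct) if ok else None,
        "failure": percentile(failed, pct) if failed else None,
    }

# Hypothetical data: 98 fast successes, 2 slow successes, 5 failures that return quickly
samples = [(20, True)] * 98 + [(900, True)] * 2 + [(3, False)] * 5
print(latency_by_outcome(samples))
```

Mixing the five fast failures into the successful samples would lower the combined figure, which is why the two outcomes are reported separately here.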

Traffic

Definition: The demand placed on a service, measured as the volume of requests or data flowing in and out.

Importance: Directly reflects load, useful for capacity planning.

Monitoring: Use TPS (transactions per second), QPS (queries per second), or requests per second.

Traffic indicates how busy a component is and helps identify resource needs or routing issues.
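One way to turn raw request events into a requests-per-second figure is a sliding window. The sketch below assumes timestamps arrive in increasing order and are expressed in seconds. The `RateCounter` class is a hypothetical helper, not part of any monitoring library.

```python
from collections import deque

class RateCounter:
    """Counts events inside a sliding time window and reports events per second."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.events = deque()

    def record(self, timestamp):
        # Assumes timestamps are recorded in non-decreasing order
        self.events.append(timestamp)

    def rate(self, now):
        # Discard events that fall outside the window ending at `now`
        cutoff = now - self.window
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events) / self.window

counter = RateCounter(window_seconds=10.0)
for t in [0, 1, 2, 11, 12]:       # hypothetical request timestamps in seconds
    counter.record(t)
print(counter.rate(now=12))       # average requests per second over the last 10 s
```

A production system would normally rely on its metrics backend to compute rates from counters. The point here is only that traffic is a count divided by a time window.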

Errors

Definition: Number of erroneous requests, usually expressed as an error rate.

Importance: High error rates signal stability problems or design flaws.

Monitoring: Track error counts, types, sources, and causes.

Distinguishing error types enables targeted alerts and faster troubleshooting.
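As a rough illustration of an error rate broken down by type, the sketch below assumes errors are identified by HTTP-style status codes. That classification is an assumption made for the example. Real services may also need to count responses that succeed at the protocol level but carry wrong content.

```python
from collections import Counter

def error_summary(status_codes):
    """Return the overall error rate plus counts per error class."""
    total = len(status_codes)
    by_class = Counter()
    for code in status_codes:
        if 400 <= code < 500:
            by_class["client_4xx"] += 1
        elif code >= 500:
            by_class["server_5xx"] += 1
    errors = sum(by_class.values())
    rate = errors / total if total else 0.0
    return {"error_rate": rate, "by_class": dict(by_class)}

# Hypothetical sample of 100 responses
codes = [200] * 95 + [404] * 3 + [500] * 2
print(error_summary(codes))
```

Keeping the classes separate allows an alert on server-side failures alone, without noise from clients sending bad requests.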

Saturation

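Definition: How full a service is, measured as the utilization of its most constrained resources and the capacity that remains.

Importance: Performance often degrades before a resource reaches 100% utilization, because queues build and latency rises as the resource fills.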

Monitoring: Observe CPU, memory, disk, network usage and set thresholds.

Saturation reveals resource limits and can correlate with latency, traffic, or error spikes.
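Threshold checks of this kind can be sketched as a comparison between measured utilization and per-resource limits. The resource names and threshold values below are hypothetical. Appropriate limits depend on the workload.

```python
def saturation_report(usage, thresholds):
    """usage and thresholds map a resource name to a fraction in use (0.0 to 1.0)."""
    report = {}
    for resource, used in usage.items():
        limit = thresholds.get(resource)
        report[resource] = {
            "used": used,
            "headroom": round(1.0 - used, 4),
            # Alert only when a threshold is configured and has been reached
            "alert": limit is not None and used >= limit,
        }
    return report

usage = {"cpu": 0.92, "memory": 0.60, "disk": 0.85}        # hypothetical readings
thresholds = {"cpu": 0.85, "memory": 0.90, "disk": 0.90}   # hypothetical limits
for name, entry in saturation_report(usage, thresholds).items():
    print(name, entry)
```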

Measuring Across the Environment

Apply the four signals at each layer of the deployment hierarchy: individual server components, applications/services, server collections, external dependencies, and end‑to‑end user experience.

Metrics for Individual Server Components

Collect low‑level metrics from the underlying hardware and OS, such as CPU scheduling latency, CPU utilization, error events, and run‑queue length for saturation.

CPU: latency (scheduler delay), traffic (utilization), errors (CPU‑specific faults), saturation (run‑queue).

Memory: traffic (used memory), errors (out‑of‑memory), saturation (swap usage).

Storage: latency (await), traffic (I/O rate), errors (filesystem faults), saturation (I/O queue depth).

Network: latency (driver queue), traffic (bytes/packets per second), errors (packet loss), saturation (overflow/retransmits).

Also monitor OS‑level limits such as file handles and thread counts.
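For the CPU saturation entry, one commonly used proxy on Linux is the load average exposed in `/proc/loadavg`. The sketch below parses that file's text format and divides the one-minute load by the CPU count. It takes the text as an argument so it can run anywhere. The sample string is hypothetical, and on Linux the load average also counts tasks in uninterruptible sleep, so it is only an approximation of run-queue length.

```python
def parse_loadavg(text):
    """Parse the contents of Linux /proc/loadavg into a small dict."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load_1m": float(fields[0]),
        "load_5m": float(fields[1]),
        "load_15m": float(fields[2]),
        "runnable_now": int(runnable),   # scheduling entities currently runnable
        "total_tasks": int(total),       # scheduling entities that exist
    }

def cpu_saturation(text, cpu_count):
    """One-minute load per CPU; values above 1.0 suggest tasks are waiting."""
    return parse_loadavg(text)["load_1m"] / cpu_count

sample = "6.00 3.50 2.25 3/412 28190"    # hypothetical snapshot
print(cpu_saturation(sample, cpu_count=4))
```

On a live Linux host the text could come from reading `/proc/loadavg` directly, with `os.cpu_count()` supplying the CPU count.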

Metrics for Applications and Services

At the next level, track how applications use the underlying resources. Typical golden signals for customer‑facing apps are:

Latency: request completion time.

Traffic: requests per second.

Errors: application‑level failures.

Saturation: percentage of resources in use.

Include dependency‑related metrics like memory usage, open connections, and worker counts.
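At the application level, all four signals can be captured around each request handler. The sketch below is a minimal in-process recorder. The `SignalRecorder` name and the idea of expressing saturation as in-flight requests over a fixed worker count are assumptions for illustration, and the class is not thread-safe as written.

```python
import time
from contextlib import contextmanager

class SignalRecorder:
    """Records latency, traffic, errors, and a simple saturation figure for one service."""

    def __init__(self, max_workers):
        self.max_workers = max_workers
        self.in_flight = 0
        self.requests = 0      # traffic
        self.errors = 0        # errors
        self.durations = []    # latency samples in seconds

    @contextmanager
    def track(self):
        self.requests += 1
        self.in_flight += 1
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors += 1
            raise
        finally:
            self.durations.append(time.perf_counter() - start)
            self.in_flight -= 1

    def saturation(self):
        # Fraction of the worker pool busy right now
        return self.in_flight / self.max_workers

recorder = SignalRecorder(max_workers=4)
with recorder.track():
    pass                       # a request handler would run here
print(recorder.requests, recorder.errors, recorder.saturation())
```

In practice these counters would be exported to a metrics system rather than held in a Python list. The structure shows which event feeds which signal.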

Metrics for Server Collections

When services span multiple instances, monitor coordination overhead, aggregate request rates, and collective resource usage.

Latency: time for the pool to respond, including synchronization.

Traffic: pooled requests per second.

Errors: errors in client handling or peer communication.

Saturation: total resources used, number of active servers, available servers.
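Pool-level figures are usually derived by aggregating per-instance snapshots. The sketch below assumes each instance reports a dict with `healthy`, `rps`, `errors_per_s`, and `cpu` keys. That shape is hypothetical and stands in for whatever the monitoring agent actually emits.

```python
def pool_summary(instances):
    """Aggregate per-instance snapshots into pool-level traffic, errors, and saturation."""
    active = [i for i in instances if i["healthy"]]
    total_rps = sum(i["rps"] for i in active)
    total_errors = sum(i["errors_per_s"] for i in active)
    return {
        "active_servers": len(active),
        "total_servers": len(instances),
        "traffic_rps": total_rps,
        "error_rate": total_errors / total_rps if total_rps else 0.0,
        "avg_cpu": sum(i["cpu"] for i in active) / len(active) if active else None,
    }

# Hypothetical three-node pool with one node out of rotation
pool = [
    {"healthy": True, "rps": 100, "errors_per_s": 2, "cpu": 0.5},
    {"healthy": True, "rps": 300, "errors_per_s": 6, "cpu": 0.75},
    {"healthy": False, "rps": 0, "errors_per_s": 0, "cpu": 0.0},
]
print(pool_summary(pool))
```

An average hides imbalance between nodes, so a real dashboard would likely show the per-instance spread as well.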

Metrics for External Dependencies

Track latency, traffic, errors, and saturation of third‑party services you cannot control, such as API response times, request volume, error rates, and account limits.
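For a third-party service, the same four signals can be accumulated on the caller's side, with the account limit standing in for saturation. The sketch below is a hypothetical accumulator. The quota value and the recorded observations are invented for the example, and it assumes the provider enforces a fixed number of calls per window.

```python
class DependencyStats:
    """Accumulates observations for one external dependency within a quota window."""

    def __init__(self, quota_per_window):
        self.quota = quota_per_window
        self.calls = 0
        self.failures = 0
        self.total_seconds = 0.0
        self.slowest_seconds = 0.0

    def observe(self, duration_seconds, ok):
        self.calls += 1
        self.total_seconds += duration_seconds
        self.slowest_seconds = max(self.slowest_seconds, duration_seconds)
        if not ok:
            self.failures += 1

    def snapshot(self):
        return {
            "traffic_calls": self.calls,
            "error_rate": self.failures / self.calls if self.calls else 0.0,
            "mean_latency_s": self.total_seconds / self.calls if self.calls else None,
            "slowest_s": self.slowest_seconds,
            "quota_used": self.calls / self.quota,   # saturation against the account limit
        }

stats = DependencyStats(quota_per_window=1000)       # hypothetical account limit
for duration, ok in [(0.25, True), (0.25, True), (0.5, True), (1.0, False)]:
    stats.observe(duration, ok)
print(stats.snapshot())
```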

End‑to‑End User Experience Metrics

At the outermost layer, monitor the same four signals for the load balancer or entry point to gauge overall user impact.

Latency: time to fulfill a user request.

Traffic: user requests per second.

Errors: failures handling client requests.

Saturation: percentage of resources currently used.

Exceeding thresholds here directly affects SLAs.

Conclusion

The four golden signals provide a solid framework for building observability across all layers of a distributed system. While they are a great starting point, teams should also incorporate additional metrics specific to their environment to detect issues early and facilitate effective troubleshooting.
