Master the Four Golden Signals: A Practical Guide to System Monitoring
Understanding system health is essential for running reliable services. This guide explains how to collect, visualize, and alert on the four golden signals (latency, traffic, errors, and saturation) across servers, applications, and external dependencies, so that teams can detect and resolve issues quickly.
Introduction
Knowing the state of your systems is crucial for keeping applications and services reliable. Monitoring systems that collect metrics, visualize data, and alert operators give direct insight into the health and performance of a deployment.
Four Golden Signals
The Google SRE book defines four key signals for user‑facing systems: latency, traffic, errors, and saturation.
Latency
Definition: Time required for a service to process a request.
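Importance: Rising latency indicates performance degradation or a bottleneck. In microservices, tracking it also supports failing fast and giving callers quick feedback.
Monitoring: Track response-time percentiles such as the 99th percentile (tp99), not only the average.
Latency measures the time to complete an operation. It is vital to distinguish the latency of successful requests from that of failed ones, because a fast failure or a slow error can otherwise distort the overall picture.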
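As a minimal sketch of the tp99 idea, the following computes a nearest-rank percentile and reports it separately for successful and failed requests. The function names and the (duration_ms, succeeded) input shape are assumptions made for illustration, not part of any particular monitoring tool.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_outcome(requests, pct=99):
    """requests: iterable of (duration_ms, succeeded) pairs (hypothetical shape).

    Successes and failures are kept apart so that fast failures
    do not pull the success percentile down.
    """
    requests = list(requests)
    ok = [d for d, succeeded in requests if succeeded]
    failed = [d for d, succeeded in requests if not succeeded]
    return {
        "success": percentile(ok, pct) if ok else None,
        "failure": percentile(failed, pct) if failed else None,
    }
```

In practice a metrics backend usually computes percentiles from histograms rather than raw samples; the raw-sample version is used here only because it is the easiest to follow.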
Traffic
Definition: The demand placed on a service, measured as the flow of requests or data into and out of it.
Importance: Directly reflects load, useful for capacity planning.
Monitoring: Track transactions per second (TPS), queries per second (QPS), or requests per second.
Traffic indicates how busy a component is and helps identify resource needs or routing issues.
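One way to turn raw request events into a requests-per-second figure is a sliding window over request timestamps. The class below is a sketch under the assumption that timestamps arrive in non-decreasing order; the class name and the 60-second default window are illustrative choices.

```python
from collections import deque


class RequestRateCounter:
    """Average requests per second over a sliding time window."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        # Assumes timestamps are recorded in non-decreasing order.
        self._timestamps.append(timestamp)

    def rate(self, now):
        # Discard events that are no longer inside the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) / self.window_seconds
```

Passing the timestamp in explicitly, instead of reading the clock inside the class, keeps the counter easy to test and lets it be replayed over logged data.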
Errors
Definition: Number of erroneous requests, usually expressed as an error rate.
Importance: High error rates signal stability problems or design flaws.
Monitoring: Track error counts, types, sources, and causes.
Distinguishing error types enables targeted alerts and faster troubleshooting.
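The sketch below derives an overall error rate and a per-class breakdown from a batch of responses. It assumes the service speaks HTTP and that any status code of 400 or above counts as an error; both are assumptions for the example, and other protocols would need their own classification.

```python
from collections import Counter


def error_summary(status_codes):
    """Summarize a batch of HTTP status codes (assumed input format)."""
    total = len(status_codes)
    if total == 0:
        return {"error_rate": 0.0, "by_class": {}}
    # Group errors into classes such as "4xx" (client) and "5xx" (server).
    errors = Counter(f"{code // 100}xx" for code in status_codes if code >= 400)
    return {
        "error_rate": sum(errors.values()) / total,
        "by_class": dict(errors),
    }
```

Keeping the classes separate makes it possible to alert on server-side failures without being paged for a burst of client mistakes.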
Saturation
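Definition: How full a service's resources are, in terms of current utilization and remaining headroom.
Importance: Performance often degrades before a resource reaches 100% utilization, so the usable limit is lower than the nominal one.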
Monitoring: Observe CPU, memory, disk, network usage and set thresholds.
Saturation reveals resource limits and can correlate with latency, traffic, or error spikes.
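Threshold checks of this kind can be sketched as a small classification function. The warning and critical values shown are placeholders chosen for the example, not recommendations; appropriate thresholds depend on the workload.

```python
# Illustrative (warning, critical) thresholds as fractions of capacity.
EXAMPLE_THRESHOLDS = {
    "cpu": (0.70, 0.90),
    "memory": (0.70, 0.90),
    "disk": (0.80, 0.95),
}


def classify_saturation(utilization, thresholds):
    """Map each resource to "ok", "warning", or "critical".

    utilization: resource name -> fraction of capacity in use (0.0 to 1.0).
    thresholds:  resource name -> (warning, critical) fractions.
    """
    levels = {}
    for resource, used in utilization.items():
        warning, critical = thresholds[resource]
        if used >= critical:
            levels[resource] = "critical"
        elif used >= warning:
            levels[resource] = "warning"
        else:
            levels[resource] = "ok"
    return levels
```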
Measuring Across the Environment
Apply the four signals at each layer of the deployment hierarchy: individual server components, applications/services, server collections, external dependencies, and end‑to‑end user experience.
Metrics for Individual Server Components
Collect low‑level metrics from the underlying hardware and OS, such as CPU scheduling latency, CPU utilization, error events, and run‑queue length for saturation.
CPU: latency (scheduler delay), traffic (utilization), errors (CPU‑specific faults), saturation (run‑queue).
Memory: traffic (used memory), errors (out‑of‑memory), saturation (swap usage).
Storage: latency (await), traffic (I/O rate), errors (filesystem faults), saturation (I/O queue depth).
Network: latency (driver queue), traffic (bytes/packets per second), errors (packet loss), saturation (overflow/retransmits).
Also monitor OS‑level limits such as file handles and thread counts.
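As a rough illustration of the CPU saturation signal, the 1-minute load average divided by the number of CPUs approximates run-queue pressure per core. This is a sketch that assumes a Unix-like host, since os.getloadavg() is not available on Windows; note also that on Linux the load average counts tasks in uninterruptible sleep, so it is only an approximation of run-queue length.

```python
import os


def run_queue_saturation(load_avg_1m, cpu_count):
    """Load per CPU; values above 1.0 suggest tasks are waiting for a core."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg_1m / cpu_count


def current_run_queue_saturation():
    """Read the live values from the OS (Unix-like systems only)."""
    load_1m, _load_5m, _load_15m = os.getloadavg()
    return run_queue_saturation(load_1m, os.cpu_count() or 1)
```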
Metrics for Applications and Services
At the next level, track how applications use the underlying resources. Typical golden signals for customer‑facing apps are:
Latency: request completion time.
Traffic: requests per second.
Errors: application‑level failures.
Saturation: percentage of resources in use.
Also track the resources the application itself depends on, such as memory usage, open connections, and worker counts.
Metrics for Server Collections
When services span multiple instances, monitor coordination overhead, aggregate request rates, and collective resource usage.
Latency: time for pool to respond, including synchronization.
Traffic: pooled requests per second.
Errors: errors in client handling or peer communication.
Saturation: total resources used, number of active servers, available servers.
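The pool-level signals listed above can be derived by aggregating per-instance figures. The sketch below assumes each instance already reports its own request rate, error rate, and CPU utilization; the InstanceStats type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InstanceStats:
    """Per-instance figures (hypothetical shape for this example)."""
    requests_per_second: float
    errors_per_second: float
    cpu_utilization: float  # fraction of capacity, 0.0 to 1.0
    healthy: bool


def pool_signals(instances):
    """Aggregate traffic, errors, and saturation across a server pool."""
    healthy = [i for i in instances if i.healthy]
    total_rps = sum(i.requests_per_second for i in healthy)
    total_eps = sum(i.errors_per_second for i in healthy)
    mean_cpu = (
        sum(i.cpu_utilization for i in healthy) / len(healthy) if healthy else 0.0
    )
    return {
        "traffic_rps": total_rps,
        "error_rate": total_eps / total_rps if total_rps else 0.0,
        "mean_cpu_utilization": mean_cpu,
        "active_servers": len(healthy),
        "total_servers": len(instances),
    }
```

Pool latency is left out on purpose: averaging per-instance percentiles does not give a valid pool percentile, so it is better measured at the point where clients reach the pool.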
Metrics for External Dependencies
Track latency, traffic, errors, and saturation of third‑party services you cannot control, such as API response times, request volume, error rates, and account limits.
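Because a third-party service cannot be instrumented from the inside, one option is to measure it at the call site. The wrapper below is a minimal sketch: the DependencyStats class is hypothetical, and treating the number of calls made against a known account limit as the saturation signal is an assumption about how the provider meters usage.

```python
import time


class DependencyStats:
    """Client-side view of an external dependency's golden signals."""

    def __init__(self):
        self.calls = 0            # traffic
        self.errors = 0           # errors
        self.total_seconds = 0.0  # latency (summed; divide by calls for the mean)

    def call(self, func, *args, **kwargs):
        """Invoke func, recording duration and whether it raised."""
        self.calls += 1
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.total_seconds += time.perf_counter() - start

    def quota_used(self, limit):
        """Saturation: fraction of an account limit consumed so far."""
        return self.calls / limit
```

A call would then look like stats.call(client.fetch, order_id), where client.fetch stands in for whatever function actually reaches the external service.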
End‑to‑End User Experience Metrics
At the outermost layer, monitor the same four signals for the load balancer or entry point to gauge overall user impact.
Latency: time to fulfill a user request.
Traffic: user requests per second.
Errors: failures handling client requests.
Saturation: percentage of resources currently used.
Exceeding thresholds here directly affects SLAs.
Conclusion
The four golden signals provide a solid framework for building observability across all layers of a distributed system. While they are a great starting point, teams should also incorporate additional metrics specific to their environment to detect issues early and facilitate effective troubleshooting.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.