Mastering SRE’s Four Golden Signals with Prometheus: A Practical Guide
This guide defines the four SRE golden signals (Latency, Traffic, Errors, and Saturation), shows how to measure them with Prometheus in Node.js, compares them with the RED and USE frameworks, and provides concrete alerting rules for reliable service monitoring.
Understanding the Four Golden Signals
Modern systems consist of many interacting components, making structured monitoring essential. Google’s SRE team defined four service‑centric metrics—Latency, Traffic, Errors, and Saturation—that together give a comprehensive view of user experience and system health.
Latency: Time taken to respond to a request, measured as a distribution (p50, p90, p99) rather than a simple average.
Traffic: Demand placed on the service, typically measured as requests per second; it provides context for the other metrics.
Errors: Rate of failed requests, expressed as an absolute rate or as a percentage of total requests.
Saturation: Degree to which a service is “full”, often reflected by resource utilization such as CPU, memory, or connection‑pool usage.
These signals help you set Service Level Objectives (SLOs), detect performance degradation, and anticipate capacity limits.
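To illustrate why latency is tracked as a distribution rather than an average, here is a minimal sketch using made-up samples. The helper names `percentile` and `mean` and the latency values are hypothetical, and the nearest-rank method is only one of several ways to define a percentile.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples) {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
const latenciesMs = [...Array(98).fill(20), 3000, 3000];

const summary = {
  mean: mean(latenciesMs),
  p50: percentile(latenciesMs, 50),
  p90: percentile(latenciesMs, 90),
  p99: percentile(latenciesMs, 99),
};
console.log(summary);
```

With these samples the mean is 79.6 ms and both p50 and p90 are 20 ms, while p99 is 3000 ms: only the tail percentile reveals that some users wait three seconds.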
Implementing the Signals with Prometheus
Prometheus is the de‑facto standard for metric collection in cloud‑native environments. Below are concrete Node.js examples for each signal.
1. Latency
Record request duration in a histogram; percentiles (p50, p90, p99) are then estimated at query time with histogram_quantile().
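To make the percentile step concrete, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation inside the matching bucket, which is the general approach histogram_quantile() takes for classic histograms. The function name `estimateQuantile` and the counts are hypothetical; the bucket bounds match the histogram defined in this section. It assumes q is greater than 0.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// `buckets` is sorted by upper bound `le`, and the last entry has le = Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite upper bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly between the bucket's bounds.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Hypothetical cumulative counts for 1000 requests (le in seconds).
const cumulative = [
  { le: 0.01, count: 100 },
  { le: 0.05, count: 400 },
  { le: 0.1, count: 700 },
  { le: 0.5, count: 950 },
  { le: 1, count: 980 },
  { le: 2.5, count: 995 },
  { le: 5, count: 999 },
  { le: 10, count: 1000 },
  { le: Infinity, count: 1000 },
];

console.log(estimateQuantile(0.5, cumulative));
console.log(estimateQuantile(0.99, cumulative));
```

Because the estimate is interpolated between bucket bounds, it can only be as precise as the buckets are fine near the value of interest; it is worth placing bucket boundaries close to any latency threshold you plan to alert on. The prom-client histogram that follows produces cumulative bucket counts of this shape.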
import { Histogram } from 'prom-client';

// Bucket upper bounds are in seconds.
const requestLatency = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2.5, 5, 10]
});

// Express-style middleware: time each request and record it once the response has been sent.
function latencyMiddleware(req, res, next) {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000; // milliseconds to seconds
    requestLatency.observe({ method: req.method, route: req.route?.path || req.path, status_code: res.statusCode }, duration);
  });
  next();
}

2. Traffic
Use a monotonically increasing counter to track total requests.
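A counter suits traffic because the request rate can be derived from successive scrapes of its value. The sketch below is a simplified illustration of that derivation, not Prometheus's actual rate() implementation, which also extrapolates to the edges of the query window. The function name `counterRate` and the sample values are hypothetical.

```javascript
// Per-second rate from counter samples [{ t: seconds, v: counterValue }],
// sorted by time. A drop in value is treated as a counter reset (for example
// a process restart), so the value seen after the reset counts as new increase.
function counterRate(samples) {
  if (samples.length < 2) return NaN;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return increase / elapsed;
}

// Hypothetical scrapes every 15 s; the process restarts between t=30 and t=45.
const scrapes = [
  { t: 0, v: 1000 },
  { t: 15, v: 1300 },
  { t: 30, v: 1600 },
  { t: 45, v: 150 },
  { t: 60, v: 450 },
];

console.log(counterRate(scrapes)); // 17.5 requests per second
```

This is why the application only needs to increment the counter: restarts and rate calculations are handled at query time rather than in the service itself.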
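import { Counter } from 'prom-client';

const requestCounter = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route']
});

// Count every incoming request as it arrives.
// In Express, req.route is only set once a route has matched, so middleware
// that runs before routing falls back to req.path here.
function trafficMiddleware(req, res, next) {
  requestCounter.inc({ method: req.method, route: req.route?.path || req.path });
  next();
}

3. Errors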
Increment a counter for each failed request, optionally adding dimensions such as error type.
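import { Counter } from 'prom-client';

const errorCounter = new Counter({
  name: 'http_request_errors_total',
  help: 'Total number of HTTP request errors',
  labelNames: ['method', 'route', 'status_code', 'error_type']
});

// Express-style error-handling middleware (four arguments); register it after all routes.
function errorHandler(err, req, res, next) {
  // res.statusCode usually still holds the default 200 here, so label the
  // metric with the status that is actually sent below.
  const statusCode = 500;
  errorCounter.inc({
    method: req.method,
    route: req.route?.path || req.path,
    status_code: statusCode,
    error_type: err.name || 'UnknownError'
  });
  res.status(statusCode).send('Something went wrong');
}

4. Saturation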
Expose a gauge for resources such as a database connection‑pool usage ratio.
import { Gauge } from 'prom-client';

const connectionPoolGauge = new Gauge({
  name: 'db_connection_pool_usage_ratio',
  help: 'Database connection pool usage ratio',
  labelNames: ['pool_name']
});

// `db` is the application's database client; its pool is assumed to expose
// `max` (pool size) and `used` (connections currently in use).
function updateConnectionPoolMetrics() {
  const poolSize = db.pool.max;
  const active = db.pool.used;
  connectionPoolGauge.set({ pool_name: 'main' }, active / poolSize);
}

// Refresh the gauge every 5 seconds.
setInterval(updateConnectionPoolMetrics, 5000);

Comparing with RED and USE Frameworks
The Golden Signals overlap with other observability models. RED focuses on Rate, Errors, and Duration but omits Saturation, while USE emphasizes Utilization, Saturation, and Errors at the resource level. Combining elements from all three yields a balanced monitoring strategy.
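The overlap described above can be written down as a small lookup, shown here as a rough sketch. The mapping is a simplification for illustration: it treats USE's Utilization as a view of Saturation, and USE's Errors are resource-level errors rather than failed requests. The names `GOLDEN`, `frameworks`, and `missingSignals` are hypothetical.

```javascript
// The four golden signals, and the golden signal each framework metric
// most closely corresponds to.
const GOLDEN = ['latency', 'traffic', 'errors', 'saturation'];

const frameworks = {
  RED: { rate: 'traffic', errors: 'errors', duration: 'latency' },
  USE: { utilization: 'saturation', saturation: 'saturation', errors: 'errors' },
};

// Golden signals that a given framework does not address.
function missingSignals(name) {
  const covered = new Set(Object.values(frameworks[name]));
  return GOLDEN.filter((signal) => !covered.has(signal));
}

console.log(missingSignals('RED')); // [ 'saturation' ]
console.log(missingSignals('USE')); // [ 'latency', 'traffic' ]
```

Read this way, RED and USE are complementary: each covers the golden signals the other leaves out.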
Alerting Based on the Golden Signals
Prometheus alert rules can be defined for each signal. Examples:
- alert: HighLatency
  expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High latency on {{ $labels.route }}"
    description: "P99 latency for {{ $labels.route }} is above 2 seconds"
- alert: LowTraffic
  expr: sum(rate(http_requests_total[5m])) < 10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Unusually low traffic detected"
    description: "Request rate has fallen below 10 rps for 10 minutes"
- alert: HighErrorRate
  expr: sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is above 5% for 5 minutes"
- alert: HighConnectionPoolSaturation
  expr: avg(db_connection_pool_usage_ratio) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Connection pool nearing capacity"
    description: "Database connection pool usage is above 80% for 5 minutes"

Final Thoughts
Implementing Latency, Traffic, Errors, and Saturation gives you a clear, service‑level view of health. By integrating these metrics with RED and USE, and by configuring sensible alerts, you can build a robust observability platform that keeps services reliable and users satisfied.
dbaplus Community
