
Mastering SRE’s Four Golden Signals with Prometheus: A Practical Guide

This guide explains the four SRE golden signals—Latency, Traffic, Errors, and Saturation—covers their definitions, how to measure them with Prometheus in Node.js, compares them to RED and USE frameworks, and provides concrete alerting rules for reliable service monitoring.

dbaplus Community

May 11, 2025

Understanding the Four Golden Signals

Modern systems consist of many interacting components, making structured monitoring essential. Google’s SRE team defined four service‑centric metrics—Latency, Traffic, Errors, and Saturation—that together give a comprehensive view of user experience and system health.

Latency: Time taken to serve a request, tracked as a distribution (p50, p90, p99) rather than a simple average, because a mean hides the slow tail.

Traffic: Demand placed on the service, typically measured in requests per second, which provides context for the other signals.

Errors: Rate of failed requests, expressed as an absolute rate or as a percentage of total requests.

Saturation: How “full” the service is, usually read from its most constrained resource, such as CPU, memory, or connection‑pool usage.

These signals help you set Service Level Objectives (SLOs), detect performance degradation, and anticipate capacity limits.
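
To illustrate why latency is tracked as a distribution, the sketch below compares the mean with nearest-rank percentiles. The sample data and the helper names `percentile` and `mean` are made up for this example; they are not part of any library.

```javascript
// Nearest-rank percentile: the smallest sample that has at least p% of
// all samples at or below it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples) {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Hypothetical latencies in seconds: 98 fast requests and 2 slow ones.
const latencies = [...Array(98).fill(0.1), 3, 3];

console.log(mean(latencies));           // about 0.158, looks healthy
console.log(percentile(latencies, 50)); // 0.1
console.log(percentile(latencies, 99)); // 3, the slow tail the mean hides
```

The mean suggests every request is fast, while p99 shows that some users wait three seconds.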

Implementing the Signals with Prometheus

Prometheus is the de‑facto standard for metric collection in cloud‑native environments. Below are concrete Node.js examples for each signal, written with the prom-client library and Express-style middleware.

1. Latency

Record request duration in a histogram. Percentiles (p50, p90, p99) are not stored directly; they are estimated from the bucket counts at query time with histogram_quantile().

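import { Histogram } from 'prom-client';
// Histogram of request durations; each bucket is an upper bound in seconds.
const requestLatency = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2.5, 5, 10]
});
function latencyMiddleware(req, res, next) {
  // Use the monotonic clock so wall-clock adjustments cannot skew durations.
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const duration = Number(process.hrtime.bigint() - start) / 1e9;
    // Prefer the matched route pattern; raw paths such as /users/42 inflate label cardinality.
    const route = req.route?.path || req.path;
    requestLatency.observe({ method: req.method, route, status_code: res.statusCode }, duration);
  });
  next();
}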
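
Because the histogram stores only cumulative bucket counts, any percentile read from it is an estimate. The sketch below imitates the linear interpolation that histogram_quantile() applies to classic histograms. It is a simplified model for intuition, not Prometheus code, and it assumes 0 < q <= 1; the function name and the bucket data are made up.

```javascript
// buckets: [{ le: upper bound in seconds, count: cumulative count }],
// sorted by le, with the last entry being the +Inf bucket.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: the highest finite bound is the best answer available.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly inside the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Hypothetical cumulative counts for 100 requests.
const exampleBuckets = [
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 }
];

console.log(estimateQuantile(0.5, exampleBuckets));  // about 0.083
console.log(estimateQuantile(0.75, exampleBuckets)); // about 0.3
```

The estimate can only be as precise as the bucket that contains it, so it helps to place bucket boundaries near the latency thresholds you alert on.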

2. Traffic

Use a monotonically increasing counter to track total requests.

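import { Counter } from 'prom-client';
const requestCounter = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route']
});
function trafficMiddleware(req, res, next) {
  // Count on 'finish': in Express, req.route is only set once a route has matched,
  // so reading it at middleware time would always fall back to the raw path.
  // The counter therefore tracks completed responses.
  res.on('finish', () => {
    requestCounter.inc({ method: req.method, route: req.route?.path || req.path });
  });
  next();
}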
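
A raw counter only ever grows, so traffic is read as its per-second rate over a window. The sketch below shows the idea behind rate(), including the handling of counter resets after a process restart. It is a deliberate simplification: Prometheus also extrapolates to the window boundaries, which is omitted here. The function name and sample values are illustrative.

```javascript
// samples: [{ t: timestamp in seconds, v: counter value }], in time order.
function perSecondRate(samples) {
  if (samples.length < 2) return NaN;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the process restarted and the counter began again from zero,
    // so the new value itself is the increase since the reset.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return increase / elapsed;
}

// Hypothetical scrapes every 15 seconds, with a restart before t = 30.
const scrapes = [
  { t: 0, v: 100 },
  { t: 15, v: 145 },
  { t: 30, v: 30 },
  { t: 45, v: 75 },
  { t: 60, v: 135 }
];

console.log(perSecondRate(scrapes)); // 3
```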

3. Errors

Increment a counter for each failed request, optionally adding dimensions such as error type.

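import { Counter } from 'prom-client';
const errorCounter = new Counter({
  name: 'http_request_errors_total',
  help: 'Total number of HTTP request errors',
  labelNames: ['method', 'route', 'status_code', 'error_type']
});
// Express error-handling middleware; register it after all routes.
function errorHandler(err, req, res, next) {
  // res.statusCode is still the default 200 unless a handler already set an
  // error status, so fall back to 500 instead of recording a success code.
  const status = res.statusCode >= 400 ? res.statusCode : 500;
  errorCounter.inc({
    method: req.method,
    route: req.route?.path || req.path,
    status_code: status,
    error_type: err.name || 'UnknownError'
  });
  res.status(status).send('Something went wrong');
}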
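
Every distinct label value creates a separate time series, so a dimension such as error type is safest when its values come from a fixed set. One possible approach, sketched below, maps error names onto an allow-list and groups everything else under one value. The list contents and the helper name `errorTypeLabel` are assumptions for illustration.

```javascript
// Hypothetical allow-list; pick the error names that matter for your service.
const KNOWN_ERROR_TYPES = new Set(['ValidationError', 'TimeoutError', 'TypeError']);

// Returns a label value drawn from a bounded set, whatever was thrown.
function errorTypeLabel(err) {
  const name = err && typeof err.name === 'string' ? err.name : '';
  return KNOWN_ERROR_TYPES.has(name) ? name : 'other';
}

console.log(errorTypeLabel(new TypeError('bad input'))); // TypeError
console.log(errorTypeLabel(new Error('boom')));          // other
```

The result can be passed as the error_type label in place of the raw error name.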

4. Saturation

Expose a gauge for resources such as a database connection‑pool usage ratio.

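import { Gauge } from 'prom-client';
const connectionPoolGauge = new Gauge({
  name: 'db_connection_pool_usage_ratio',
  help: 'Database connection pool usage ratio',
  labelNames: ['pool_name']
});
// `db` stands for your database client; pool.max and pool.used are placeholders
// for whatever size and in-use counters your driver exposes.
function updateConnectionPoolMetrics() {
  const poolSize = db.pool.max;
  const active = db.pool.used;
  // Guard against a zero or missing pool size, which would yield NaN or Infinity.
  connectionPoolGauge.set({ pool_name: 'main' }, poolSize > 0 ? active / poolSize : 0);
}
// Refresh the gauge every 5 seconds.
setInterval(updateConnectionPoolMetrics, 5000);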
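
The same ratio pattern applies to other resources. As a minimal sketch for memory, the function below divides the V8 heap in use, taken from Node's built-in process.memoryUsage(), by a memory budget. The budget is an assumption you supply yourself (for example, the value given to --max-old-space-size or a container limit); it is not read from the runtime here, and the function name is illustrative.

```javascript
// Fraction of an assumed memory budget currently used by the V8 heap.
// limitBytes: the budget you configured; usage: defaults to a live reading.
function heapUsageRatio(limitBytes, usage = process.memoryUsage()) {
  // Guard against a missing or zero budget.
  if (!(limitBytes > 0)) return 0;
  return usage.heapUsed / limitBytes;
}

// Hypothetical 512 MiB budget.
const HEAP_BUDGET_BYTES = 512 * 1024 * 1024;
console.log(heapUsageRatio(HEAP_BUDGET_BYTES));
```

The value can be written to a gauge on a timer, in the same way as the connection pool ratio.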

Comparing with RED and USE Frameworks

The Golden Signals overlap with other observability models. RED focuses on Rate, Errors, and Duration but omits Saturation, while USE emphasizes Utilization, Saturation, and Errors at the resource level. Combining elements from all three yields a balanced monitoring strategy.

Alerting Based on the Golden Signals

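Each signal maps to a Prometheus alerting rule. The entries below belong in the rules: list of a rule group; treat the thresholds as starting points and tune them to your own baseline: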

- alert: HighLatency
  expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High latency on {{ $labels.route }}"
    description: "P99 latency for {{ $labels.route }} is above 2 seconds"

- alert: LowTraffic
  expr: sum(rate(http_requests_total[5m])) < 10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Unusually low traffic detected"
    description: "Request rate has fallen below 10 rps for 10 minutes"

- alert: HighErrorRate
  expr: sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is above 5% for 5 minutes"

- alert: HighConnectionPoolSaturation
  expr: max by (pool_name) (db_connection_pool_usage_ratio) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Connection pool nearing capacity"
    description: "Database connection pool usage is above 80% for 5 minutes"
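
The `for:` field in these rules means the expression has to stay true across consecutive evaluations for that long before the alert fires; until then it is pending, and a single false evaluation resets the timer. The sketch below models that state machine under simplifying assumptions (one series, no keep-firing behaviour). The function name and the evaluation data are made up.

```javascript
// evaluations: [{ t: seconds, active: boolean }], one entry per rule
// evaluation, in time order. Returns the alert state after each evaluation.
function alertStates(evaluations, forSeconds) {
  let activeSince = null;
  return evaluations.map(({ t, active }) => {
    if (!active) {
      // Any false evaluation resets the pending timer.
      activeSince = null;
      return 'inactive';
    }
    if (activeSince === null) activeSince = t;
    return t - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// Hypothetical evaluations every 60 seconds with a brief recovery at t = 120.
const evaluations = [
  { t: 0, active: true },
  { t: 60, active: true },
  { t: 120, active: false },
  { t: 180, active: true },
  { t: 240, active: true },
  { t: 300, active: true }
];

console.log(alertStates(evaluations, 120));
// [ 'pending', 'pending', 'inactive', 'pending', 'pending', 'firing' ]
```

A longer `for:` value filters out short spikes at the cost of slower detection.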

Final Thoughts

Implementing Latency, Traffic, Errors, and Saturation gives you a clear, service‑level view of health. By integrating these metrics with RED and USE, and by configuring sensible alerts, you can build a robust observability platform that keeps services reliable and users satisfied.
