
Measuring & Boosting Microservice Reliability: Metrics, SLI/SLO, MTTR

This article explains how to define, measure, and improve microservice reliability using availability metrics, the four golden signals, RED and USE methods, and practical SLI/SLO and MTTR practices, offering concrete guidance for effective service governance.

Tech Architecture Stories

This article introduces the foundation of microservice governance: measurement.

The full process is: define → measure → improve. This article focuses on metric definition and measurement.

Two sayings frame the topic: "If you can't measure it, you can't improve it" (Peter Drucker) and "What you measure is what you get" (H. Thomas Johnson).

These sayings highlight the necessity of measurement and the pitfalls of misleading metrics.

Availability Metrics

Public clouds usually define SLA based on availability. For example, Google Cloud Platform defines monthly uptime percentage ≥ 99.95% with specific conditions:

Backoff requirement – after an error, the application waits before retrying, starting with a 1-second interval and exponentially increasing up to 32 seconds.

Downtime – an error rate exceeding 10 %.

Downtime period – a continuous five-minute downtime interval; shorter interruptions are not counted.

Error rate – the proportion of requests returning HTTP 500 internal-error responses out of all valid requests, excluding retries that follow the backoff rules.
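The backoff rule above (start at 1 second, double the wait after each failure, cap at 32 seconds) can be sketched in a few lines. This is an illustrative sketch, not code from any cloud SDK; the function names are invented for the example.

```python
import time

def backoff_delays(base=1.0, cap=32.0, retries=6):
    """Wait time before each retry: doubles from `base`, capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(retries)]

def call_with_backoff(call, retries=6):
    """Run `call`, retrying after failures with exponentially growing delays."""
    delays = backoff_delays(retries=retries)
    for i in range(retries + 1):
        try:
            return call()
        except Exception:
            if i == retries:
                raise          # budget exhausted, surface the error
            time.sleep(delays[i])
```

In production you would also add jitter to the delay so that many clients recovering from the same outage do not retry in lockstep.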

Additional common reliability indicators include MTTR, MTBF, and overall availability, which are essential for any organization with service dependencies.

Mean Time Between Failures (MTBF) – average time between two failures.

Mean Time To Recovery (MTTR) – total repair time divided by number of failures; e.g., 40 hours of downtime over 20 failures yields MTTR = 2 hours.

Availability – A = MTBF / (MTBF + MTTR). Example: MTBF = 438 000 h, MTTR = 2 h gives A ≈ 0.9999954.

Improving availability can be done by increasing MTBF or decreasing MTTR.
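The relationships above can be checked directly; the numbers below reproduce the article's own example (40 hours of downtime across 20 failures, MTBF of 438 000 hours).

```python
def availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

mttr = 40 / 20                      # 2 hours of repair per failure
a = availability(438_000, mttr)
print(round(a, 7))                  # ≈ 0.9999954
```

Note how the formula makes the two improvement levers explicit: growing MTBF (fail less often) or shrinking MTTR (recover faster) both push A toward 1.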

Microservice Monitoring Metrics

Traditional RPC‑based Monitoring

Typical RPC frameworks expose four key metrics:

Traffic – request volume per unit time; monitor trends, set thresholds, and consider both high and low traffic scenarios.

Latency – impacts throughput; use percentile values (e.g., P95, P90) instead of averages.

Timeout rate – proportion of requests exceeding a latency threshold; proper timeout settings across call chains are crucial.

Error rate – includes timeouts and other failures; watch for high error rates on low‑traffic services.
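All four metrics fall out of a per-request log of (latency, success) pairs. A minimal sketch, with an illustrative record layout and thresholds:

```python
# Illustrative per-request records: (latency in ms, succeeded?)
requests = [(12, True), (48, True), (95, True), (230, False),
            (15, True), (600, False), (33, True), (70, True)]

window_seconds = 60   # length of the observation window
timeout_ms = 500      # latency threshold for counting a timeout

traffic = len(requests) / window_seconds                  # requests per second
latencies = sorted(lat for lat, _ in requests)
p95 = latencies[int(0.95 * (len(latencies) - 1))]         # nearest-rank percentile, not the mean
timeout_rate = sum(lat > timeout_ms for lat, _ in requests) / len(requests)
error_rate = sum(not ok for _, ok in requests) / len(requests)
```

With this sample, one request out of eight exceeds the 500 ms timeout (12.5 %) and two fail outright (25 %); the P95 latency of 230 ms tells a very different story than the ~138 ms mean would.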

Google SRE also promotes the Four Golden Signals: latency, traffic, errors, and saturation.

Google SRE Four Golden Signals

These signals help assess user experience, service interruptions, and business impact.

Latency – time to serve a request; distinguish between successful and failed request latency.

Traffic – measures demand (e.g., requests per second).

Errors – rate of all error responses, including HTTP 500 and other failure modes.

Saturation – degree to which a resource is fully utilized; high utilization often precedes performance degradation.

Other Monitoring Approaches

RED

RED (Rate, Errors, Duration) adapts the golden signals for cloud‑native and microservice environments, focusing on request rate, error count, and request duration.

USE

USE (Utilization, Saturation, Errors) targets system‑level performance analysis, examining CPU, memory, network, and disk usage, as well as saturation and error counts.

SLI/SLO

SLI (Service Level Indicator) – ratio of successful events to total events, chosen to reflect service stability.

SLO (Service Level Objective) – target level for an SLI; meeting or exceeding it indicates satisfactory service.

Four SLI types:

Availability – success ratio of requests within a defined time window.

Latency – proportion of requests meeting latency thresholds (e.g., P90 ≤ X, P99 ≤ Y).

Quality – ratio of non-degraded requests; measures user impact during resource bottlenecks.

Freshness – proportion of data updated within a required time frame (e.g., 99 % of scores reflected within 30 minutes).
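Each SLI type above reduces to the same shape: good events divided by valid events over a window, compared against the SLO target. A minimal sketch with illustrative counts:

```python
def sli(good, total):
    """SLI = good events / valid events."""
    return good / total if total else 1.0

# Availability SLI: successful requests over all valid requests
availability_sli = sli(good=99_820, total=100_000)   # 0.9982

# Latency SLI: requests served within the P90 threshold
latency_sli = sli(good=91_500, total=100_000)        # 0.915

# Freshness SLI: records updated within the required 30-minute window
freshness_sli = sli(good=995, total=1_000)           # 0.995

# Compare against the SLO target for each indicator
meets_availability_slo = availability_sli >= 0.998
```

Keeping every SLI in this good/total form makes them easy to aggregate across instances and windows: sum the numerators and denominators separately, then divide.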

A Practical Implementation Method

Combine SLI/SLO with MTTR for microservice governance.

Define SLI/SLO and Thresholds

Identify business‑critical functions and select 3‑5 key service interfaces per function.

Instrument monitoring points as close to the user as possible (frontend > gateway > business layer).

Set reasonable SLOs and iteratively raise them (e.g., from 99 % to 99.9 %).

Track process metrics such as SLO alarm precision and recall to evaluate alert quality.
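Raising an SLO from 99 % to 99.9 % shrinks the error budget tenfold, which is easiest to see as allowed downtime per window. A quick calculation over a 30-day month:

```python
def allowed_downtime_minutes(slo, days=30):
    """Error budget expressed as downtime allowed per window."""
    return (1 - slo) * days * 24 * 60

print(allowed_downtime_minutes(0.99))     # 432 minutes (~7.2 hours)
print(allowed_downtime_minutes(0.999))    # 43.2 minutes
print(allowed_downtime_minutes(0.9995))   # 21.6 minutes (GCP's 99.95 % tier)
```

This is why iterating the SLO upward gradually matters: each extra nine demands an order-of-magnitude tighter operational response.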

MTTR – Measuring Fault‑Resolution Speed

When an SLO breach triggers an alarm, the incident handling flow includes TTI (detection), TTK (diagnosis), TTF/TTM (mitigation/recovery), and TTV (verification). Reducing these times shortens user impact.
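If a timestamp is recorded at each stage of an incident, the TT* components and end-to-end recovery time fall out directly. The field names and times below are illustrative:

```python
from datetime import datetime

# Illustrative incident timeline, one timestamp per handling stage
incident = {
    "failed":    datetime(2024, 5, 1, 10, 0),    # fault begins
    "detected":  datetime(2024, 5, 1, 10, 6),    # SLO alarm fires (TTI)
    "diagnosed": datetime(2024, 5, 1, 10, 26),   # root cause identified (TTK)
    "mitigated": datetime(2024, 5, 1, 10, 41),   # fix or rollback applied (TTF/TTM)
    "verified":  datetime(2024, 5, 1, 10, 50),   # SLI confirmed back to normal (TTV)
}

def minutes(start, end):
    return (incident[end] - incident[start]).total_seconds() / 60

tti = minutes("failed", "detected")            # 6 min to notice
ttk = minutes("detected", "diagnosed")         # 20 min to understand
ttm = minutes("diagnosed", "mitigated")        # 15 min to mitigate
ttv = minutes("mitigated", "verified")         # 9 min to confirm
recovery = minutes("failed", "verified")       # 50 min of user impact
```

Breaking recovery time down this way shows where to invest: in this example the diagnosis stage dominates, so better observability would cut MTTR more than faster deploys would.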

The next article will discuss how to handle SLO-generated alerts effectively.

Written by Tech Architecture Stories – an internet tech practitioner sharing insights on business architecture, technology, and a lifelong love of tech.