Understanding SRE: Foundations, Metrics, and Tackling Technical Debt

This article introduces the fundamentals of Site Reliability Engineering (SRE), explains how to measure service stability with metrics like MTTR, MTBF, and availability, outlines the SRE workflow from prevention to post‑mortem, and discusses how to identify and reduce technical debt to improve system health.

HaoDF Tech Team

Oct 8, 2021

SRE Basics and Motivation

Fast product iteration and a growing volume of bugs and incidents create a stressful environment in which engineers spend most of their time firefighting instead of building new features. Keeping services highly available is a common pain point. The team at HaoDF (好大夫在线) combines Google's SRE principles with its own operational experience to improve reliability.

SRE Responsibilities

SRE is not just a firefighting role; it is a systematic approach that covers fault prevention, detection, handling, and post‑mortem to create a closed loop for service reliability.

SRE Workflow

1. Fault Prevention: With micro‑services, complexity and middleware dependencies increase, requiring engineers to adopt failure‑aware programming, proper RPC error handling, retry strategies, distributed transaction consistency, and sensible circuit‑breaker limits.

2. Fault Detection: Instrumentation data (RPC logs, middleware logs) is collected into ClickHouse and aggregated into metrics, and Prometheus triggers alerts on those metrics.

3. Fault Handling: Alerts are received by SRE members who use Grafana dashboards and predefined troubleshooting screenshots to quickly locate issues.

4. Fault Post‑mortem: Experiences are codified into dashboards and integrated into governance platforms for future incidents.
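As an illustration of the fault-prevention step, here is a minimal sketch of a retry helper with exponential backoff and a small circuit breaker. It is not the team's implementation: the class and function names, the thresholds, and the injectable `clock` and `sleep` parameters are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure of that trial call reopens the breaker.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # the breaker is open, so retrying would only add load
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Retries are only safe for idempotent operations, and a real RPC client would usually add jitter to the delay so that many callers do not retry in lockstep.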

Challenges in Measuring Service Availability

Time‑Based Metrics

Key reliability indicators include MTTR (Mean Time To Repair), MTTF (Mean Time To Failure), and MTBF (Mean Time Between Failures). Taking MTBF as the mean operating time between failures, availability can be expressed as A = MTBF / (MTBF + MTTR). The table below shows the downtime allowed by different availability targets:

Availability | Downtime / Year | Downtime / Month | Downtime / Week | Downtime / Day
--- | --- | --- | --- | ---
90% | 36.5 days | 72 hours | 16.8 hours | 2.4 hours
99% | 3.65 days | 7.2 hours | 1.68 hours | 14.4 minutes
99.9% | 8.76 hours | 43.8 minutes | 10.1 minutes | 1.44 minutes
99.99% | 52.56 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds
99.999% | 5.26 minutes | 25.9 seconds | 6.05 seconds | 0.86 seconds
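The downtime figures follow directly from the availability target: allowed downtime is (1 - availability) multiplied by the length of the period. A small sketch of that arithmetic, assuming the common definition availability = MTBF / (MTBF + MTTR); the function names are illustrative only.

```python
DAY = 24 * 3600  # seconds in a day


def availability(mtbf_hours, mttr_hours):
    """Fraction of time the service is up, given mean uptime and mean repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_seconds(availability_pct, period_seconds):
    """Downtime budget, in seconds, for an availability target over a period."""
    return (1 - availability_pct / 100) * period_seconds
```

For example, `downtime_seconds(99.9, 365 * DAY) / 3600` gives roughly 8.76 hours per year, and `availability(999, 1)` corresponds to three nines.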

Quality‑Based Metrics

Service quality is measured by request success rate (successful requests / total requests) and by error-rate thresholds over a time window, such as the proportion of 5xx responses (e.g., 5% 5xx over 10 minutes). The number of nines (e.g., three nines for 99.9%, four nines for 99.99%) indicates the target reliability level.
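A minimal sketch of such a check, assuming the rule means "the 5xx share of all requests seen in the last 10 minutes reaches 5%". The class name, the defaults, and the in-memory window are assumptions for illustration; in production this would normally be an alerting rule over stored metrics.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks the 5xx ratio over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=600, threshold=0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, is_5xx) in arrival order
        self.errors = 0

    def record(self, timestamp, status_code):
        is_error = 500 <= status_code <= 599
        self.events.append((timestamp, is_error))
        self.errors += is_error
        self._evict(timestamp)

    def _evict(self, now):
        # Drop requests that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, was_error = self.events.popleft()
            self.errors -= was_error

    def error_rate(self):
        return self.errors / len(self.events) if self.events else 0.0

    def success_rate(self):
        return 1.0 - self.error_rate()

    def breached(self):
        return self.error_rate() >= self.threshold
```

Timestamps are passed in explicitly so that the window can be replayed against recorded logs as well as live traffic.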

Five "Golden" Metrics

Capacity: QPS/QPM, core‑link QPS, per‑instance QPS, minimum instance count, CPU utilization.

Availability: Core‑link health, per‑minute Sentry alerts, 5xx/4xx/429/430 ratios.

Latency: Percentile latency (P95, P99) rather than average.

Error Rate: 5xx/4xx ratios at the gateway (nginx/kong).

Manual Intervention: Robustness, automatic failover, idempotence, retry mechanisms, and reduced human actions.
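The preference for percentiles over averages in the latency metric can be seen with a small sketch. It uses the nearest-rank percentile definition, which is one of several conventions; the sample data is invented for illustration.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented data: 98 fast requests (20 ms) and 2 very slow ones (2000 ms).
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
```

With this data the mean sits near 60 ms, which describes neither the typical 20 ms request nor the slow ones, while P99 exposes the 2000 ms tail. P95 alone would miss it here, which is why several percentiles are usually tracked together.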

SRE Toolchain

1. Data Collection: Fluent‑bit and gohangout send logs to Kafka.

2. Data Analysis: Stream and batch analysis using a custom TracerLog for dependency graphs, slow interfaces, slow SQL, and circular calls.

3. Data Storage: Prometheus for metrics, ClickHouse for long‑term storage with materialized views.

4. Monitoring Dashboards: ELK for log search, Grafana for metric visualization.

5. Alert Notification: AlertManager with custom hooks for phone, SMS, and WeChat alerts.

6. Alert Response: A PaaS platform automates routine operations and reduces operational cost.
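To make the alert-notification step more concrete, the sketch below shows how a custom hook might fan out the alerts in an AlertManager webhook payload to different channels. The `alerts`, `status`, `labels`, and `annotations` fields are part of AlertManager's webhook format; the `severity` label, the channel names, and the routing table are assumptions, not the team's actual configuration.

```python
# Hypothetical routing table: which channels to use for each severity label.
CHANNELS_BY_SEVERITY = {
    "critical": ["phone", "sms", "wechat"],
    "warning": ["wechat"],
}


def route_alerts(payload):
    """Return (channel, message) pairs for every firing alert in the payload."""
    notifications = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts are skipped in this sketch
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        severity = labels.get("severity", "warning")
        name = labels.get("alertname", "unknown")
        message = f"[{severity}] {name}: {annotations.get('summary', '')}"
        for channel in CHANNELS_BY_SEVERITY.get(severity, ["wechat"]):
            notifications.append((channel, message))
    return notifications
```

A real hook would then hand each pair to the corresponding phone, SMS, or WeChat gateway; those integrations are left out because they are specific to each organization.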

Technical Debt as an SRE Entry Point

Technical debt arises from poor code, flawed business modeling, and suboptimal architecture. Common causes include over-confidence, copy-paste programming, and mistaking agile for moving fast without ever refactoring.

Key debt categories:

Bad code accumulation.

Business modeling issues.

Architecture design flaws.

Design principles (SOLID) and stable dependency metrics (Fan‑in, Fan‑out, I = Fan‑out/(Fan‑in+Fan‑out)) help assess and reduce debt.
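The I metric is simple to compute once fan-in (the number of components that depend on a service) and fan-out (the number of components it depends on) are known. A minimal sketch, in which the function name and the handling of an isolated component are assumptions:

```python
def instability(fan_in, fan_out):
    """I = fan_out / (fan_in + fan_out); 0 is maximally stable, 1 maximally unstable."""
    total = fan_in + fan_out
    if total == 0:
        return 0.0  # isolated component: treated as stable by convention in this sketch
    return fan_out / total
```

For example, a service with three callers and one dependency has I = 0.25, while a service with no callers and four dependencies has I = 1.0.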

Our Debt Identification & Optimization Approach

We use link (call-chain) analysis to spot risks such as slow interfaces (target P99 < 100 ms for backend and < 600 ms for frontend), high error rates, slow SQL, and unstable dependencies.

Dependency stability is measured by the I metric; lower I indicates higher stability. Risks include multiple calls, circular calls, and bidirectional calls.
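The three call-pattern risks can be detected from a list of (caller, callee) edges extracted from trace data. The sketch below is an illustration under assumptions: the edge-list input, the function names, and the reading of "multiple calls" as the same edge appearing more than once within one request are all hypothetical.

```python
from collections import Counter


def repeated_calls(calls):
    """Edges that occur more than once in a single request's call list."""
    return {edge: n for edge, n in Counter(calls).items() if n > 1}


def bidirectional_pairs(calls):
    """Pairs of services that call each other directly."""
    edges = set(calls)
    return sorted({tuple(sorted((a, b))) for a, b in edges if a != b and (b, a) in edges})


def find_cycle(calls):
    """Return one circular call path (first node repeated at the end), or None."""
    graph = {}
    for caller, callee in calls:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = dict.fromkeys(graph, WHITE)
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in sorted(graph[node]):
            if color[nxt] == GRAY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

The recursive depth-first search is adequate for call graphs of modest depth; a very deep graph would call for an iterative version to stay within Python's recursion limit.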

Conclusion

The article shares SRE fundamentals, how SRE can use technical debt as a lever to improve system health, and recommends classic books like "Clean Code", "Clean Architecture", and SRE references for deeper study.

References are listed at the end of the original article.
