Understanding SRE: Foundations, Metrics, and Tackling Technical Debt
This article introduces the fundamentals of Site Reliability Engineering (SRE), explains how to measure service stability with metrics like MTTR, MTBF, and availability, outlines the SRE workflow from prevention to post‑mortem, and discusses how to identify and reduce technical debt to improve system health.
SRE Basics and Motivation
Fast product iteration and a growing volume of bugs and incidents create a stressful environment in which engineers spend most of their time firefighting instead of building new features. Ensuring high availability is a common pain point. The team at 好大夫在线 (HaoDF Online) combines Google's SRE principles with its own experience to improve reliability.
SRE Responsibilities
SRE is not just a firefighting role; it is a systematic approach that covers fault prevention, detection, handling, and post‑mortem to create a closed loop for service reliability.
SRE Workflow
1. Fault Prevention: With micro‑services, complexity and middleware dependencies increase, requiring engineers to adopt failure‑aware programming, proper RPC error handling, retry strategies, distributed transaction consistency, and sensible circuit‑breaker limits.
2. Fault Detection: Instrumentation data (RPC logs, middleware logs) is collected into ClickHouse and aggregated into metrics, and alerts are triggered via Prometheus.
3. Fault Handling: Alerts are received by SRE members who use Grafana dashboards and predefined troubleshooting screenshots to quickly locate issues.
4. Fault Post‑mortem: Experiences are codified into dashboards and integrated into governance platforms for future incidents.
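As an illustration of the fault-prevention step, here is a minimal sketch of retry with exponential backoff combined with a simple circuit breaker. It is not the team's implementation: the names `CircuitBreaker`, `CircuitOpenError`, and `call_with_retry`, and all default thresholds, are hypothetical choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after `max_failures` consecutive failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure during the trial reopens the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry `func` up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # never retry against an open breaker
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Retries are only safe when the remote operation is idempotent, and the breaker keeps a failing dependency from being hammered by those retries. A caller would typically wrap the RPC as `call_with_retry(lambda: breaker.call(rpc))`.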
Challenges in Measuring Service Availability
Time‑Based Metrics
Key reliability indicators include MTTR (Mean Time To Repair), MTTF (Mean Time To Failure), and MTBF (Mean Time Between Failures). Availability can be expressed as Availability = MTBF / (MTBF + MTTR). The table below shows the downtime allowed at different availability targets:
Availability   Downtime / Year   Downtime / Month   Downtime / Week   Downtime / Day
90%            36.5 days         72 hours           16.8 hours        2.4 hours
99%            3.65 days         7.2 hours          1.68 hours        14.4 minutes
99.9%          8.76 hours        43.8 minutes       10.1 minutes      1.44 minutes
99.99%         52.56 minutes     4.38 minutes       1.01 minutes      8.64 seconds
99.999%        5.26 minutes      25.9 seconds       6.05 seconds      0.87 seconds
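The arithmetic behind these figures can be sketched in a few lines. This assumes the standard definition, availability = MTBF / (MTBF + MTTR), and a 365-day year. The function names are illustrative only.

```python
DAY_SECONDS = 24 * 3600
YEAR_SECONDS = 365 * DAY_SECONDS


def availability(mtbf_hours, mttr_hours):
    """Fraction of time the service is up, given mean uptime and mean repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_budget_seconds(target, period_seconds):
    """Downtime allowed in a period for an availability target such as 0.999."""
    return (1.0 - target) * period_seconds


# A service that fails on average every 999 hours and takes 1 hour to repair
# sits at three nines.
three_nines = availability(mtbf_hours=999, mttr_hours=1)

# Yearly budget at 99.9%, converted to hours (about 8.76).
yearly_hours = downtime_budget_seconds(0.999, YEAR_SECONDS) / 3600
```

Because the budget shrinks by a factor of ten with each extra nine, reducing MTTR through faster detection and recovery is often a more practical lever than stretching MTBF.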
Quality‑Based Metrics
Service quality is measured by the request success rate (successful requests / total requests) and by error-rate thresholds, such as the proportion of 5xx responses over a time window (e.g., 5xx responses reaching 5% over 10 minutes). The number of nines (e.g., three nines for 99.9%, four nines for 99.99%) indicates the target reliability level.
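A minimal sketch of such a rule, assuming it is read as "the 5xx ratio over the last 10 minutes reaches 5%". The `ErrorRateWindow` class and its parameters are hypothetical; in practice this check would more likely live in the alerting system than in application code.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks the 5xx ratio over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=600, threshold=0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.samples = deque()  # entries of (timestamp, total_requests, errors_5xx)

    def record(self, timestamp, total, errors_5xx):
        """Add one scrape interval's counts and drop samples older than the window."""
        self.samples.append((timestamp, total, errors_5xx))
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()

    def error_rate(self):
        total = sum(sample[1] for sample in self.samples)
        errors = sum(sample[2] for sample in self.samples)
        return errors / total if total else 0.0

    def success_rate(self):
        return 1.0 - self.error_rate()

    def breached(self):
        return self.error_rate() >= self.threshold
```

Summing counts across the window, rather than averaging per-interval ratios, keeps low-traffic intervals from skewing the result.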
Five "Golden" Metrics
Capacity: QPS/QPM, core‑link QPS, per‑instance QPS, minimum instance count, CPU utilization.
Availability: Core‑link health, per‑minute Sentry alerts, 5xx/4xx/429/430 ratios.
Latency: Percentile latency (P95, P99) rather than average.
Error Rate: 5xx/4xx ratios at the gateway (nginx/kong).
Manual Intervention: Robustness, automatic failover, idempotence, retry mechanisms, and reduced human actions.
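To see why the latency metric uses percentiles rather than the average, consider the sketch below. It uses the nearest-rank percentile definition, which is one of several common ones, and the sample data is invented for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Invented data: 98 fast requests (20 ms) and 2 very slow ones (2000 ms).
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 59.6
p95_ms = percentile(latencies_ms, 95)            # 20
p99_ms = percentile(latencies_ms, 99)            # 2000
```

The mean of 59.6 ms describes no real request: most users see 20 ms and a few see 2 seconds. P99 surfaces that tail directly.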
SRE Toolchain
1. Data Collection: Fluent‑bit and gohangout send logs to Kafka.
2. Data Analysis: Stream and batch analysis using a custom TracerLog for dependency graphs, slow interfaces, slow SQL, and circular calls.
3. Data Storage: Prometheus for metrics, ClickHouse for long‑term storage with materialized views.
4. Monitoring Dashboards: ELK for log search, Grafana for metric visualization.
5. Alert Notification: AlertManager with custom hooks for phone, SMS, and WeChat alerts.
6. Alert Response: A PaaS platform automates routine operations and reduces operational cost.
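The alert notification step can be sketched as a function that turns an Alertmanager webhook payload into per-channel messages. The payload fields used here (`alerts`, `status`, `labels`, `annotations`) follow Alertmanager's webhook format. The routing table, the `severity` label, the `summary` annotation, and the function name are assumptions for this example and not the team's actual hook.

```python
# Hypothetical routing policy: which channels to notify for each severity label.
CHANNELS_BY_SEVERITY = {
    "critical": ["phone", "sms", "wechat"],
    "warning": ["sms", "wechat"],
}
DEFAULT_CHANNELS = ["wechat"]


def plan_notifications(payload):
    """Turn an Alertmanager webhook payload into a list of (channel, message) pairs."""
    plans = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown")
        if status == "resolved":
            # Recovery notices do not need to wake anyone up.
            channels = DEFAULT_CHANNELS
        else:
            channels = CHANNELS_BY_SEVERITY.get(labels.get("severity"), DEFAULT_CHANNELS)
        message = "[{status}] {name}: {summary}".format(
            status=status.upper(),
            name=labels.get("alertname", "unnamed"),
            summary=annotations.get("summary", "no summary"),
        )
        plans.extend((channel, message) for channel in channels)
    return plans
```

Keeping the routing decision in a pure function like this makes it testable without a running HTTP server; the actual phone, SMS, and WeChat senders would be called with the returned pairs.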
Technical Debt as an SRE Entry Point
Technical debt arises from poor code, flawed business modeling, and suboptimal architecture. Common causes include overconfidence, copy-paste programming, and mistaking agile for raw speed while skipping refactoring.
Key debt categories:
Bad code accumulation.
Business modeling issues.
Architecture design flaws.
Design principles (SOLID) and stable dependency metrics (Fan‑in, Fan‑out, I = Fan‑out/(Fan‑in+Fan‑out)) help assess and reduce debt.
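The instability metric can be computed directly from a list of call edges. In this sketch, the `(caller, callee)` edge representation and the service names are assumptions made for illustration.

```python
def instability(edges):
    """Compute I = fan_out / (fan_in + fan_out) for every service in a call graph.

    `edges` is an iterable of (caller, callee) pairs; duplicate edges count once.
    """
    fan_in, fan_out = {}, {}
    for caller, callee in set(edges):
        fan_out[caller] = fan_out.get(caller, 0) + 1
        fan_in[callee] = fan_in.get(callee, 0) + 1
    result = {}
    for service in set(fan_in) | set(fan_out):
        outgoing = fan_out.get(service, 0)
        incoming = fan_in.get(service, 0)
        # Every service here appears in at least one edge, so the sum is never zero.
        result[service] = outgoing / (incoming + outgoing)
    return result


# Hypothetical services: "order" calls "user" and "pay"; "pay" calls "user".
scores = instability([("order", "user"), ("order", "pay"), ("pay", "user")])
# "user" scores 0.0 (depended upon, depends on nothing),
# "order" scores 1.0 (depends on others, nothing depends on it),
# "pay" scores 0.5.
```

A dependency that points from a lower-I service to a higher-I one runs against the stable dependencies principle and is a candidate for review.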
Our Debt Identification & Optimization Approach
We use link (call-chain) analysis to spot risks such as slow interfaces (target P99 < 100 ms for backend, < 600 ms for frontend), high error rates, slow SQL, and unstable dependencies.
Dependency stability is measured by the I metric; lower I indicates higher stability. Risks include multiple calls, circular calls, and bidirectional calls.
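Two of these risks, bidirectional and circular calls, can be found with a simple graph traversal over call edges. This is a sketch under the assumption that edges are available as `(caller, callee)` pairs. It says nothing about how the team's TracerLog does it, and the recursive traversal suits small graphs only.

```python
def bidirectional_calls(edges):
    """Return sorted pairs of distinct services that call each other directly."""
    edge_set = set(edges)
    pairs = {
        tuple(sorted((a, b)))
        for a, b in edge_set
        if a != b and (b, a) in edge_set
    }
    return sorted(pairs)


def find_cycle(edges):
    """Return one call cycle as a list of services (first == last), or None."""
    graph = {}
    for caller, callee in edges:
        graph.setdefault(caller, []).append(callee)
        graph.setdefault(callee, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                # Back edge to a node on the current path: a cycle.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

For example, `find_cycle([("a", "b"), ("b", "c"), ("c", "a")])` returns `["a", "b", "c", "a"]`. A bidirectional call is the two-node special case of a cycle, reported separately because it is usually the easiest to fix.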
Conclusion
The article shares SRE fundamentals, how SRE can use technical debt as a lever to improve system health, and recommends classic books like "Clean Code", "Clean Architecture", and SRE references for deeper study.
References are listed at the end of the original article.
HaoDF Tech Team
HaoDF Online tech practice and sharing—join us to discuss and help create quality healthcare through technology.
