
Introduction to System Stability: Concepts, Metrics, and Practices

This article explains Haro’s approach to system stability. It defines high availability and key metrics such as SLA, RPO/RTO, MTTR/MTBF, and the 5‑5‑10 rule, then outlines cultural and technical safeguards, full‑team participation, process integration, and incremental tooling to prevent faults and ensure rapid recovery.

HelloTech

Haro, a mobility company, treats its two‑wheel business as essential public infrastructure. Even a minor fault can affect thousands of users, so ensuring system stability is critical.

The author, who has participated in Haro's stability engineering efforts, shares insights and experiences to review past work, summarize lessons, and invite discussion.

1. Definition of Stability

Stability (also called High Availability) refers to a system’s ability to run continuously in an expected state and handle user requests reliably. The core idea is long‑term operation within the anticipated state.

Stability is evaluated along three dimensions:

Service availability time

Data integrity

Emergency response efficiency

1.1 SLA (Service‑Level Agreement)

An SLA states availability as the proportion of uptime to total time. The classic “3‑9”, “4‑9”, and “5‑9” levels (99.9%, 99.99%, and 99.999% availability) cap the allowable downtime per year. For example, 4‑9 means at most 52.56 minutes of downtime per year (525,600 minutes × 0.01%).
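The downtime budget behind each level is simple arithmetic. The following is a minimal sketch, assuming a 365‑day year; the function name `allowed_downtime_minutes` is illustrative and not part of any Haro tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Maximum yearly downtime, in minutes, for an availability of N nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** (-nines)  # e.g. 4 nines -> 0.0001 (0.01%)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n}-9: {allowed_downtime_minutes(n):.2f} minutes per year")
```

For 4‑9 this reproduces the 52.56 minutes quoted above. Each additional nine shrinks the budget by a factor of ten.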

1.2 RPO and RTO (Data Integrity)

RPO (Recovery Point Objective) is the maximum acceptable window of data loss: how far back in time the recovered data may be relative to the moment of failure. RTO (Recovery Time Objective) is the maximum acceptable time to restore service after a disaster.
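To make the two objectives concrete, the sketch below compares one recovery against its targets. It is an illustration only: the `RecoveryObjectives` class, the `evaluate_recovery` function, and all timestamps and target values are hypothetical and do not come from Haro's systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerated data-loss window
    rto: timedelta  # maximum tolerated time to restore service


def evaluate_recovery(objectives, last_backup, failed_at, restored_at):
    """Compare one recovery against its RPO and RTO targets."""
    # Data written after the last usable backup is lost on restore.
    data_loss_window = failed_at - last_backup
    # Downtime runs from the failure until service is restored.
    downtime = restored_at - failed_at
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


if __name__ == "__main__":
    # Hypothetical targets and timestamps, for illustration only.
    targets = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(minutes=30))
    result = evaluate_recovery(
        targets,
        last_backup=datetime(2024, 1, 1, 9, 50),
        failed_at=datetime(2024, 1, 1, 10, 0),
        restored_at=datetime(2024, 1, 1, 10, 45),
    )
    print(result)
```

In this hypothetical run the RPO is met (10 minutes of data lost against a 15‑minute target) while the RTO is missed (45 minutes of downtime against a 30‑minute target). The two objectives are independent, so each needs its own check.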

1.3 MTTR and MTBF (Emergency Response Efficiency)

MTTR (Mean Time To Repair) is the average time required to bring a failed system back to an operational state. MTBF (Mean Time Between Failures) is the average interval between failures.
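Both metrics can be derived from an incident log. The sketch below assumes one common convention: MTTR is total repair time divided by the number of failures, and MTBF is total uptime divided by the number of failures. Some teams instead measure MTBF from the start of one failure to the start of the next. The function name and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def mttr_and_mtbf(incidents, window_start, window_end):
    """Return (MTTR, MTBF) for an observation window.

    incidents: list of (failed_at, restored_at) pairs that do not overlap
    and lie entirely inside the window.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total_repair = sum((end - start for start, end in incidents), timedelta())
    total_uptime = (window_end - window_start) - total_repair
    count = len(incidents)
    return total_repair / count, total_uptime / count


if __name__ == "__main__":
    # Hypothetical 30-day window with two incidents (20 and 40 minutes).
    start = datetime(2024, 1, 1)
    end = start + timedelta(days=30)
    log = [
        (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 20)),
        (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 14, 40)),
    ]
    mttr, mtbf = mttr_and_mtbf(log, start, end)
    # Under this convention, availability = MTBF / (MTBF + MTTR).
    availability = mtbf / (mtbf + mttr)
    print(f"MTTR={mttr}, MTBF={mtbf}, availability={availability:.5f}")
```

Under this convention the two metrics tie back to the SLA: shortening MTTR or lengthening MTBF both raise availability.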

The “5‑5‑10” rule further breaks down MTTR: within 5 minutes the alarm must reach the responsible person, within the next 5 minutes the fault scope must be located, and within 10 minutes the core service must be restored.
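A post‑incident check against the rule could look like the sketch below. It rests on one assumption that the rule as stated leaves open: the 10‑minute restore budget is counted from the moment the fault scope is located, not from the start of the fault. The phase names, the function `check_5_5_10`, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Per-phase budgets. ASSUMPTION: each budget starts when the previous
# phase ends, so the restore budget runs from the moment of location.
BUDGETS = {
    "alert": timedelta(minutes=5),     # fault -> alarm reaches the owner
    "locate": timedelta(minutes=5),    # alarm -> fault scope located
    "restore": timedelta(minutes=10),  # located -> core service restored
}


def check_5_5_10(fault_at, alerted_at, located_at, restored_at):
    """Return, per phase, whether the incident stayed within its budget."""
    durations = {
        "alert": alerted_at - fault_at,
        "locate": located_at - alerted_at,
        "restore": restored_at - located_at,
    }
    return {phase: durations[phase] <= BUDGETS[phase] for phase in BUDGETS}


if __name__ == "__main__":
    # Hypothetical incident timeline, for illustration only.
    print(check_5_5_10(
        fault_at=datetime(2024, 1, 1, 10, 0),
        alerted_at=datetime(2024, 1, 1, 10, 3),
        located_at=datetime(2024, 1, 1, 10, 9),
        restored_at=datetime(2024, 1, 1, 10, 17),
    ))
```

Reporting each phase separately shows where the response slowed down. In the hypothetical timeline above, alerting and restoration are within budget, but locating the fault took 6 minutes and exceeds its 5‑minute budget.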

2. Common Measures for Stability

Two key questions are addressed:

How to reduce the probability of faults before they occur.

How to quickly recover service after a fault.

Measures are organized into layers, ranging from culture building (processes, standards, “stability army rules”) to technical safeguards such as multi‑instance deployment, multi‑AZ deployment, active‑active and multi‑active disaster recovery, and emergency response systems.

3. Methodology for Building Stability

1) Full‑team participation: Stability is not only the responsibility of the NOC, SRE, and technical‑risk teams; it requires involvement from developers, testers, middleware engineers, operations staff, and product managers.

2) Closed‑loop process: Embed stability checks throughout the software lifecycle, including design reviews, static code analysis, dependency validation, staged releases, monitoring, and alerting.

3) Incremental accumulation of tools and mechanisms: Replace manual work with tools (inspection platforms, load‑testing platforms, drill platforms) and codify rules (e.g., “development army rules”).

4. Conclusion

This article is the first in a series on stability engineering. Future articles will cover dependency governance, full‑link load testing, fault drills, risk perception, and NOC emergency systems.

Readers are invited to comment, collaborate, and follow the “Haro Technology” public account for more technical sharing.

Written by

HelloTech

Official Hello technology account, sharing tech insights and developments.
