Master Availability, Reliability, and Stability for High‑Availability Systems
Understanding the differences between system availability, reliability, and stability is essential for building resilient services. This guide explains each concept, illustrates their distinctions with examples, and outlines practical strategies for reducing failures and downtime, including rate limiting, anti-scraping, timeout settings, system inspections, and fault post-mortems.
Definitions
Availability means the system can serve requests at any given moment, with typical industry targets of 99.9%–99.99% (roughly 500 down to 50 minutes of downtime per year). Reliability refers to the system’s ability to run without failures over a long period. Stability builds on availability and reliability, requiring consistent performance without fluctuations.
In practice, a system that crashes for one millisecond per hour technically exceeds 99.9999% availability but remains highly unreliable, while a system that never crashes but is down for two weeks each year is reliable yet only 96% available.
Why High‑Availability Matters
Interviewers who focus on high‑availability questions tend to have more experience than those who ask about high concurrency or performance, because availability concerns both uptime and quality of service.
Strategies to Build a High‑Availability System
1. Reduce Failure Frequency
Rate Limiting
Uncontrolled external traffic can overwhelm a system. Rate limiting protects the system by rejecting excess requests. Tools such as Guava’s RateLimiter or Spring Cloud Alibaba’s Sentinel provide robust functionality, but the key is setting appropriate thresholds and multi‑level controls.
Threshold Setting: If the threshold exceeds the system’s capacity, rate limiting is ineffective; if it is too low, legitimate traffic is blocked and resources are wasted.
Determine a reasonable threshold by replaying real traffic during low‑load periods and incrementally increasing load (1×, 1.5×, 2×, etc.) to find the maximum capacity, then set the limit to 50%–70% of that peak.
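Guava’s RateLimiter and Sentinel both build on token-bucket-style admission; a minimal sketch of the mechanism (this `TokenBucket` class is illustrative, not Guava’s actual internals):

```java
// Minimal token-bucket limiter: refills `permitsPerSecond` tokens per second,
// up to `capacity`. Illustrative sketch only -- production code should use
// Guava's RateLimiter or Sentinel, which handle warm-up and fairness.
public class TokenBucket {
    private final long capacity;        // maximum stored permits
    private final double refillPerNano; // permits added per nanosecond
    private double tokens;              // currently available permits
    private long lastRefill;            // nanotime of the last refill

    public TokenBucket(long capacity, double permitsPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = permitsPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    // Returns true if a permit was available; false means reject the request.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

The threshold found by load replay maps directly to `permitsPerSecond`: set it to 50%–70% of the measured peak.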
Multi‑Level Controls
Global rate limiting may not suffice for all scenarios. For high‑latency endpoints, apply additional per‑endpoint limits (total‑plus‑sub‑level control). This approach also applies when serving multiple external clients or business lines.
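Total-plus-sub-level control can be sketched as two gates that a request must pass in sequence; the class and method names here are hypothetical, and concurrency permits stand in for whatever limiter primitive you actually use:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Sketch of total-plus-sub-level limiting: a request is admitted only if
// both the global gate and its endpoint's gate have capacity left.
public class TieredLimiter {
    private final Semaphore global;
    private final Map<String, Semaphore> perEndpoint = new ConcurrentHashMap<>();
    private final int endpointPermits;

    public TieredLimiter(int globalPermits, int endpointPermits) {
        this.global = new Semaphore(globalPermits);
        this.endpointPermits = endpointPermits;
    }

    public boolean tryEnter(String endpoint) {
        Semaphore local = perEndpoint.computeIfAbsent(
                endpoint, e -> new Semaphore(endpointPermits));
        if (!local.tryAcquire()) return false;  // per-endpoint cap hit
        if (!global.tryAcquire()) {             // global cap hit: roll back
            local.release();
            return false;
        }
        return true;
    }

    public void exit(String endpoint) {
        global.release();
        perEndpoint.get(endpoint).release();
    }
}
```

The same structure extends to per-client or per-business-line gates: add one more `Semaphore` layer keyed by the caller's identity.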
2. Anti‑Scraping (Prevent Abuse)
Anti‑scraping limits the number of requests a single user or IP can make to a specific endpoint, defending against DDoS attacks, ticket‑snatching bots, and accidental repeated calls.
Prefer early interception: first at the WAF, then Nginx, and finally at the application layer.
Common implementations:
WAF IP fencing and DDoS rate limiting.
Nginx limit_req_zone to cap request rates per IP.
A Redis counter keyed on username + method, with an expiration, to enforce per‑user limits.
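The application-layer option is a fixed-window counter. Below is an in-process sketch of the same pattern (with Redis this would be INCR plus EXPIRE on the `user:method` key, ideally combined in one Lua script for atomicity); the class name and window semantics are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-process sketch of the per-user fixed-window limiter: key = user + ":" + method,
// increment a counter, start a fresh window once the old one expires, and reject
// when the count inside the current window exceeds the cap.
public class PerUserWindowLimiter {
    private static class Window { long resetAtMillis; int count; }

    private final Map<String, Window> windows = new ConcurrentHashMap<>();
    private final int maxPerWindow;
    private final long windowMillis;

    public PerUserWindowLimiter(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean allow(String user, String method) {
        String key = user + ":" + method;
        long now = System.currentTimeMillis();
        Window w = windows.get(key);
        if (w == null || now >= w.resetAtMillis) {  // window expired: reset
            w = new Window();
            w.resetAtMillis = now + windowMillis;
            windows.put(key, w);
        }
        return ++w.count <= maxPerWindow;
    }
}
```

An in-process map only works on a single node; the Redis variant exists precisely so the count is shared across all application instances.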
3. Timeout Settings
Distinguish between strong and weak dependencies on downstream services. Strong dependencies cause the caller to fail when the downstream service is unavailable; weak dependencies allow the caller to continue with degraded functionality.
Set timeout values between the 99th percentile (TP99) and the 99.9th percentile (TP999) of response times. For example, if TP99 is 100 ms, a timeout of around 200 ms cuts off long‑tail latency without sacrificing the vast majority of requests.
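Deriving the threshold from recent latency samples can be sketched as follows (the nearest-rank percentile rule and the midpoint heuristic are illustrative choices, not a prescribed formula):

```java
import java.util.Arrays;

// Nearest-rank percentile over a latency sample: sort, then take the value
// at ceil(p * n) - 1. Used here to pick a timeout between TP99 and TP999.
public class LatencyPercentile {
    public static long percentile(long[] samplesMillis, double p) {
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static long suggestTimeout(long[] samplesMillis) {
        long tp99 = percentile(samplesMillis, 0.99);
        long tp999 = percentile(samplesMillis, 0.999);
        return (tp99 + tp999) / 2;  // land between TP99 and TP999
    }
}
```

In practice the samples would come from your metrics system (e.g. a histogram per downstream call), recomputed periodically so the timeout tracks real traffic.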
4. System Inspection
Conduct inspections after code releases or before peak traffic periods to uncover hidden issues.
Machine and runtime metrics: CPU, memory, disk, network, and JVM health.
System metrics: QPS, TPS, response times, error rates.
Database health: new slow queries, execution counts, and durations.
Example: For an online‑education platform with peak usage from 19:00–21:00, stop deployments after 16:00, perform a thorough inspection at 17:00, and use the remaining two hours to fix any detected problems.
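Several of the JVM-side numbers in such an inspection can be pulled from the standard management beans; a minimal sketch (the `snapshot` helper and its output format are illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

// Snapshot a few JVM/host metrics from the standard MXBeans -- the kind of
// numbers an inspection script would sample before a peak-traffic window.
public class InspectionSnapshot {
    public static String snapshot() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        long heapUsedMb = mem.getHeapMemoryUsage().getUsed() / (1024 * 1024);
        long heapMaxMb = mem.getHeapMemoryUsage().getMax() / (1024 * 1024);
        // getSystemLoadAverage() may return -1 on platforms without load averages.
        return String.format("heap=%d/%dMB threads=%d load=%.2f",
                heapUsedMb, heapMaxMb,
                threads.getThreadCount(),
                os.getSystemLoadAverage());
    }
}
```

QPS, error rates, and slow-query counts come from your monitoring and database layers rather than the JVM, but the idea is the same: one script, one snapshot, compared against a known-good baseline.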
5. Fault Post‑Mortem
Use structured analysis methods (5 Whys, 5W2H, Golden Three Questions) to dig deep into the root cause of incidents, avoiding superficial explanations.
After identifying causes, categorize actions by urgency and importance, assign owners, set deadlines, and track progress until completion. Ensure TODO items are concrete (e.g., “add circuit breaker to service X”) rather than vague slogans.
Conclusion
This article covered the first pillar—reducing failure frequency—of high‑availability engineering. Future articles will explore lowering failure duration and shrinking failure scope.
Senior Tony
Former senior tech manager at Meituan, ex‑tech director at New Oriental, with experience at JD.com and Qunar; specializes in Java interview coaching and regularly shares hardcore technical content. Runs a video channel of the same name.