Master Availability, Reliability, and Stability for High‑Availability Systems
Understanding the differences between system availability, reliability, and stability is essential for building resilient services. This guide explains each concept, illustrates their distinctions with examples, and outlines practical strategies for reducing failures and downtime, including rate limiting, anti-scraping, timeout settings, system inspections, and fault post-mortems.
Definitions
Availability means the system can serve requests at any given moment, with typical industry targets of 99.9%–99.99% (roughly 500 down to 50 minutes of downtime per year). Reliability refers to the system’s ability to run without failures over a long period. Stability builds on availability and reliability, requiring consistent performance without fluctuations.
In practice, a system that crashes for one millisecond per hour technically exceeds 99.9999% availability but remains highly unreliable, while a system that never crashes but is down for two weeks each year is reliable yet only 96% available.
Why High‑Availability Matters
Interviewers who focus on high‑availability questions tend to have more experience than those who ask about high concurrency or performance, because availability concerns both uptime and quality of service.
Strategies to Build a High‑Availability System
1. Reduce Failure Frequency
Rate Limiting
Uncontrolled external traffic can overwhelm a system. Rate limiting protects the system by rejecting excess requests. Tools such as Guava’s RateLimiter or Spring Cloud Alibaba’s Sentinel provide robust functionality, but the key is setting appropriate thresholds and multi‑level controls.
Threshold Setting: If the threshold exceeds the system’s capacity, rate limiting is ineffective; if it is too low, legitimate traffic is blocked and resources are wasted.
Determine a reasonable threshold by replaying real traffic during low‑load periods and incrementally increasing load (1×, 1.5×, 2×, etc.) to find the maximum capacity, then set the limit to 50%–70% of that peak.
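Guava’s RateLimiter and Sentinel both build on token-bucket-style admission; a minimal sketch of the mechanism (this `TokenBucket` class is illustrative, not Guava’s actual internals):

```java
// Minimal token-bucket limiter: refills `permitsPerSecond` tokens per second,
// up to `capacity`. Illustrative sketch only -- production code should use
// Guava's RateLimiter or Sentinel, which handle warm-up and fairness.
public class TokenBucket {
    private final long capacity;        // maximum stored permits
    private final double refillPerNano; // permits added per nanosecond
    private double tokens;              // currently available permits
    private long lastRefill;            // nanotime of the last refill

    public TokenBucket(long capacity, double permitsPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = permitsPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    // Returns true if a permit was available; false means reject the request.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

The threshold found by load replay maps directly to `permitsPerSecond`: set it to 50%–70% of the measured peak.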
Multi‑Level Controls
Global rate limiting may not suffice for all scenarios. For high‑latency endpoints, apply additional per‑endpoint limits (total‑plus‑sub‑level control). This approach also applies when serving multiple external clients or business lines.
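Total-plus-sub-level control can be sketched as two gates that a request must pass in sequence; the class and method names here are hypothetical, and concurrency permits stand in for whatever limiter primitive you actually use:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Sketch of total-plus-sub-level limiting: a request is admitted only if
// both the global gate and its endpoint's gate have capacity left.
public class TieredLimiter {
    private final Semaphore global;
    private final Map<String, Semaphore> perEndpoint = new ConcurrentHashMap<>();
    private final int endpointPermits;

    public TieredLimiter(int globalPermits, int endpointPermits) {
        this.global = new Semaphore(globalPermits);
        this.endpointPermits = endpointPermits;
    }

    public boolean tryEnter(String endpoint) {
        Semaphore local = perEndpoint.computeIfAbsent(
                endpoint, e -> new Semaphore(endpointPermits));
        if (!local.tryAcquire()) return false;  // per-endpoint cap hit
        if (!global.tryAcquire()) {             // global cap hit: roll back
            local.release();
            return false;
        }
        return true;
    }

    public void exit(String endpoint) {
        global.release();
        perEndpoint.get(endpoint).release();
    }
}
```

The same structure extends to per-client or per-business-line gates: add one more `Semaphore` layer keyed by the caller's identity.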
2. Anti‑Scraping (Prevent Abuse)
Anti‑scraping limits the number of requests a single user or IP can make to a specific endpoint, defending against DDoS attacks, ticket‑snatching bots, and accidental repeated calls.
Prefer early interception: first at the WAF, then Nginx, and finally at the application layer.
Common implementations:
WAF IP fencing and DDoS rate limiting.
Nginx limit_req_zone to cap request rates per IP.
A Redis counter keyed on username + method, with an expiration, to enforce per‑user limits.
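The application-layer option is a fixed-window counter. Below is an in-process sketch of the same pattern (with Redis this would be INCR plus EXPIRE on the `user:method` key, ideally combined in one Lua script for atomicity); the class name and window semantics are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-process sketch of the per-user fixed-window limiter: key = user + ":" + method,
// increment a counter, start a fresh window once the old one expires, and reject
// when the count inside the current window exceeds the cap.
public class PerUserWindowLimiter {
    private static class Window { long resetAtMillis; int count; }

    private final Map<String, Window> windows = new ConcurrentHashMap<>();
    private final int maxPerWindow;
    private final long windowMillis;

    public PerUserWindowLimiter(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean allow(String user, String method) {
        String key = user + ":" + method;
        long now = System.currentTimeMillis();
        Window w = windows.get(key);
        if (w == null || now >= w.resetAtMillis) {  // window expired: reset
            w = new Window();
            w.resetAtMillis = now + windowMillis;
            windows.put(key, w);
        }
        return ++w.count <= maxPerWindow;
    }
}
```

An in-process map only works on a single node; the Redis variant exists precisely so the count is shared across all application instances.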
3. Timeout Settings
Distinguish between strong and weak dependencies on downstream services. Strong dependencies cause the caller to fail when the downstream service is unavailable; weak dependencies allow the caller to continue with degraded functionality.
Set timeout values between the 99th percentile (TP99) and the 99.9th percentile (TP999) of response times. For example, if TP99 is 100 ms, a timeout of around 200 ms cuts off long‑tail latency without sacrificing the vast majority of requests.
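Deriving the threshold from recent latency samples can be sketched as follows (the nearest-rank percentile rule and the midpoint heuristic are illustrative choices, not a prescribed formula):

```java
import java.util.Arrays;

// Nearest-rank percentile over a latency sample: sort, then take the value
// at ceil(p * n) - 1. Used here to pick a timeout between TP99 and TP999.
public class LatencyPercentile {
    public static long percentile(long[] samplesMillis, double p) {
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static long suggestTimeout(long[] samplesMillis) {
        long tp99 = percentile(samplesMillis, 0.99);
        long tp999 = percentile(samplesMillis, 0.999);
        return (tp99 + tp999) / 2;  // land between TP99 and TP999
    }
}
```

In practice the samples would come from your metrics system (e.g. a histogram per downstream call), recomputed periodically so the timeout tracks real traffic.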
4. System Inspection
Conduct inspections after code releases or before peak traffic periods to uncover hidden issues.
Machine and runtime metrics: CPU, memory, disk, network, and JVM health.
System metrics: QPS, TPS, response times, error rates.
Database health: new slow queries, execution counts, and durations.
Example: For an online‑education platform with peak usage from 19:00–21:00, stop deployments after 16:00, perform a thorough inspection at 17:00, and use the remaining two hours to fix any detected problems.
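Several of the JVM-side numbers in such an inspection can be pulled from the standard management beans; a minimal sketch (the `snapshot` helper and its output format are illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

// Snapshot a few JVM/host metrics from the standard MXBeans -- the kind of
// numbers an inspection script would sample before a peak-traffic window.
public class InspectionSnapshot {
    public static String snapshot() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        long heapUsedMb = mem.getHeapMemoryUsage().getUsed() / (1024 * 1024);
        long heapMaxMb = mem.getHeapMemoryUsage().getMax() / (1024 * 1024);
        // getSystemLoadAverage() may return -1 on platforms without load averages.
        return String.format("heap=%d/%dMB threads=%d load=%.2f",
                heapUsedMb, heapMaxMb,
                threads.getThreadCount(),
                os.getSystemLoadAverage());
    }
}
```

QPS, error rates, and slow-query counts come from your monitoring and database layers rather than the JVM, but the idea is the same: one script, one snapshot, compared against a known-good baseline.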
5. Fault Post‑Mortem
Use structured analysis methods (5 Whys, 5W2H, Golden Three Questions) to dig deep into the root cause of incidents, avoiding superficial explanations.
After identifying causes, categorize actions by urgency and importance, assign owners, set deadlines, and track progress until completion. Ensure TODO items are concrete (e.g., “add circuit breaker to service X”) rather than vague slogans.
Conclusion
This article covered the first pillar—reducing failure frequency—of high‑availability engineering. Future articles will explore lowering failure duration and shrinking failure scope.
Senior Tony
Former senior tech manager at Meituan, ex‑tech director at New Oriental, with experience at JD.com and Qunar; specializes in Java interview coaching and regularly shares hardcore technical content. Runs a video channel of the same name.