How to Achieve Five Nines: Practical High‑Availability Strategies for Modern Web Systems
This article explains key high‑availability concepts such as availability metrics, microservice modularization, load balancing, rate limiting, circuit breaking, isolation, retry strategies, rollback plans, stress testing, monitoring, and on‑call processes, providing concrete design guidelines for building resilient internet services.
Why High Availability Matters
In the Internet industry, especially for payment systems, high availability (HA) is a critical performance indicator. This guide summarizes essential HA practices based on real‑world experience.
Availability Metrics and Evaluation
Website downtime = fault repair timestamp – fault detection timestamp
Annual availability = (1 – downtime / total year time) × 100%
Reaching “three nines” (99.9%) is relatively easy with manual operations, while “four nines” (99.99%) requires a robust on‑call system, fault‑handling processes, and automated recovery. “Five nines” (99.999%) demands fully automated disaster‑recovery mechanisms because human response cannot meet the required speed.
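To make the "nines" concrete, the availability formula above can be inverted to give the downtime each target allows per year. The sketch below assumes a 365-day year; the class and method names are hypothetical and chosen only for illustration.

```java
import java.time.Duration;

// Hypothetical helper: converts between yearly downtime and annual availability.
final class DowntimeBudget {
    // Assumption: a non-leap year is used as "total year time".
    static final Duration YEAR = Duration.ofDays(365);

    // Annual availability = (1 - downtime / total year time) * 100%
    static double availabilityPercent(Duration downtime) {
        return (1.0 - (double) downtime.getSeconds() / YEAR.getSeconds()) * 100.0;
    }

    // Inverse of the formula: downtime allowed per year for a given target.
    static Duration allowedDowntime(double availabilityPercent) {
        double unavailableFraction = 1.0 - availabilityPercent / 100.0;
        return Duration.ofSeconds(Math.round(YEAR.getSeconds() * unavailableFraction));
    }
}
```

Under this assumption, 99.9% allows about 8.76 hours of downtime per year, 99.99% about 52.6 minutes, and 99.999% about 5.26 minutes, which is why the last target leaves almost no room for manual intervention.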
System Modularity and Micro‑services
Monolithic back‑ends that host product, order, and payment services together cause a single failure to bring down the entire system. Modern micro‑service architectures split functionality by domain, isolating failures and forming the foundation of HA.
High‑Availability Design for Dependent Components (MySQL, Redis, etc.)
Critical middleware must also be highly available. For MySQL, use a same‑city primary‑backup deployment with cross‑region disaster recovery, and put a proxy service (CDB) in front of it so that applications never address the actual database directly. For Redis, adopt Sentinel for automatic failover.
Load Balancing
Load balancing distributes traffic and eliminates single points of failure. Common solutions include:
LVS – Linux Virtual Server provides high‑performance, scalable, reliable load balancing across data centers.
Nginx – Often sits behind LVS to handle HTTP/HTTPS traffic.
API gateway – Deploy multiple replicas for high availability.
Application services – Each micro‑service instance participates in load balancing.
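As an illustration of how traffic can be spread across service instances, here is a minimal round‑robin selector. Round robin is only one common strategy and is not named in the list above; the class name and the instance addresses are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: picks service instances in rotation.
final class RoundRobinBalancer {
    private final List<String> instances;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBalancer(List<String> instances) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("at least one instance is required");
        }
        this.instances = List.copyOf(instances);
    }

    String pick() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), instances.size());
        return instances.get(index);
    }
}
```

A production balancer would also remove unhealthy instances from rotation; this sketch leaves health checking out.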
Rate Limiting
Rate limiting protects the system by capping the number of requests it accepts, either per time window or concurrently.
1. Single‑machine rate limiting – Uses in‑memory counters (e.g., AtomicLong.incrementAndGet()) but cannot enforce global limits.
2. Distributed rate limiting – Controls traffic at the cluster level, protecting downstream services.
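The single‑machine variant can be sketched as a fixed‑window counter built on AtomicLong.incrementAndGet(). This is a minimal illustration, not a definitive implementation: the class name is hypothetical, the current time is passed in as a parameter to keep the example deterministic, and requests that race with a window reset may be counted in either window.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical single-machine limiter: at most `limit` requests per window.
final class FixedWindowLimiter {
    private final long limit;
    private final long windowMillis;
    private final AtomicLong count = new AtomicLong();
    private volatile long windowStart; // starts at 0; the first real timestamp opens a new window

    FixedWindowLimiter(long limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    boolean tryAcquire(long nowMillis) {
        if (nowMillis - windowStart >= windowMillis) {
            synchronized (this) {
                // Re-check under the lock so only one thread resets the window.
                if (nowMillis - windowStart >= windowMillis) {
                    windowStart = nowMillis;
                    count.set(0);
                }
            }
        }
        // The in-memory counter only sees traffic of this process,
        // which is why it cannot enforce a cluster-wide limit.
        return count.incrementAndGet() <= limit;
    }
}
```

In real use the caller would pass System.currentTimeMillis(). A distributed limiter would keep the counter in a shared store instead of process memory.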
Rate limiting supports multiple dimensions:
Total requests per time window (e.g., per minute).
Per‑API request volume.
Per‑IP, city, channel, device ID, user ID, etc.
Per‑appkey rules for open platforms.
Common algorithms: counter, leaky bucket, token bucket.
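Of these algorithms, the token bucket is the one that tolerates short bursts while still bounding the average rate. The following is a minimal sketch under stated assumptions: the class name and constructor parameters are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```java
// Hypothetical token bucket: holds up to `capacity` tokens, refilled at a steady rate.
final class TokenBucket {
    private final long capacity;
    private final double refillPerMilli;
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double tokensPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerMilli = tokensPerSecond / 1000.0;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefill = nowMillis;
    }

    synchronized boolean tryAcquire(long nowMillis) {
        // Add the tokens earned since the last call, capped at capacity.
        long elapsed = Math.max(0, nowMillis - lastRefill);
        tokens = Math.min(capacity, tokens + elapsed * refillPerMilli);
        lastRefill = Math.max(lastRefill, nowMillis);
        if (tokens >= 1.0) {
            tokens -= 1.0; // each request consumes one token
            return true;
        }
        return false;
    }
}
```

A leaky bucket differs in that it drains requests at a constant rate, so it smooths bursts rather than permitting them.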
Circuit Breaking (Fail‑Fast)
Circuit breaking limits calls to an unstable resource, causing immediate failures to prevent cascading errors. Implement fail‑fast logic to return errors quickly and let upstream services handle them.
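The fail‑fast behavior can be sketched as a small state machine: after a number of consecutive failures the breaker opens and returns a fallback immediately, then allows a trial call once a cool‑down has passed. The class, its thresholds, and the fallback parameter are hypothetical; the lock is held during the protected call for simplicity, which a production breaker would avoid.

```java
import java.util.function.Supplier;

// Hypothetical circuit breaker with CLOSED, OPEN, and HALF_OPEN states.
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized <T> T call(Supplier<T> protectedCall, Supplier<T> fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < openMillis) {
                return fallback.get(); // fail fast: do not touch the unstable resource
            }
            state = State.HALF_OPEN; // cool-down elapsed: allow one trial call
        }
        try {
            T result = protectedCall.get();
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = nowMillis;
            }
            return fallback.get();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

While the breaker is open, even a call that would have succeeded receives the fallback; that is the trade‑off that stops a struggling dependency from being overwhelmed.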
Isolation
Isolation separates services physically or logically, reducing coupling. Each subsystem has its own codebase and its own deployment, and can additionally be isolated at the thread level so that one slow dependency cannot exhaust the threads that other work needs.
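One simple way to approximate thread‑level isolation is a bulkhead that caps how many threads may call a given dependency at once. The sketch below uses a Semaphore rather than a dedicated thread pool; that choice, the class name, and the rejection fallback are assumptions made for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical bulkhead: limits concurrent calls to one dependency.
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> task, Supplier<T> rejected) {
        // tryAcquire does not block: callers over the limit are rejected immediately.
        if (!permits.tryAcquire()) {
            return rejected.get();
        }
        try {
            return task.get();
        } finally {
            permits.release();
        }
    }
}
```

Giving each downstream dependency its own Bulkhead instance keeps a slowdown in one of them from consuming every request thread.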
Timeouts and Retries
Network unreliability makes timeouts common. Retries improve user experience but must be combined with idempotency to avoid duplicate actions (e.g., a bank transfer executed twice). Send an idempotency key in the request headers so the server can recognize a repeated request.
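The interaction between retries and idempotency can be sketched as follows: the client retries with the same key, and the server executes the action only the first time it sees that key. Everything here is hypothetical (class, method names, in‑memory storage), and backoff between attempts is omitted for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of an idempotent handler plus a retrying client.
final class IdempotentTransfers {
    // Server side: remembers the result already produced for each idempotency key.
    private final Map<String, String> results = new ConcurrentHashMap<>();
    private final AtomicInteger executed = new AtomicInteger();

    String handle(String idempotencyKey, long amountCents) {
        // computeIfAbsent runs the transfer at most once per key.
        return results.computeIfAbsent(idempotencyKey, key -> {
            executed.incrementAndGet();
            return "transferred " + amountCents;
        });
    }

    int executedCount() {
        return executed.get();
    }

    // Client side: retries the same call, which must reuse the same key.
    static <T> T retry(int maxAttempts, Supplier<T> call) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // e.g. a timeout; try again
            }
        }
        throw last;
    }
}
```

If the first response is lost after the server has already processed the request, the retry returns the stored result instead of moving the money a second time.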
Rollback
New feature releases often introduce bugs; a rollback plan is essential to revert quickly when issues arise.
Stress Testing and Contingency Plans
A stress‑test plan defines the load to apply, the test strategy, and the metrics to track (QPS, response time, success rate). Types include single‑machine, cluster, full‑link, read/write, simulation, and isolation‑cluster tests.
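The three metrics named above can be derived from raw request samples as shown in this sketch. The class name and the input layout (parallel arrays of latencies and outcomes) are assumptions for illustration; real load‑testing tools report these figures directly.

```java
// Hypothetical summary of one stress-test run.
final class LoadTestSummary {
    final double qps;               // requests completed per second
    final double avgResponseMillis; // mean latency over all requests
    final double successRate;       // fraction of requests that succeeded (0.0 to 1.0)

    private LoadTestSummary(double qps, double avgResponseMillis, double successRate) {
        this.qps = qps;
        this.avgResponseMillis = avgResponseMillis;
        this.successRate = successRate;
    }

    static LoadTestSummary of(long[] latenciesMillis, boolean[] succeeded, double durationSeconds) {
        int n = latenciesMillis.length;
        if (n == 0 || n != succeeded.length || durationSeconds <= 0) {
            throw new IllegalArgumentException("need matching, non-empty samples and a positive duration");
        }
        long totalLatency = 0;
        int successes = 0;
        for (int i = 0; i < n; i++) {
            totalLatency += latenciesMillis[i];
            if (succeeded[i]) {
                successes++;
            }
        }
        return new LoadTestSummary(n / durationSeconds, (double) totalLatency / n, (double) successes / n);
    }
}
```

Averages hide tail latency, so in practice percentiles are usually tracked alongside the mean.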
Emergency plans should cover every layer:
Network layer (DNS, LVS, HAProxy)
Application entry (Nginx, OpenResty)
Web layer (Tomcat)
Service layer (Dubbo)
Data layer (Redis, DB)
Monitoring and Alerting
Comprehensive metrics (hardware, JVM, business, logs) and well‑chosen alert thresholds are vital. Many companies run an “eagle‑eye” style monitoring system so that issues are detected as early as possible.
On‑Call System and Release Checklist
A mature on‑call rotation and a detailed release checklist dramatically reduce incidents caused by new feature deployments.
Architect's Guide
Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.