
How to Achieve Five Nines: Practical High‑Availability Strategies for Modern Web Systems

This article explains key high‑availability concepts such as availability metrics, microservice modularization, load balancing, rate limiting, circuit breaking, isolation, retry strategies, rollback plans, stress testing, monitoring, and on‑call processes, providing concrete design guidelines for building resilient internet services.

Architect's Guide

Why High Availability Matters

In the Internet industry, especially for payment systems, high availability (HA) is a critical performance indicator. This guide summarizes essential HA practices based on real‑world experience.

Availability Metrics and Evaluation

Website downtime = fault repair timestamp – fault detection timestamp

Annual availability = (1 – downtime / total year time) × 100%

Reaching “three nines” (99.9%) is relatively easy with manual operations, while “four nines” (99.99%) requires a robust on‑call system, fault‑handling processes, and automated recovery. “Five nines” (99.999%) demands fully automated disaster‑recovery mechanisms because human response cannot meet the required speed.
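To make the gap between these targets concrete, the availability formula above can be inverted to give a yearly downtime budget. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes a 365‑day year. Under that assumption the budget is roughly 8.76 hours for three nines, about 52.6 minutes for four nines, and about 5.3 minutes for five nines.

```java
import java.time.Duration;

// Hypothetical helper (not from the original article): converts between
// an availability percentage and a yearly downtime budget.
final class AvailabilityBudget {
    // Assumption: a non-leap year of 365 days.
    static final Duration YEAR = Duration.ofDays(365);

    // allowed downtime = (1 - availability) * total year time
    static Duration allowedDowntime(double availabilityPercent) {
        double unavailableFraction = 1.0 - availabilityPercent / 100.0;
        long seconds = Math.round(YEAR.getSeconds() * unavailableFraction);
        return Duration.ofSeconds(seconds);
    }

    // annual availability = (1 - downtime / total year time) * 100%
    static double annualAvailability(Duration downtime) {
        return (1.0 - (double) downtime.getSeconds() / YEAR.getSeconds()) * 100.0;
    }
}
```

A five‑nines budget of a few minutes per year is spent by a single slow manual intervention, which is why that level depends on automated recovery.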

System Modularity and Micro‑services

In a monolithic back end that hosts product, order, and payment services together, a single failure can bring down the entire system. Modern micro‑service architectures split functionality by domain, so a failure stays inside one service; this isolation is the foundation of HA.

High‑Availability Design for Dependent Components (MySQL, Redis, etc.)

Critical middleware must also be highly available. For MySQL, deploy a primary and a backup in the same city, add cross‑region disaster recovery, and put a proxy service (CDB) in front so that applications never address the actual DB directly. For Redis, adopt Sentinel for automatic failover.

Load Balancing

Load balancing distributes traffic and eliminates single points of failure. Common solutions include:

LVS – Linux Virtual Server provides high‑performance, scalable, reliable load balancing at the transport layer (layer 4), in front of the other tiers.

Nginx – Often sits behind LVS to handle HTTP/HTTPS traffic.

API gateway – Deploy multiple replicas for high availability.

Application services – Each micro‑service instance participates in load balancing.
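At the application‑service level, the simplest balancing policy is round robin over the known instances, skipping any that fail a health check. The following is a minimal sketch under stated assumptions: `RoundRobinBalancer`, the instance names, and the `isHealthy` predicate are all hypothetical, and real deployments usually delegate this to a service registry or an RPC framework.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical client-side balancer: rotates through instances and
// skips the ones the caller-supplied health check rejects.
final class RoundRobinBalancer {
    private final List<String> instances;
    private final AtomicInteger cursor = new AtomicInteger();

    RoundRobinBalancer(List<String> instances) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("at least one instance is required");
        }
        this.instances = List.copyOf(instances);
    }

    // Tries each instance at most once per call; empty when none is healthy.
    Optional<String> next(Predicate<String> isHealthy) {
        for (int attempt = 0; attempt < instances.size(); attempt++) {
            // floorMod keeps the index valid even after the counter overflows.
            int index = Math.floorMod(cursor.getAndIncrement(), instances.size());
            String candidate = instances.get(index);
            if (isHealthy.test(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Because an unhealthy instance is skipped rather than returned, the loss of one replica reduces capacity but does not fail requests, which is the sense in which load balancing removes a single point of failure.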

Rate Limiting

Rate limiting protects the system by capping how many requests it accepts, either concurrently or per time window, and rejecting the excess.

1. Single‑machine rate limiting – Uses in‑memory counters (e.g., AtomicLong.incrementAndGet()). Each instance counts only its own traffic, so it cannot enforce a global limit.

2. Distributed rate limiting – Controls traffic at the cluster level, protecting downstream services.
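The single‑machine variant can be sketched with the `AtomicLong` counter mentioned above. This is one possible shape rather than a reference implementation: the class name `ConcurrencyLimiter` and its methods are hypothetical, and it caps in‑flight requests on one instance only.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical single-machine limiter: caps the number of requests
// being processed at the same time on this instance.
final class ConcurrencyLimiter {
    private final long maxInFlight;
    private final AtomicLong inFlight = new AtomicLong();

    ConcurrencyLimiter(long maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    // Returns true when the request may proceed; callers must then call exit().
    boolean tryEnter() {
        if (inFlight.incrementAndGet() > maxInFlight) {
            inFlight.decrementAndGet(); // over the limit: undo the increment and reject
            return false;
        }
        return true;
    }

    // Call in a finally block once an admitted request has completed.
    void exit() {
        inFlight.decrementAndGet();
    }

    long inFlight() {
        return inFlight.get();
    }
}
```

A request handler would wrap its work as `if (limiter.tryEnter()) { try { ... } finally { limiter.exit(); } }` and return an error or a degraded response otherwise.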

Rate limiting supports multiple dimensions:

Total requests per time window (e.g., per minute).

Per‑API request volume.

Per‑IP, city, channel, device ID, user ID, etc.

Per‑appkey rules for open platforms.

Common algorithms: counter, leaky bucket, token bucket.
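Of these, the token bucket is often chosen because it allows short bursts while bounding the long‑run rate. Below is a minimal single‑process sketch, assuming a hypothetical `TokenBucket` class; the current time is passed in explicitly so that the behaviour is easy to test, and a caller would normally supply `System.nanoTime()`.

```java
// Hypothetical token bucket: holds up to `capacity` tokens and refills
// continuously at `tokensPerSecond`; each admitted request spends one token.
final class TokenBucket {
    private final long capacity;
    private final double tokensPerSecond;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double tokensPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.tokensPerSecond = tokensPerSecond;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillNanos = nowNanos;
    }

    synchronized boolean tryAcquire(long nowNanos) {
        // Ignore clock readings that go backwards.
        long elapsedNanos = Math.max(0L, nowNanos - lastRefillNanos);
        double refill = (elapsedNanos / 1_000_000_000.0) * tokensPerSecond;
        tokens = Math.min(capacity, tokens + refill);
        lastRefillNanos = Math.max(lastRefillNanos, nowNanos);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false; // bucket empty: reject (or queue) the request
    }
}
```

With a capacity of 2 and a rate of 1 token per second, two back‑to‑back requests pass, a third is rejected, and another request passes once enough time has elapsed for a token to refill.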

Circuit Breaking (Fail‑Fast)

Circuit breaking cuts off calls to an unstable resource: once the resource is judged unhealthy, further calls fail immediately instead of waiting on it, which keeps the failure from cascading to callers. Implement fail‑fast logic to return errors quickly and let upstream services handle them.
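The fail‑fast behaviour can be sketched as a small state machine with closed, open, and half‑open states. This is an illustrative sketch, not the design of any particular library: the class `CircuitBreaker`, the consecutive‑failure threshold, and the cool‑down parameter are assumptions, and the clock is injected so the example can be exercised without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical circuit breaker: opens after N consecutive failures,
// fails fast while open, and probes the resource again after a cool-down.
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // fail fast: the resource is not called at all
        }
        try {
            T result = action.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) {
                return false;
            }
            state = State.HALF_OPEN; // cool-down elapsed: let calls probe the resource
        }
        return true;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

The lock is held only while reading or updating the state, never while the protected action runs, so a slow dependency does not serialize callers.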

Isolation

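Isolation separates services physically or logically, reducing coupling. Each subsystem has its own codebase and its own deployment, and calls can additionally be isolated at the thread level so that one slow dependency cannot exhaust the resources shared by the others.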
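One lightweight way to approximate thread‑level isolation inside a single process is a bulkhead that caps how many threads may be inside a given dependency at once. The sketch below swaps in a semaphore for the per‑dependency thread pools that are also commonly used; the class name `Bulkhead` and its API are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical semaphore-based bulkhead: at most `maxConcurrent` callers
// may execute the protected dependency at the same time.
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> action, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get(); // compartment full: reject without blocking
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always hand the slot back, even on exceptions
        }
    }
}
```

Giving each downstream dependency its own `Bulkhead` instance means that a stalled payment call, for example, can occupy only its own slots and leaves the threads needed by other calls untouched.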

Timeouts and Retries

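Networks are unreliable, so timeouts are common. Retrying a timed‑out request improves user experience, but a timeout does not tell the caller whether the request was actually processed, so retries must be combined with idempotency to avoid duplicate actions (e.g., double bank transfers). Send an idempotency key in the request headers so the server can recognize a repeated request.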
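The interaction between retries and idempotency can be sketched as follows. Everything here is hypothetical and in‑memory: `IdempotentTransferService` stands in for a server that remembers results by idempotency key, and `Retry` is a simple client‑side helper with exponential backoff. A real service would persist the keys and expire them.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical server side: a repeated key returns the stored result
// instead of debiting the account a second time.
final class IdempotentTransferService {
    private final Map<String, String> resultsByKey = new ConcurrentHashMap<>();
    private final AtomicLong balance;

    IdempotentTransferService(long initialBalance) {
        this.balance = new AtomicLong(initialBalance);
    }

    String transfer(String idempotencyKey, long amount) {
        // computeIfAbsent runs the function at most once per key.
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> {
            balance.addAndGet(-amount);
            return "transferred " + amount;
        });
    }

    long balance() {
        return balance.get();
    }
}

// Hypothetical client side: retries a call with exponential backoff.
final class Retry {
    static <T> T withRetries(int maxAttempts, long baseDelayMillis, Supplier<T> call) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(baseDelayMillis << (attempt - 1)); // 1x, 2x, 4x, ...
                }
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

The key must be generated once per logical operation and reused on every retry; generating a fresh key per attempt would defeat the deduplication.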

Rollback

New feature releases often introduce bugs; a rollback plan is essential to revert quickly when issues arise.

Stress Testing and Contingency Plans

A stress‑test plan defines the load to apply, the testing strategy, and the metrics to collect (QPS, response time, success rate). Types include single‑machine, cluster, full‑link, read/write, simulation, and isolation‑cluster tests.
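The three metrics named above can be derived from raw per‑request samples. The sketch below assumes a hypothetical `LoadTestSummary` class and uses the mean as the response‑time figure; real test reports usually add percentiles as well.

```java
// Hypothetical summary of one stress-test run, computed from raw samples.
final class LoadTestSummary {
    final double qps;               // requests completed per second
    final double avgResponseMillis; // mean response time
    final double successRate;       // fraction of requests that succeeded

    LoadTestSummary(long[] latenciesMillis, boolean[] succeeded, double durationSeconds) {
        if (latenciesMillis.length == 0
                || latenciesMillis.length != succeeded.length
                || durationSeconds <= 0) {
            throw new IllegalArgumentException("need matching, non-empty samples and a positive duration");
        }
        long totalLatency = 0;
        int successes = 0;
        for (int i = 0; i < latenciesMillis.length; i++) {
            totalLatency += latenciesMillis[i];
            if (succeeded[i]) {
                successes++;
            }
        }
        int total = latenciesMillis.length;
        this.qps = total / durationSeconds;
        this.avgResponseMillis = (double) totalLatency / total;
        this.successRate = (double) successes / total;
    }
}
```

Comparing these figures across single‑machine and cluster runs shows whether capacity scales with the number of instances or is bounded by a shared dependency.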

Emergency plans should cover every layer:

Network layer (DNS, LVS, HAProxy)

Application entry (Nginx, OpenResty)

Web layer (Tomcat)

Service layer (Dubbo)

Data layer (Redis, DB)

Monitoring and Alerting

Comprehensive metrics (hardware, JVM, business, logs) with well‑chosen alert thresholds are vital. Many companies run an “eagle‑eye” style monitoring system so that issues are detected as soon as they appear rather than when users report them.

On‑Call System and Release Checklist

A mature on‑call rotation and a detailed release checklist dramatically reduce incidents caused by new feature deployments.
