Key Practices for Achieving High Availability in Internet Services
The article outlines essential high‑availability techniques for internet‑scale systems, covering availability metrics, microservice modularization, database redundancy, load balancing, rate limiting, circuit breaking, isolation, retry strategies, rollback plans, stress testing, monitoring, and on‑call procedures.
Today we discuss the performance metric that internet businesses value most, high availability, and share practical points drawn from a payment system implementation.
Availability Measurement and Evaluation
Downtime per incident = time the fault is resolved − time the fault is detected. Annual availability = (1 − total downtime in the year / total time in the year) × 100%.
Three nines (99.9%, roughly 8.76 hours of downtime per 365‑day year) can be reached with manual operations; four nines (99.99%, roughly 52.6 minutes) requires robust on‑call teams, fault‑handling processes, and automated recovery; five nines (99.999%, roughly 5.3 minutes) demands full disaster recovery and self‑healing mechanisms.
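The availability formula above can be turned into a small calculation. The following is a minimal sketch, not code from the payment system: the class name `AvailabilityBudget` and its methods are hypothetical, and it assumes a 365‑day year as the "total time in the year".

```java
import java.time.Duration;

/** Hypothetical helper converting between availability targets and downtime. */
final class AvailabilityBudget {
    // Assumption: a non-leap year of 365 days is the measurement window.
    static final Duration YEAR = Duration.ofDays(365);

    private AvailabilityBudget() {}

    /** Annual availability in percent: (1 - downtime / year) * 100. */
    static double availabilityPercent(Duration totalDowntime) {
        double ratio = (double) totalDowntime.toMillis() / YEAR.toMillis();
        return (1.0 - ratio) * 100.0;
    }

    /** Maximum downtime per year that still meets the target (in percent). */
    static Duration allowedDowntime(double targetPercent) {
        double unavailableFraction = 1.0 - targetPercent / 100.0;
        return Duration.ofMillis(Math.round(YEAR.toMillis() * unavailableFraction));
    }
}
```

Working with a downtime budget rather than a percentage tends to make the target easier to reason about during incident reviews, since each incident's duration can be subtracted from the remaining budget.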
System Modularity and Micro‑service Architecture
In a monolithic back end, a failure in one function can bring down all the others (orders, payments, etc.). Decomposing the system into microservices contains failures, so each service can keep operating independently.
High‑Availability Design for Dependent Components (MySQL, Redis)
Critical middleware must also be highly available. For MySQL we use a same‑city master‑slave deployment with a remote disaster‑recovery site and proxy routing; for Redis we employ Sentinel for automatic failover.
Load Balancing
Key load‑balancing tools include LVS (Linux Virtual Server) for high‑performance distribution, Nginx as a secondary balancer, and API gateways with multiple replicas to ensure service continuity.
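To illustrate how a balancer keeps traffic flowing when one replica fails, here is a minimal round‑robin sketch that skips replicas marked as down. It is not how LVS or Nginx are implemented; the class `RoundRobinBalancer` and the `markDown`/`markUp` methods are hypothetical names, and health detection itself is assumed to happen elsewhere.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical round-robin selector that skips unhealthy replicas. */
final class RoundRobinBalancer {
    private final List<String> backends;
    private final Set<String> down = ConcurrentHashMap.newKeySet();
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBalancer(List<String> backends) {
        if (backends.isEmpty()) {
            throw new IllegalArgumentException("at least one backend is required");
        }
        this.backends = new ArrayList<>(backends);
    }

    // Assumed to be called by an external health checker (not shown).
    void markDown(String backend) { down.add(backend); }
    void markUp(String backend) { down.remove(backend); }

    /** Returns the next healthy backend, trying each replica at most once. */
    String pick() {
        for (int i = 0; i < backends.size(); i++) {
            // floorMod keeps the index valid even after the counter overflows.
            int index = Math.floorMod(next.getAndIncrement(), backends.size());
            String candidate = backends.get(index);
            if (!down.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("no healthy backend available");
    }
}
```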
Rate Limiting
Rate limiting caps the request rate or concurrency a system accepts in order to protect its stability. It can be implemented on a single node with a counter (e.g., AtomicLong#incrementAndGet()) or across a cluster with a distributed limiter; token‑bucket and leaky‑bucket are the common algorithms.
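The two single‑node approaches can be sketched as follows. This is an illustrative sketch under stated assumptions, not the article's implementation: `ConcurrencyLimiter` and `TokenBucket` are hypothetical classes, the token bucket takes the current time as a parameter (callers would typically pass `System.nanoTime()`), and a cluster‑wide limiter would need shared state that is not shown here.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical single-node limiter capping in-flight requests. */
final class ConcurrencyLimiter {
    private final long maxInFlight;
    private final AtomicLong inFlight = new AtomicLong();

    ConcurrencyLimiter(long maxInFlight) { this.maxInFlight = maxInFlight; }

    /** Returns true if the request may proceed; the caller must then call release(). */
    boolean tryAcquire() {
        if (inFlight.incrementAndGet() > maxInFlight) {
            inFlight.decrementAndGet(); // undo the increment for the rejected request
            return false;
        }
        return true;
    }

    void release() { inFlight.decrementAndGet(); }
}

/** Hypothetical token bucket: refills at a fixed rate, allows bursts up to capacity. */
final class TokenBucket {
    private final long capacity;
    private final double tokensPerNano;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double tokensPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.tokensPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity; // start full
        this.lastRefillNanos = nowNanos;
    }

    /** Consumes one token if available; nowNanos is supplied by the caller. */
    synchronized boolean tryAcquire(long nowNanos) {
        long elapsed = Math.max(0, nowNanos - lastRefillNanos);
        tokens = Math.min(capacity, tokens + elapsed * tokensPerNano);
        lastRefillNanos = Math.max(lastRefillNanos, nowNanos);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Passing the clock in explicitly is a design choice made here to keep the sketch deterministic and easy to test; a production limiter would more likely read the clock itself.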
Circuit Breaker (Fail‑Fast)
When a downstream resource becomes unstable, a circuit breaker quickly fails calls to prevent cascading errors, allowing higher‑level services to handle the fault.
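A minimal sketch of the fail‑fast behavior described above is shown below. It is an assumption‑laden illustration rather than a production breaker: the class `CircuitBreaker`, its threshold and open‑duration parameters, and the caller‑supplied clock are all hypothetical, and in the half‑open state it may let more than one trial call through under concurrency.

```java
import java.util.function.Supplier;

/** Hypothetical circuit breaker: opens after N consecutive failures. */
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openDurationNanos;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAtNanos;

    CircuitBreaker(int failureThreshold, long openDurationNanos) {
        this.failureThreshold = failureThreshold;
        this.openDurationNanos = openDurationNanos;
    }

    synchronized State state() { return state; }

    /** Runs the action unless the breaker is open; otherwise fails fast to the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback, long nowNanos) {
        if (!allowRequest(nowNanos)) {
            return fallback.get(); // fail fast: the downstream is not called at all
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure(nowNanos);
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest(long nowNanos) {
        if (state == State.OPEN) {
            if (nowNanos - openedAtNanos < openDurationNanos) {
                return false;
            }
            state = State.HALF_OPEN; // cool-down elapsed: allow a trial call
        }
        return true;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure(long nowNanos) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtNanos = nowNanos;
        }
    }
}
```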
Isolation
Physical isolation deploys loosely coupled subsystems independently, which limits the impact of a fault; thread‑level isolation adds further protection inside a single service.
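Thread‑level isolation is often realized by giving each dependency its own bounded thread pool, so a slow dependency can exhaust only its own threads. The sketch below assumes that approach; the class `Bulkhead` and its parameters are hypothetical and not taken from the article.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical bulkhead: one bounded thread pool per downstream dependency. */
final class Bulkhead {
    private final ExecutorService pool;

    Bulkhead(int threads, int queueCapacity) {
        // Bounded queue plus AbortPolicy: excess work is rejected instead of piling up.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Runs the task in this bulkhead's pool; returns the fallback on rejection, timeout, or error. */
    <T> T call(Callable<T> task, long timeoutMillis, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(task);
        } catch (RejectedExecutionException e) {
            return fallback; // pool and queue are full
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow task to free its thread
            return fallback;
        } catch (ExecutionException e) {
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    void shutdown() { pool.shutdownNow(); }
}
```

With one `Bulkhead` instance per dependency, a stalled call in one pool times out and falls back while calls routed through other pools continue to be served.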
Timeout and Retry
Because networks are unreliable, calls need timeouts and retries, but retries must be combined with idempotency to avoid duplicate operations, especially in financial transactions.
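The interaction between retries and idempotency can be sketched as follows. This is a simplified illustration, not the payment system's code: `IdempotentPayments`, `Retry`, the in‑memory map, and the receipt format are all hypothetical, and a real service would persist the idempotency keys in durable storage.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical payment handler that deduplicates requests by idempotency key. */
final class IdempotentPayments {
    private final Map<String, String> receiptsByKey = new ConcurrentHashMap<>();
    private final AtomicInteger chargesExecuted = new AtomicInteger();

    /** Charges at most once per key; repeated calls return the stored receipt. */
    String charge(String idempotencyKey, long amountCents) {
        return receiptsByKey.computeIfAbsent(idempotencyKey, key -> {
            chargesExecuted.incrementAndGet(); // stands in for the real side effect
            return "receipt-" + key;
        });
    }

    int chargesExecuted() { return chargesExecuted.get(); }
}

/** Hypothetical retry helper with exponential backoff. */
final class Retry {
    private Retry() {}

    static <T> T withRetry(int maxAttempts, long baseDelayMillis, Supplier<T> call) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(baseDelayMillis << (attempt - 1)); // 1x, 2x, 4x, ...
                }
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

The key point is that the client reuses the same idempotency key on every attempt. If the first attempt actually succeeded but its response was lost, the retry returns the stored receipt instead of charging again.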
Rollback
New feature releases should include a rollback plan to revert quickly if issues arise.
Stress Testing and Contingency Planning
Stress testing defines load, strategy, and metrics (QPS, latency, success rate) across single‑machine, cluster, and full‑link scenarios; contingency plans cover each layer, from DNS/LVS down to the database.
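As a rough illustration of the metrics named above, the sketch below derives QPS, success rate, and a latency percentile from raw test samples. The class `LoadTestSummary` is hypothetical, and the choice of the nearest‑rank method for the 99th percentile is an assumption; load‑testing tools may compute percentiles differently.

```java
import java.util.Arrays;

/** Hypothetical summary of one stress-test run. */
final class LoadTestSummary {
    final double qps;
    final double successRate;
    final long p99Millis;

    private LoadTestSummary(double qps, double successRate, long p99Millis) {
        this.qps = qps;
        this.successRate = successRate;
        this.p99Millis = p99Millis;
    }

    /**
     * @param latenciesMillis one latency sample per request sent
     * @param successes       number of requests that succeeded
     * @param durationSeconds wall-clock length of the test run
     */
    static LoadTestSummary of(long[] latenciesMillis, int successes, double durationSeconds) {
        if (latenciesMillis.length == 0 || durationSeconds <= 0) {
            throw new IllegalArgumentException("need at least one sample and a positive duration");
        }
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        double qps = sorted.length / durationSeconds;
        double successRate = (double) successes / sorted.length;
        return new LoadTestSummary(qps, successRate, percentile(sorted, 99.0));
    }

    /** Nearest-rank percentile over an ascending-sorted array. */
    static long percentile(long[] sorted, double percent) {
        int rank = (int) Math.ceil(percent * sorted.length / 100.0);
        return sorted[Math.max(0, rank - 1)];
    }
}
```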
Monitoring and Alerting
Comprehensive metrics (hardware, JVM, business, logs) and alert thresholds enable rapid detection and response to incidents.
On‑Call System and Release Checklist
A mature on‑call rotation and release checklist dramatically reduce post‑deployment incidents, ensuring high‑availability goals are met.
IT Architects Alliance