Key Practices for Achieving High Availability in Internet Services

The article outlines essential high‑availability techniques for internet‑scale systems, covering availability metrics, microservice modularization, database redundancy, load balancing, rate limiting, circuit breaking, isolation, retry strategies, rollback plans, stress testing, monitoring, and on‑call procedures.

IT Architects Alliance

Mar 14, 2023

Today we discuss high availability, the metric that internet businesses value most, and share practical points drawn from a payment system implementation.

Availability Measurement and Evaluation

Website downtime = time the fault was recovered - time the fault was detected. Annual availability = (1 - total downtime in the year / total time in the year) × 100%.

Reaching three nines (99.9%, about 8.76 hours of downtime per year) is relatively easy with manual operations; four nines (99.99%, about 52.6 minutes per year) requires robust on-call teams, fault-handling processes, and automated recovery; five nines (99.999%, about 5.3 minutes per year) demands full disaster recovery and self-healing mechanisms.
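As a rough illustration of the two formulas above, the following sketch converts between downtime and annual availability. The class name `AvailabilityMath` and the 365-day year are assumptions made for this example, not details from the original system.

```java
import java.time.Duration;

/** Sketch: converts between downtime and annual availability. */
final class AvailabilityMath {
    // Assumption: a non-leap year of 365 days (8,760 hours).
    static final Duration YEAR = Duration.ofDays(365);

    /** Annual availability in percent for a given total downtime. */
    static double availabilityPercent(Duration downtime) {
        return (1.0 - (double) downtime.toMillis() / YEAR.toMillis()) * 100.0;
    }

    /** Downtime allowed per year for a target availability in percent. */
    static Duration downtimeBudget(double targetPercent) {
        double unavailableFraction = 1.0 - targetPercent / 100.0;
        return Duration.ofMillis(Math.round(unavailableFraction * YEAR.toMillis()));
    }
}
```

For instance, `AvailabilityMath.downtimeBudget(99.99)` yields a budget of about 52 minutes for the whole year, which is why four nines is hard to reach without automated recovery.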

System Modularity and Micro‑service Architecture

In a monolithic back end, a failure in one function can bring down all the others (orders, payments, and so on). Decomposing the system into microservices contains failures, so each service can keep operating independently.

High‑Availability Design for Dependent Components (MySQL, Redis)

Critical middleware components must also be highly available. For MySQL we use a same-city master-slave setup with remote disaster recovery and proxy routing; for Redis we employ Sentinel for automatic failover.

Load Balancing

Key load‑balancing tools include LVS (Linux Virtual Server) for high‑performance distribution, Nginx as a secondary balancer, and API gateways with multiple replicas to ensure service continuity.
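To make the idea of spreading requests across replicas concrete, here is a minimal round-robin sketch that skips replicas marked as down. It is not how LVS or Nginx are implemented; the class `RoundRobinBalancer`, its methods, and the string backend names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: round-robin selection over replicas, skipping those marked down. */
final class RoundRobinBalancer {
    private final List<String> backends;                       // hypothetical replica names
    private final Set<String> down = ConcurrentHashMap.newKeySet();
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBalancer(List<String> backends) {
        if (backends.isEmpty()) throw new IllegalArgumentException("no backends");
        this.backends = new ArrayList<>(backends);
    }

    // Assumption: some external health check calls these two methods.
    void markDown(String backend) { down.add(backend); }
    void markUp(String backend) { down.remove(backend); }

    String pick() {
        // Try each backend at most once per call, skipping unhealthy ones.
        for (int attempt = 0; attempt < backends.size(); attempt++) {
            // floorMod keeps the index non-negative after the counter overflows.
            int i = Math.floorMod(next.getAndIncrement(), backends.size());
            String candidate = backends.get(i);
            if (!down.contains(candidate)) return candidate;
        }
        throw new IllegalStateException("no healthy backend available");
    }
}
```

With multiple replicas behind such a selector, losing one replica reduces capacity but does not interrupt service.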

Rate Limiting

Rate limiting caps the number of requests a system accepts in order to protect its stability. It can be implemented on a single node with a simple counter (e.g., AtomicLong#incrementAndGet()) or across a cluster as a distributed limiter, typically using a token-bucket or leaky-bucket algorithm.
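The token-bucket algorithm mentioned above can be sketched for a single node as follows. This is an illustrative sketch rather than the article's implementation: the class `TokenBucket`, its parameters, and the injectable clock are assumptions, and a cluster-wide limiter would need shared state that is not shown here.

```java
import java.util.function.LongSupplier;

/** Sketch: single-node token-bucket rate limiter. */
final class TokenBucket {
    private final long capacity;          // maximum burst size, in tokens
    private final double refillPerNano;   // tokens added per nanosecond
    private final LongSupplier clock;     // nanosecond clock, e.g. System::nanoTime
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double tokensPerSecond, LongSupplier clock) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.clock = clock;
        this.tokens = capacity;           // the bucket starts full
        this.lastRefill = clock.getAsLong();
    }

    /** Returns true if the request may proceed, false if it should be rejected. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        // Refill in proportion to elapsed time, never exceeding capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production code the clock would typically be `System::nanoTime`; passing it in makes the limiter easy to test with a fake clock.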

Circuit Breaker (Fail‑Fast)

When a downstream resource becomes unstable, a circuit breaker quickly fails calls to prevent cascading errors, allowing higher‑level services to handle the fault.
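The fail-fast behavior can be illustrated with a small count-based breaker. This is a simplified sketch under stated assumptions: the class `CircuitBreaker`, the consecutive-failure threshold, and the fixed open interval are all hypothetical choices, and the method holds a lock during the downstream call, which a production breaker would avoid.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Sketch: count-based circuit breaker with CLOSED, OPEN and HALF_OPEN states. */
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // how long to fail fast before a trial call
    private final LongSupplier clock;     // millisecond clock, injectable for tests
    private State state = State.CLOSED;
    private int failures;
    private long openedAt;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // Simplification: synchronized serializes calls through the breaker.
    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) {
                return fallback.get();    // fail fast without touching downstream
            }
            state = State.HALF_OPEN;      // interval elapsed: allow one trial call
        }
        try {
            T result = action.get();
            failures = 0;
            state = State.CLOSED;         // success closes the breaker
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }

    synchronized State state() { return state; }
}
```

While the breaker is open, callers receive the fallback immediately, so threads are not tied up waiting on an unstable dependency.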

Isolation

Physical isolation deploys loosely coupled subsystems independently, which limits the impact of a fault; thread-level isolation goes further and protects a single service from being exhausted by one slow dependency.
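One lightweight way to approximate thread-level isolation is a semaphore-based bulkhead that caps how many threads may be inside a given dependency at once. Note that this swaps in a semaphore for dedicated thread pools, which the article may have had in mind; the class `Bulkhead` and its method names are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Sketch: semaphore bulkhead limiting concurrent calls to one dependency. */
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the action if a permit is free, otherwise returns the rejection value. */
    <T> T call(Supplier<T> action, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get();        // dependency is saturated: do not queue
        }
        try {
            return action.get();
        } finally {
            permits.release();            // always give the permit back
        }
    }
}
```

Giving each downstream dependency its own `Bulkhead` means a slow dependency can occupy at most its own quota of threads rather than the whole service.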

Timeout and Retry

Network unreliability necessitates retries, but retries must be combined with idempotency to avoid duplicate operations, especially in financial transactions.
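The interaction between retries and idempotency can be sketched as below. Everything here is illustrative: the class `IdempotentPayments`, the in-memory map standing in for a durable store, and the idempotency key format are assumptions, and a real retry loop would also add timeouts and backoff.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Sketch: idempotent charge handler plus a bounded retry helper. */
final class IdempotentPayments {
    // Assumption: an in-memory map stands in for a durable store with a unique key.
    private final Map<String, String> resultsByKey = new ConcurrentHashMap<>();
    private final AtomicInteger charges = new AtomicInteger();

    /** Executes the charge at most once per idempotency key. */
    String charge(String idempotencyKey, long amountCents) {
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> {
            charges.incrementAndGet();    // the side effect happens only on first use
            return "charged:" + amountCents;
        });
    }

    int chargeCount() { return charges.get(); }

    /** Retries the action up to maxAttempts times, rethrowing the last failure. */
    static <T> T retry(int maxAttempts, Supplier<T> action) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;                 // no backoff in this sketch
            }
        }
        throw last;
    }
}
```

If the first attempt succeeds on the server but the response is lost, the retry reuses the same key and receives the stored result instead of charging the customer twice.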

Rollback

New feature releases should include a rollback plan to revert quickly if issues arise.

Stress Testing and Contingency Planning

Stress testing defines load, strategy, and metrics (QPS, latency, success rate) across single‑machine, cluster, and full‑link scenarios; contingency plans cover each layer—from DNS/LVS to database.
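As a small illustration of the metrics named above, the following sketch summarizes one load-test run into QPS, success rate, and 99th-percentile latency. The class `LoadTestSummary`, its inputs, and the nearest-rank percentile method are assumptions chosen for the example.

```java
import java.util.Arrays;

/** Sketch: summarizes one load-test run. All names are illustrative. */
final class LoadTestSummary {
    final double qps;            // requests per second over the run
    final double successRate;    // fraction of requests that succeeded, 0.0 to 1.0
    final long p99Millis;        // 99th-percentile latency in milliseconds

    private LoadTestSummary(double qps, double successRate, long p99Millis) {
        this.qps = qps;
        this.successRate = successRate;
        this.p99Millis = p99Millis;
    }

    /** One latency sample per request; successes counts requests that succeeded. */
    static LoadTestSummary of(long[] latenciesMillis, int successes, double durationSeconds) {
        if (latenciesMillis.length == 0 || durationSeconds <= 0) {
            throw new IllegalArgumentException("need samples and a positive duration");
        }
        int total = latenciesMillis.length;
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        return new LoadTestSummary(
                total / durationSeconds,
                (double) successes / total,
                percentile(sorted, 99));
    }

    /** Nearest-rank percentile over an ascending-sorted array. */
    static long percentile(long[] sorted, int percent) {
        int rank = (int) Math.ceil(percent * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Comparing these three numbers across single-machine, cluster, and full-link runs shows where throughput stops scaling or tail latency starts to climb.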

Monitoring and Alerting

Comprehensive metrics (hardware, JVM, business, logs) and alert thresholds enable rapid detection and response to incidents.

On‑Call System and Release Checklist

A mature on‑call rotation and release checklist dramatically reduce post‑deployment incidents, ensuring high‑availability goals are met.
