
Designing High‑Availability Systems: Principles and Practices Across Six Layers

This article systematically explores high‑availability system design across six layers: development standards (including capacity planning), application services, storage, product strategies, operations and deployment, and incident response. It presents key concepts, architectural patterns, and practical guidelines for building resilient services.

Architecture Digest


This article analyzes the key designs and considerations required for a highly available system from six perspectives: development standards, application services, storage, product, operations and deployment, and incident response.

1. High‑Availability Architecture and System Design Philosophy

Availability and High‑Availability Concepts

Availability is a quantifiable metric defined as the proportion of total operational time that a system is functional, often expressed as a number of nines (e.g., 99.99% for four nines). High‑availability (HA) refers to a system’s ability to operate continuously without interruption, acknowledging that 100% availability is impossible.
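To make the nines concrete, the yearly downtime budget implied by each availability target can be computed directly; a minimal sketch:

```python
# Downtime budget implied by an availability target.
# E.g. "four nines" (99.99%) allows roughly 52.6 minutes of downtime per year.

MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime budget (minutes) for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability)

for nines, target in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
    print(f"{nines} nines ({target}): "
          f"{downtime_minutes_per_year(target):.1f} min/year")
```

This is why each additional nine is disproportionately expensive: the allowed annual downtime shrinks from days (two nines) to minutes (five nines).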

High‑Availability System Design Principles

Designing HA systems requires a scientific engineering management approach that considers product, development, operations, and infrastructure holistically. Key design considerations include:

Establish development standards – enforce consistent design documents, coding conventions, and review processes.

Capacity planning and evaluation – assess expected traffic volumes and ensure the architecture can handle peak loads.

Service‑level HA – implement load balancing, elastic scaling, asynchronous decoupling, fault tolerance, and overload protection.

Storage‑level HA – use redundancy, hot/cold backups, and failover mechanisms.

Operations‑level HA – adopt testing, monitoring, alerting, disaster recovery, and chaos engineering.

Product‑level HA – define fallback strategies.

Emergency response plans – prepare rapid recovery procedures for incidents.

2. Development Standards Layer

Design and Coding Standards

Development standards cover the entire lifecycle from design documentation to code and release. Recommended practices include:

Define a unified design document template and conduct mandatory reviews for new, refactored, or large‑scale projects.

Avoid excessive logging; adopt centralized remote logging and distributed tracing.

Maintain unit test coverage (e.g., 50% overall) and enforce language‑specific coding guidelines.

Standardize project layout and directory structures.

Capacity Planning and Evaluation

Capacity evaluation estimates average and peak request volumes based on product forecasts or historical data. Capacity planning determines the target traffic scale (e.g., tens of thousands to millions of requests) and guides architectural choices. Performance stress testing, focusing on QPS and response latency, validates the accuracy of capacity plans.

QPS Estimation (Funnel Model)

A funnel model estimates QPS at each processing stage, recognizing that downstream layers receive progressively fewer requests due to filtering (e.g., page view → product detail → order).
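The funnel estimate above can be sketched in a few lines; the entry QPS and conversion rates below are illustrative assumptions, not measured figures:

```python
# Funnel-model QPS estimate: each downstream stage sees only a fraction
# of the previous stage's traffic.

def funnel_qps(entry_qps: float, conversion_rates: list[float]) -> list[float]:
    """Return estimated QPS at each stage, starting from the entry point."""
    stages = [entry_qps]
    for rate in conversion_rates:
        stages.append(stages[-1] * rate)
    return stages

# Example: 10,000 QPS of page views; 20% open a product detail page,
# and 5% of those place an order.
page_view, detail, order = funnel_qps(10_000, [0.20, 0.05])
print(page_view, detail, order)
```

Sizing each layer from its own funnel stage, rather than the entry QPS, avoids over-provisioning the deep layers such as the order service.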

3. Application Service Layer

Stateless and Load‑Balancing Design

Stateless services enable multiple instances for higher concurrency and availability. Load balancing (via service discovery, LVS, Nginx, etc.) distributes traffic across instances and handles health checks and automatic removal of failed nodes.
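As an illustration of health checks and automatic removal, here is a hedged Nginx sketch using passive health checking; the addresses and thresholds are placeholders:

```nginx
# Illustrative upstream: Nginx marks a server as failed after 3 errors
# within 30s (passive health check) and stops routing to it until the
# fail_timeout window passes. Addresses are placeholders.
upstream app_backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backend;
        # Retry the next instance on connection errors or 5xx responses.
        proxy_next_upstream error timeout http_502 http_503;
    }
}
```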

Elastic Scaling Design

Elastic scaling adjusts resources based on traffic spikes. In cloud‑native environments, Kubernetes auto‑scales pods based on CPU usage; on physical servers, custom monitoring and scaling scripts are required.
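A CPU-based autoscaling policy in Kubernetes might look like the following sketch; the Deployment name and thresholds are placeholders:

```yaml
# Illustrative HorizontalPodAutoscaler: scales a Deployment named "web"
# (placeholder) between 3 and 20 replicas to keep average CPU near 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```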

Asynchronous Decoupling and Throttling (Message Queue)

Message queues (e.g., Kafka) transform synchronous flows into asynchronous ones, providing decoupling and traffic smoothing, which improves overall system resilience.
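The decoupling-and-smoothing idea can be shown with a bounded in-process queue as a stand-in for a real broker such as Kafka; this is a minimal sketch, not a production pattern:

```python
# Asynchronous decoupling sketch: the producer enqueues and returns
# immediately, while a consumer drains at its own pace. A bounded queue
# also provides backpressure when the consumer falls behind.
import queue
import threading

tasks: "queue.Queue[int | None]" = queue.Queue(maxsize=100)
processed: list[int] = []

def consumer() -> None:
    while True:
        item = tasks.get()
        if item is None:          # sentinel: shut down
            break
        processed.append(item)    # stand-in for real downstream work
        tasks.task_done()

worker = threading.Thread(target=consumer)
worker.start()

for i in range(10):               # burst of writes; the caller never blocks
    tasks.put(i)

tasks.put(None)
worker.join()
print(len(processed))  # 10
```

With a real broker the same shape holds, but the queue is durable and survives consumer restarts, which is what makes it useful for availability rather than just throughput.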

Failure and Fault‑Tolerance Design

Adopt a “design for failure” mindset: fail fast, implement self‑protection, and apply fallback mechanisms when downstream services degrade.
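A fail-fast call with a fallback can be sketched as follows; `fetch_recommendations` and `DEFAULT_LIST` are hypothetical names standing in for a real downstream dependency and its degraded default:

```python
# "Design for failure" sketch: call a slow downstream dependency with a
# hard timeout and fall back to a safe default instead of hanging.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

DEFAULT_LIST = ["fallback-item"]          # pre-agreed degraded response

def fetch_recommendations() -> list[str]:
    time.sleep(1.0)                       # simulate a slow downstream call
    return ["personalized-item"]

def recommendations_with_fallback(timeout_s: float = 0.2) -> list[str]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_recommendations)
        try:
            return future.result(timeout=timeout_s)   # fail fast
        except TimeoutError:
            return DEFAULT_LIST                       # degrade gracefully

print(recommendations_with_fallback())    # ['fallback-item']
```

The key point is that the caller's latency is bounded by the timeout it chooses, not by the health of the dependency.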

Overload Protection Design (Rate Limiting, Circuit Breaking, Degradation)

Implement rate limiting to reject excess requests, circuit breaking to isolate failing downstream services, and degradation to disable non‑critical features under overload.
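The rate-limiting piece is commonly implemented as a token bucket; a minimal sketch (with an injectable clock so the example is deterministic):

```python
# Token-bucket rate limiter: each request consumes a token and the bucket
# refills at a fixed rate, so bursts beyond capacity are rejected instead
# of overloading the service.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # reject: caller should fail fast

fake_time = [0.0]                         # controllable clock for the demo
bucket = TokenBucket(rate=10, capacity=5, clock=lambda: fake_time[0])
results = [bucket.allow() for _ in range(8)]   # burst of 8 instant calls
print(results)        # first 5 allowed, last 3 rejected
fake_time[0] = 0.5                        # 0.5 s later: 5 tokens refilled
print(bucket.allow())  # True
```

Circuit breaking and degradation build on the same "reject early" idea: the breaker rejects based on observed downstream failures rather than request volume.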

4. Storage Layer

Data storage HA is more complex due to statefulness. Common approaches include cluster storage (primary‑backup or primary‑replica) and distributed storage (e.g., HDFS, HBase, Elasticsearch). Each method addresses data replication, node role detection, and failover.

Cluster Storage (Centralized Storage)

Typical primary‑backup or primary‑replica setups replicate writes from the primary to backups, handle synchronization latency, and support automatic failover.
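The failover step can be reduced to a toy sketch; `Node` and its health flag are simplified stand-ins for real role detection (heartbeats or an external coordinator):

```python
# Minimal primary-backup failover sketch: writes go to the primary; when
# its health check fails, the first surviving replica is promoted.
class Node:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True

class Cluster:
    def __init__(self, nodes: list[Node]):
        self.nodes = nodes

    def primary(self) -> Node:
        for node in self.nodes:           # first healthy node acts as primary
            if node.healthy:
                return node
        raise RuntimeError("no healthy node available")

cluster = Cluster([Node("primary"), Node("backup-1"), Node("backup-2")])
print(cluster.primary().name)             # primary
cluster.nodes[0].healthy = False          # simulate primary failure
print(cluster.primary().name)             # backup-1
```

Real systems must additionally handle the replication lag mentioned above: a promoted backup may be missing the primary's last writes, which is a consistency decision, not just a routing one.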

Distributed Storage

Distributed storage spreads data across many nodes, eliminating a single write master and requiring a coordinator for data placement. It suits massive data volumes.

5. Product Layer

Product‑level HA focuses on fallback UI/UX strategies such as default pages, graceful error messages, maintenance screens, and placeholder items for features like lotteries.

6. Operations and Deployment Layer

Development Phase – Canary Release and Interface Testing

Gradual rollout (canary) and comprehensive interface test suites ensure stable releases.

Development Phase – Monitoring and Alert Design

Monitoring stacks (ELK, Prometheus, OpenTracing, OpenTelemetry) collect logs, metrics, and traces. Alerts must be real‑time, comprehensive, tiered, and delivered via multiple channels (SMS, email, dashboards).
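A tiered alert in the Prometheus stack might look like this sketch; the job name, threshold, and metric are illustrative assumptions:

```yaml
# Illustrative Prometheus alerting rule: fire a severity-labelled alert
# when the 5-minute error ratio of a hypothetical "web" job exceeds 5%.
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5..", job="web"}[5m]))
            / sum(rate(http_requests_total{job="web"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical            # tier used for channel routing
        annotations:
          summary: "web error ratio above 5% for 5 minutes"
```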

Development Phase – Security and Attack Prevention

Implement unified traffic gating, authentication, and service‑level authorization to mitigate abuse and attacks.

Deployment Phase – Multi‑Data‑Center Deployment (Disaster Recovery)

Stateless services can be replicated across data centers with service discovery; stateful storage requires careful replication and consistency handling.

Online Operation Phase – Failure Drills (Chaos Experiments)

Simulate outages (power loss, network cuts, service crashes) to validate system resilience, following practices pioneered by Netflix’s Chaos Monkey.

Online Operation Phase – Interface Probing

Periodic health checks of critical APIs trigger alerts when failures are detected.
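A probe is essentially a scheduled request plus a pass/fail classification; in this sketch the URL is a placeholder and classification is separated from I/O so it can be tested without a network:

```python
# Interface-probe sketch: hit a critical endpoint on a schedule and raise
# an alert signal when the check fails or the endpoint is unreachable.
import urllib.error
import urllib.request

def classify(status_code: int) -> str:
    """Map an HTTP status to a probe verdict."""
    return "ok" if 200 <= status_code < 300 else "alert"

def probe(url: str, timeout_s: float = 3.0) -> str:
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return classify(resp.status)
    except (urllib.error.URLError, OSError):
        return "alert"                    # unreachable counts as a failure

# probe("https://example.com/healthz") would be run periodically by a scheduler
print(classify(200), classify(503))  # ok alert
```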

7. Incident Response Layer

Pre‑defined emergency response procedures guide rapid recovery actions to minimize impact when incidents occur.

~ ~ ~

Recommended Reading

High‑Concurrency Architecture and System Design Experience

TCP Long‑Connection Design and Practical Application in IM Projects

Comprehensive Guide to Building a Cloud‑Native K8s Load‑Balancing System (Nginx)

Standardizing the Team’s Technical Design Template

Tags: monitoring, operations, deployment, high availability, system design, capacity planning, fault tolerance
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
