
Building a Comprehensive High‑Availability System: Disaster Recovery, Capacity Planning, Online Protection, and Fault Drills

This article explains how to construct a truly high‑availability architecture for modern distributed, cloud‑native services by covering disaster‑recovery principles, capacity planning with realistic load testing, online traffic protection, and systematic fault‑drill practices.

Full-Stack Internet Architecture

Jul 22, 2020

With the rapid growth of online business, the simple "single machine plus database" architecture can no longer keep up with demand. Systems have become large‑scale, distributed, and cloud‑native, which makes stability assurance increasingly complex.

Disaster Recovery

Aviation safety offers a useful model for disaster recovery. It considers three dimensions (people, aircraft, and environment) and builds redundancy and isolation into each: multiple pilots, duplicated aircraft systems, and weather radar, collision‑avoidance, and blind‑landing aids. In IT, the same idea translates to multiple servers, data replicas, and backup sites. Two metrics measure a disaster‑recovery plan: RPO (Recovery Point Objective), the maximum window of data loss that can be tolerated, and RTO (Recovery Time Objective), the maximum acceptable time to restore service.
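To make the two metrics concrete, the sketch below computes the data‑loss window and the downtime window of a single incident and compares them with their targets. The timestamps, targets, and function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_at: datetime, failed_at: datetime) -> timedelta:
    """Data-loss window: writes made after the last replicated point are lost."""
    return failed_at - last_replicated_at


def achieved_rto(failed_at: datetime, restored_at: datetime) -> timedelta:
    """Downtime window: how long the service was unavailable."""
    return restored_at - failed_at


def meets_objectives(rpo: timedelta, rto: timedelta,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True when both the data-loss and downtime windows are within target."""
    return rpo <= rpo_target and rto <= rto_target


# Hypothetical incident timeline (illustrative values only).
last_replicated_at = datetime(2020, 7, 22, 10, 0, 0)
failed_at = datetime(2020, 7, 22, 10, 0, 30)
restored_at = datetime(2020, 7, 22, 10, 5, 0)

rpo = achieved_rpo(last_replicated_at, failed_at)  # 30 seconds of writes at risk
rto = achieved_rto(failed_at, restored_at)         # 4 minutes 30 seconds of downtime
ok = meets_objectives(rpo, rto,
                      rpo_target=timedelta(minutes=1),
                      rto_target=timedelta(minutes=5))
```

A shorter replication interval lowers the achievable RPO, while faster detection and switchover lower the RTO, which is why the two are tracked separately.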

Industry‑Standard DR Solutions

Disaster‑recovery practice has evolved from traditional off‑site cold standby, to same‑city active‑active, and then to the "two‑region, three‑center" model (a production center, a same‑city standby center, and a remote standby center). Alibaba's AHAS adopts an even more advanced "multi‑active across regions" approach, achieving minute‑ or second‑level RPO/RTO and enabling seamless traffic switchover when a data center fails.
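The traffic‑switchover idea can be illustrated with a small routing sketch. It is not AHAS's implementation: the site names, the hash‑based "home site" assignment, and the ring‑walk fallback are all assumptions made for the example, and it covers only request routing, not the data replication that a real multi‑active deployment also needs.

```python
import hashlib

DATA_CENTERS = ["dc-east", "dc-north", "dc-south"]  # hypothetical site names


def home_index(user_id: str, n: int) -> int:
    """Map a user to a stable home site (hashlib is stable across processes)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def route(user_id: str, healthy: set) -> str:
    """Return the site that should serve this user.

    Prefers the user's home site; if it is down, walks the ring of sites
    until a healthy one is found.
    """
    start = home_index(user_id, len(DATA_CENTERS))
    for offset in range(len(DATA_CENTERS)):
        candidate = DATA_CENTERS[(start + offset) % len(DATA_CENTERS)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy data center available")


all_up = set(DATA_CENTERS)
home = route("user-42", all_up)                # normal operation: home site
fallback = route("user-42", all_up - {home})   # home site failed: next healthy site
```

Because the mapping is deterministic, a user returns to the same home site once it recovers, which keeps that user's writes concentrated in one place during normal operation.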

Capacity Planning

Traffic spikes from hot topics, Double 11, or ticket releases demand precise capacity forecasts. Traditional performance tests are not enough: the system's real throughput must be measured with a realistic traffic model, at realistic scale, and in a realistic environment.
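Once load testing has produced a per‑instance throughput figure, the forecast can be turned into an instance count. The formula below is a common back‑of‑the‑envelope sketch rather than a method taken from the article; the parameter names, the target utilization, and the spare‑instance allowance are assumptions.

```python
import math


def required_instances(peak_qps: float, qps_per_instance: float,
                       target_utilization: float = 0.5,
                       spare_instances: int = 0) -> int:
    """Instances needed to serve peak_qps while staying below target_utilization.

    qps_per_instance is the measured capacity of one instance under load test;
    spare_instances adds redundancy on top (for example, to survive a failure).
    """
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_qps = qps_per_instance * target_utilization
    return math.ceil(peak_qps / usable_qps) + spare_instances


# Hypothetical forecast: 120,000 QPS peak, 800 QPS measured per instance.
baseline = required_instances(120_000, 800, target_utilization=0.5)
with_spares = required_instances(120_000, 800, target_utilization=0.5,
                                 spare_instances=2)
```

Running each instance at only half of its measured capacity leaves headroom for forecast error, which is the role a capacity water level plays in day‑to‑day operations.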

Key characteristics of modern load testing:

Emphasize traffic realism

Support large‑scale execution

Be simple and easy to use

Alibaba's PTS‑based traffic engine generates load from regions nationwide and across carriers and supports up to 30 million QPS. It offers visual orchestration, is compatible with JMeter scripts, and integrates with AHAS for flow control, capacity water‑level (utilization headroom) management, and circuit‑breaker protection.

Full‑Link Load Testing

Single‑service tests miss many systemic issues; full‑link testing reproduces end‑to‑end traffic to expose problems that only appear under real‑world load. Alibaba Cloud’s PTS full‑link solution includes realistic environment replication, data sampling, traffic modeling, traffic generation, and multi‑dimensional monitoring for issue localization.

Online Protection

As the number of distributed components grows, so does the likelihood that one of them fails. AHAS provides a protection stack that spans the request path from the entry point to the backend: traffic throttling (QPS limits, warm‑up, queuing), anomaly isolation (for example, slow SQL or deadlocks), and system protection that dynamically balances load to prevent service degradation.
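A QPS limit is often implemented with a token bucket. The sketch below is a generic version of that technique, not the algorithm AHAS uses; the class name, parameters, and injectable clock are assumptions made so the example can be run deterministically.

```python
import time


class TokenBucket:
    """Generic token-bucket limiter: sustained rate_per_sec, bursts up to burst."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)   # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Admit one request if a token is available, otherwise reject it."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Deterministic demo with a fake clock (hypothetical values).
fake_now = [0.0]
limiter = TokenBucket(rate_per_sec=2, burst=2, clock=lambda: fake_now[0])
first_burst = [limiter.try_acquire() for _ in range(3)]  # third call exceeds the burst
fake_now[0] = 0.5                     # half a second later one token is refilled
after_refill = limiter.try_acquire()
```

Rejected requests can be failed fast or placed in a bounded queue; which is appropriate depends on whether callers can tolerate added latency.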

Fault Drills

Low‑probability failures can cause massive losses; systematic fault‑drill exercises are essential for fast‑growing, complex architectures. Alibaba’s fault‑drill platform, launched commercially in 2018 and open‑sourced in 2019, offers visual, safe, low‑cost drills with automatic protection policies to avoid unintended outages.
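The idea of an automatic protection policy can be sketched as a drill runner that always rolls the fault back and aborts early when a guard metric is breached. This is an illustration only, not the platform's API: every name here (`fault_drill`, `run_drill`, `DrillAborted`, the error‑rate guard) is hypothetical.

```python
from contextlib import contextmanager


class DrillAborted(Exception):
    """Raised when the guard metric shows the drill is hurting real traffic."""


@contextmanager
def fault_drill(inject, recover):
    """Inject a fault for the duration of the block and always roll it back."""
    inject()
    try:
        yield
    finally:
        recover()  # runs even if the drill is aborted or the check itself fails


def run_drill(inject, recover, sample_error_rate, max_error_rate, checks):
    """Sample the error rate `checks` times; abort once it exceeds the guard."""
    with fault_drill(inject, recover):
        for _ in range(checks):
            rate = sample_error_rate()
            if rate > max_error_rate:
                raise DrillAborted(
                    f"error rate {rate:.2%} exceeded guard {max_error_rate:.2%}")
    return "completed"


# Hypothetical drill: the third sample breaches the 5% guard.
state = {"fault_active": False}
samples = iter([0.01, 0.02, 0.30])


def inject():
    state["fault_active"] = True


def recover():
    state["fault_active"] = False


try:
    outcome = run_drill(inject, recover, lambda: next(samples),
                        max_error_rate=0.05, checks=3)
except DrillAborted:
    outcome = "aborted"
```

Putting the rollback in a `finally` clause is the key design choice: the injected fault is removed whether the drill completes, is aborted by the guard, or fails for an unrelated reason.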

Overall, a high‑availability system combines layered disaster recovery, precise capacity planning, realistic full‑link testing, proactive online protection, and regular fault‑drill rehearsals to ensure resilient operation of large‑scale internet services.

