System Reliability and Availability: Insights from the Alipay Outage and YunOS

The article examines system reliability concepts such as availability, MTBF, MTTR, and outage classifications, analyzes the Alipay service interruption, discusses various redundancy and failover strategies, and explores YunOS reliability testing and design practices to improve overall system robustness.

Architect

Jan 22, 2016

System Reliability Overview

System reliability is usually expressed as availability. The classic supporting metrics are MTBF (Mean Time Between Failures), MTTR (Mean Time To Repair), and MTTF (Mean Time To Failure). Availability is often calculated as (total service time − outage time) / total service time.
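As a minimal sketch of the formula above, the following Python computes availability from service and outage time. The second function uses the common steady-state form MTBF / (MTBF + MTTR), which assumes MTBF is measured as mean uptime between failures; that reading and the function names are illustrative assumptions, not taken from the article.

```python
def availability_from_outage(total_service_time: float, outage_time: float) -> float:
    """Availability = (total service time - outage time) / total service time."""
    if total_service_time <= 0:
        raise ValueError("total_service_time must be positive")
    if not 0 <= outage_time <= total_service_time:
        raise ValueError("outage_time must lie between 0 and total_service_time")
    return (total_service_time - outage_time) / total_service_time


def availability_from_mtbf(mtbf: float, mttr: float) -> float:
    """Steady-state availability, assuming MTBF is mean uptime between failures."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    # One hour of outage in 1000 hours of service.
    print(availability_from_outage(1000.0, 1.0))
    # 999 hours up on average, 1 hour to repair on average.
    print(availability_from_mtbf(999.0, 1.0))
```

Both units must match within a call (for example, hours for every argument); the result is a unitless fraction.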

Outage Types

Outages are categorized by cause: product‑related, customer‑related, and external factors. TL9000 defines a Total Outage (all major functions fail) and a Partial Outage (service capacity is reduced, e.g., by 20%). Minor interruptions (less than 10% capacity loss, or shorter than 15 seconds) may not be counted as outages when measuring against high‑availability targets.
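The classification above can be sketched as a small function. The 10% and 15-second thresholds come from the text; treating a total outage as 100% capacity loss is a simplifying assumption, and the actual TL9000 counting rules are more detailed than this.

```python
MINOR_CAPACITY_LOSS = 0.10   # below 10% capacity loss: not counted (from the text)
MINOR_DURATION_SECONDS = 15  # shorter than 15 seconds: not counted (from the text)


def classify_outage(capacity_loss: float, duration_seconds: float) -> str:
    """Return 'minor', 'partial', or 'total' for one service interruption.

    capacity_loss is the fraction of service capacity lost, from 0.0 to 1.0.
    Assumption: a total outage is modeled as a capacity loss of 1.0.
    """
    if not 0.0 <= capacity_loss <= 1.0:
        raise ValueError("capacity_loss must be between 0.0 and 1.0")
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")

    if capacity_loss < MINOR_CAPACITY_LOSS or duration_seconds < MINOR_DURATION_SECONDS:
        return "minor"
    if capacity_loss >= 1.0:
        return "total"
    return "partial"


if __name__ == "__main__":
    print(classify_outage(0.20, 600))  # 20% capacity lost for ten minutes
    print(classify_outage(1.00, 5))    # everything down, but only for five seconds
```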

Five Nines, DPM and Failover Recovery Time

Five‑nines availability (99.999%) allows at most about 5.26 minutes of downtime per year. DPM (defects per million requests) maps onto availability levels: 1 DPM ≈ six nines, 10 DPM ≈ five nines, 1000 DPM ≈ three nines. High availability also requires detecting faults and completing failover within seconds.
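The arithmetic behind these figures is short enough to check directly. This sketch assumes a 365-day year and treats DPM as the fraction of failed requests times one million; the function names are illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Maximum yearly downtime allowed by an availability of `nines` nines."""
    unavailability = 10.0 ** -nines
    return MINUTES_PER_YEAR * unavailability


def dpm_to_availability(dpm: float) -> float:
    """Availability implied by a defects-per-million request failure rate."""
    return 1.0 - dpm / 1_000_000


def dpm_to_nines(dpm: float) -> float:
    """Number of nines implied by a DPM value (10 DPM is five nines)."""
    if dpm <= 0:
        raise ValueError("dpm must be positive")
    return -math.log10(dpm / 1_000_000)


if __name__ == "__main__":
    for nines in (3, 4, 5, 6):
        print(f"{nines} nines: {downtime_minutes_per_year(nines):.2f} min/year")
    for dpm in (1, 10, 1000):
        print(f"{dpm} DPM is about {dpm_to_nines(dpm):.1f} nines")
```

Time-based availability and request-based DPM measure different things, so the mapping holds only when failed requests are spread in proportion to downtime.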

Alipay Service Outage Analysis

The incident was caused by an external fiber cut, leading to a partial outage of about two hours. Some services used geo‑redundant multi‑active deployment and switched over immediately, while others were down for the full duration. Two hours is roughly 0.023% of a year, so the overall availability for the year is estimated between 99.9% and 99.99% (between three and four nines).
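The estimate can be reproduced with a few lines of Python. This is a worst-case sketch: it assumes the whole service was down for the full two hours and that no other outage occurred that year, neither of which the article states.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def annual_availability(outage_minutes: float) -> float:
    """Availability over one year given the total minutes of outage."""
    if not 0 <= outage_minutes <= MINUTES_PER_YEAR:
        raise ValueError("outage_minutes must fit within one year")
    return 1.0 - outage_minutes / MINUTES_PER_YEAR


# Worst case: treat the two-hour partial outage as a total outage.
worst_case = annual_availability(120)

# Yearly downtime allowed by three and four nines, for comparison.
three_nines_budget = MINUTES_PER_YEAR * 1e-3  # about 525.6 minutes
four_nines_budget = MINUTES_PER_YEAR * 1e-4   # about 52.56 minutes

if __name__ == "__main__":
    print(f"worst-case availability: {worst_case:.5%}")
    print(f"three nines allows {three_nines_budget:.1f} min, "
          f"four nines allows {four_nines_budget:.2f} min")
```

Because 120 minutes is below the three-nines allowance and above the four-nines allowance, the result lands between the two levels, as the text says.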

Redundancy Strategies

Various disaster‑recovery models are described:

No fault tolerance: system cannot recover automatically.

Active‑Standby (1+1): a backup service activates when the primary fails.

Active‑Active: two clusters serve simultaneously; the surviving cluster takes over if one fails.

Active‑Active with Geo‑Redundancy: clusters are in different geographic locations.

Multi‑Active with Geo‑Redundancy: multiple clusters across regions, offering the highest resilience.

N+K: the system can sustain K node failures while maintaining N node capacity.
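To compare these models numerically, the sketch below estimates the availability of an N+K group as the probability that at least N of the N+K nodes are up. It assumes nodes fail independently and that failover is instantaneous and always succeeds; real deployments violate both to some degree, so the numbers are upper bounds. Active‑Standby (1+1) corresponds to n=1, k=1 under these assumptions.

```python
from math import comb


def has_required_capacity(n: int, k: int, failed_nodes: int) -> bool:
    """True if an N+K group still has N working nodes after some failures."""
    if n < 1 or k < 0 or not 0 <= failed_nodes <= n + k:
        raise ValueError("invalid node counts")
    return (n + k) - failed_nodes >= n


def n_plus_k_availability(n: int, k: int, node_availability: float) -> float:
    """Probability that at least n of n + k independent nodes are up."""
    if n < 1 or k < 0:
        raise ValueError("n must be >= 1 and k must be >= 0")
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0.0 and 1.0")
    total = n + k
    a = node_availability
    # Sum the binomial probabilities for every count of healthy nodes >= n.
    return sum(
        comb(total, up) * a**up * (1.0 - a) ** (total - up)
        for up in range(n, total + 1)
    )


if __name__ == "__main__":
    print(n_plus_k_availability(1, 0, 0.99))  # no fault tolerance
    print(n_plus_k_availability(1, 1, 0.99))  # 1+1 redundancy
    print(n_plus_k_availability(4, 2, 0.99))  # 4+2 cluster
```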

YunOS Reliability

YunOS is Alibaba’s smart‑device operating system, consisting of cloud services and client applications. Reliability testing on the client side uses MTBF tests, where multiple devices run continuous workloads 24/7 and record failures such as crashes or reboots. The MTBF value is calculated as total runtime divided by the number of failures.
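The MTBF calculation described above can be sketched as follows. The DeviceRun record and the sample figures are hypothetical and only illustrate the formula; they are not YunOS test data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeviceRun:
    """Result of one device in an MTBF test run (hypothetical record layout)."""
    device_id: str
    runtime_hours: float
    failures: int  # crashes, reboots, and similar events recorded during the run


def fleet_mtbf_hours(runs: List[DeviceRun]) -> Optional[float]:
    """MTBF = total runtime across all devices / total number of failures.

    Returns None when no failure was observed, because the ratio is undefined;
    in that case the total runtime is only a lower bound on the MTBF.
    """
    total_runtime = sum(run.runtime_hours for run in runs)
    total_failures = sum(run.failures for run in runs)
    if total_failures == 0:
        return None
    return total_runtime / total_failures


if __name__ == "__main__":
    # Made-up example: three devices, ten days of continuous workload each.
    runs = [
        DeviceRun("device-a", 240.0, 1),
        DeviceRun("device-b", 240.0, 0),
        DeviceRun("device-c", 240.0, 2),
    ]
    print(fleet_mtbf_hours(runs))
```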

MTBF testing differs from service‑level availability measurement but shares the goal of reducing downtime and improving fault tolerance. Concepts such as treating failures as outages, applying DPM metrics, and implementing fast failover recovery apply to both mobile and server environments.

Reliability Design, Modeling, and Testing

Reliability must be built into system architecture and design, not achieved solely through testing. Designers should model component failure probabilities, allocate downtime budgets, and incorporate redundancy strategies to meet target availability levels.

For example, a Downtime Budget diagram can help identify components that need improvement.
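A downtime budget can also be worked out in code. The sketch below assumes the components are in series (the system is up only when all of them are up), that they fail independently, and that the yearly budget is split equally among them. The component names and availability figures are hypothetical.

```python
from typing import Dict, List, Tuple

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def series_availability(components: Dict[str, float]) -> float:
    """Availability of a chain of independent components that must all be up."""
    result = 1.0
    for availability in components.values():
        result *= availability
    return result


def downtime_budget(
    components: Dict[str, float], target_availability: float
) -> List[Tuple[str, float, float]]:
    """Return (name, expected_downtime_min, budget_share_min), worst first.

    Assumption: the total yearly budget is divided equally among components.
    """
    total_budget = (1.0 - target_availability) * MINUTES_PER_YEAR
    share = total_budget / len(components)
    rows = [
        (name, (1.0 - availability) * MINUTES_PER_YEAR, share)
        for name, availability in components.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical component availabilities, for illustration only.
    components = {"load_balancer": 0.9999, "app_server": 0.9995, "database": 0.999}
    print(f"system availability: {series_availability(components):.5f}")
    for name, expected, share in downtime_budget(components, 0.999):
        status = "over budget" if expected > share else "ok"
        print(f"{name}: {expected:.1f} min expected, {share:.1f} min allowed, {status}")
```

Sorting by expected downtime puts the component that most needs improvement, or added redundancy, at the top of the list.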

Conclusion

System reliability and overall product quality cannot rely on testing alone; they must be embedded in the design phase. Proper requirements, architecture, and fault‑tolerance mechanisms are essential for achieving high availability.
