How Alibaba Cloud OSS Achieves 99.995% Availability: Architecture & SLA Secrets
Alibaba Cloud Object Storage (OSS) boosts its availability SLA tenfold to 99.995% by employing rigorous 5‑minute error‑rate metrics, redundant local and city‑level architectures, sophisticated distributed systems like Nuwa and Pangu, robust IDC design, QoS, security defenses, and comprehensive management practices.
Overview
Object storage is widely used in Internet applications; when we watch videos, listen to music, share images, browse webpages, or shop online, the underlying data is stored in object storage. Application availability is directly linked to the storage service level agreement (SLA); a higher SLA means a better user experience.
In June 2020, Alibaba Cloud OSS improved its availability SLA tenfold, raising the standard‑type (city‑level redundancy) SLA from 99.95% to 99.995%.
How to Measure OSS Availability SLA
Understanding OSS availability requires reviewing industry availability metrics and the underlying technology.
Industry Common Availability Metrics
Availability is often expressed as annual downtime. Data center tiers (T1‑T4) have the following metrics:
T1: 99.671% availability, 28.8 hours annual downtime
T2: 99.741% availability, 22 hours annual downtime
T3: 99.982% availability, 1.6 hours annual downtime
T4: 99.995% availability, 0.4 hours annual downtime
A typical “five‑nines” (99.999%) availability corresponds to about 5 minutes of annual downtime.
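The relationship between an availability percentage and annual downtime can be checked with a few lines of Python. This is a minimal sketch that assumes a 365‑day year; the function name `annual_downtime_hours` is illustrative and not part of any OSS tooling. The computed values can differ slightly from the rounded figures quoted above.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def annual_downtime_hours(availability_percent: float) -> float:
    """Convert an availability percentage into hours of downtime per year."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


tiers = [
    ("T1", 99.671),
    ("T2", 99.741),
    ("T3", 99.982),
    ("T4", 99.995),
    ("five nines", 99.999),
]

for label, pct in tiers:
    hours = annual_downtime_hours(pct)
    print(f"{label}: {pct}% -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```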
OSS Uses a Stricter Metric
Object storage is a serverless API service; measuring availability by annual downtime is unsuitable. OSS calculates availability using the error rate "failed requests / total requests".
5‑Minute Error‑Rate Calculation
ErrorRate_5min = FailedRequests_5min / TotalValidRequests_5min * 100%
Using a 5‑minute granularity aligns with typical machine‑failure recovery times and yields a more customer‑centric error rate.
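As a rough illustration, the per‑interval error rate can be computed as follows. The function name and the handling of an interval with no valid requests are assumptions made for this sketch; the article does not say how OSS treats empty intervals.

```python
def error_rate_5min(failed_requests: int, total_valid_requests: int) -> float:
    """Return the error rate, in percent, for a single 5-minute interval."""
    if total_valid_requests == 0:
        # Assumption: an interval with no valid requests counts as 0% errors.
        return 0.0
    return failed_requests / total_valid_requests * 100


# Example: 12 failed requests out of 48,000 valid requests in one interval.
rate = error_rate_5min(12, 48_000)
print(f"{rate:.3f}%")
```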
Service Availability Based on 5‑Minute Error Rate
ServiceAvailability = (1 - Σ(ErrorRate_5min) / Total5minIntervals) * 100%
OSS is billed monthly, so the service period is a calendar month (a 30‑day month has 30 * 24 * 12 = 8640 five‑minute intervals). The monthly availability is 1 minus the average 5‑minute error rate.
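The monthly figure is simply 100% minus the mean of the per‑interval error rates. The sketch below assumes the error rates are already expressed in percent, one value per 5‑minute interval; `service_availability` is a hypothetical helper name.

```python
INTERVALS_PER_30_DAYS = 30 * 24 * 12  # 8640 five-minute intervals


def service_availability(error_rates_percent: list[float]) -> float:
    """Return availability in percent: 100 minus the average 5-minute error rate."""
    if not error_rates_percent:
        raise ValueError("at least one 5-minute interval is required")
    average_error_rate = sum(error_rates_percent) / len(error_rates_percent)
    return 100 - average_error_rate


# Example: a 30-day month in which exactly one interval failed completely.
rates = [0.0] * INTERVALS_PER_30_DAYS
rates[100] = 100.0
print(f"{service_availability(rates):.4f}%")
```

Under this model, one fully failed 5‑minute interval in a 30‑day month costs 100 / 8640 ≈ 0.0116 percentage points of availability.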
Model Comparison
Assuming 26 minutes of annual downtime, the traditional annual‑downtime model yields 99.995% availability. OSS’s error‑rate model spreads the same 26 minutes across 12 months (≈2.16 minutes per month). If all requests fail during a 2.16‑minute window that falls within a single 5‑minute interval, that interval’s error rate is (2.16 / 5) * 100% = 43.2%, leading to:
ServiceAvailability = (1 - 43.2% / 8640) * 100% = 99.995%
The result is comparable, but because OSS handles bandwidth‑intensive workloads, the actual availability may be slightly lower than 99.995%.
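The comparison can be reproduced numerically. This sketch assumes a 365‑day year, a 30‑day month, and that the whole monthly outage lands in a single 5‑minute interval with uniform request traffic; the variable names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60       # 525,600 minutes
INTERVALS_PER_MONTH = 30 * 24 * 12     # 8640 five-minute intervals

downtime_minutes_per_year = 26

# Traditional model: share of the year during which the service is up.
downtime_model = (1 - downtime_minutes_per_year / MINUTES_PER_YEAR) * 100

# Error-rate model: one month's share of the outage falls in one interval.
downtime_minutes_per_month = downtime_minutes_per_year / 12
interval_error_rate = downtime_minutes_per_month / 5 * 100  # percent
error_rate_model = 100 - interval_error_rate / INTERVALS_PER_MONTH

print(f"annual-downtime model: {downtime_model:.3f}%")
print(f"error-rate model:      {error_rate_model:.3f}%")
```

Both models round to 99.995% under these assumptions.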
OSS Availability SLA Targets
After extensive technical refinement, OSS now offers 99.99% availability for standard‑type (local redundancy) storage and 99.995% for standard‑type (city‑level redundancy) storage, a ten‑fold improvement over previous values.
OSS Availability System Construction
OSS’s high‑availability system is built on multiple dimensions: architecture, IDC design, distributed systems, security, and management mechanisms.
Local Redundancy and City‑Level Redundancy Architecture
OSS provides two storage types: local redundancy (single AZ) and city‑level redundancy (three AZs). Both share the same logical modules: Nuwa consistency service, Pangu distributed file system, Youchao KV metadata, OSS service backend, and network load balancer.
City‑level redundancy distributes data copies across three AZs, providing disaster tolerance. If a data center goes down, OSS continues to serve data with strong consistency, achieving an RTO and RPO of zero.
IDC Redundancy Design
Physical redundancy includes:
Multi‑AZ distance and latency design to meet strict latency requirements.
Power and cooling redundancy: dual power feeds, diesel generators, continuous cooling.
Network redundancy: external BGP multi‑ISP links, VPC redundant connections, and internal multi‑layer switching.
External network uses multi‑operator BGP and static bandwidth; internal network employs tiered switches ensuring continuity even if a device fails.
Distributed System Design
Nuwa Consistency Service
Nuwa, a core module of Alibaba Cloud’s Feitian system, provides consistency, distributed locks, and notifications. Compared with open‑source solutions like ZooKeeper or etcd, Nuwa offers superior performance, scalability, and operability.
Nuwa uses a two‑layer architecture: front‑end machines with VIP load balancing handle long‑lived client connections and hide backend switches; the back‑end consists of multiple Paxos groups implementing the consensus protocol.
Pangu Distributed File System
Pangu 2.0 is a self‑developed distributed storage system offering high performance, massive scale, and low cost. Its metadata layers (RootServer, NameSpaceServer, MetaServer) are redundantly designed, and the data layer (ChunkServer) supports “Non‑Stop‑Write” for rapid failover.
Youchao Distributed KV Metadata
Youchao KV stores metadata on top of Pangu, using partition groups with multiple replicas. A leader is elected via a consensus protocol; if the leader fails, a new leader is quickly chosen, ensuring high availability.
Object Service QoS
OSS’s service layer is stateless, allowing rapid failover. Multi‑tenant isolation and QoS monitoring are essential for guaranteeing tenant availability.
Network Load Balancing
To handle massive request volumes, OSS uses load balancers with VIP binding, integrating with front‑end clusters to achieve fast failover and high‑throughput access.
Security Protection
OSS offers HTTP/HTTPS access and must defend against DDoS and other attacks. Threats aim to degrade availability by congesting bandwidth or exhausting compute resources. Protection covers L3/L4 DDoS, layer‑4 CC, and layer‑7 CC attacks.
Management Mechanisms
Inventory Management: Predict resource demand and provision on‑demand to ensure availability.
Water‑Level Management: Monitor capacity, bandwidth, and QPS thresholds for dynamic scheduling.
Stability Culture: Establish stability standards across development, testing, and operations.
Double 11 Stress Testing: Continuous high‑traffic testing during major sales events such as Double 11 (November 11) to refine the architecture.
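Water‑level management, as described in the list above, amounts to comparing current usage against limits. The following is a purely illustrative sketch: the `WaterLevel` class, the 80% threshold, and the sample numbers are all hypothetical and do not reflect OSS's actual thresholds or scheduling logic.

```python
from dataclasses import dataclass


@dataclass
class WaterLevel:
    """Usage of one resource relative to its limit (hypothetical model)."""
    name: str
    used: float
    limit: float
    threshold: float = 0.8  # hypothetical alert ratio, not an OSS value

    @property
    def ratio(self) -> float:
        return self.used / self.limit

    def over_threshold(self) -> bool:
        return self.ratio >= self.threshold


def resources_needing_action(levels: list[WaterLevel]) -> list[str]:
    """Return the names of resources whose usage has reached the threshold."""
    return [level.name for level in levels if level.over_threshold()]


levels = [
    WaterLevel("capacity_tb", used=850, limit=1000),
    WaterLevel("bandwidth_gbps", used=30, limit=100),
    WaterLevel("qps", used=45_000, limit=50_000),
]
print(resources_needing_action(levels))
```

In this example, capacity (85%) and QPS (90%) exceed the assumed 80% threshold, while bandwidth (30%) does not.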
Future Work
Although OSS has achieved a ten‑fold SLA improvement, future efforts will focus on handling abnormal spikes, super‑hot spots, and high‑frequency attacks to further enhance availability.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
