How Alibaba Cloud OSS Achieves 99.995% Availability: Architecture & SLA Secrets

Alibaba Cloud Object Storage (OSS) boosts its availability SLA tenfold to 99.995% by employing rigorous 5‑minute error‑rate metrics, redundant local and city‑level architectures, sophisticated distributed systems like Nuwa and Pangu, robust IDC design, QoS, security defenses, and comprehensive management practices.

Alibaba Cloud Developer

Jul 1, 2020

Overview

Object storage is widely used in Internet applications: when we watch videos, listen to music, share images, browse webpages, or shop online, the underlying data is stored in object storage. Application availability is directly tied to the storage service level agreement (SLA); a higher SLA means a better user experience.

In June 2020, Alibaba Cloud OSS improved its availability SLA tenfold, raising the standard‑type (city‑level redundancy) SLA from 99.95% to 99.995%. In other words, the permitted error rate dropped from 0.05% to 0.005%.

How to Measure OSS Availability SLA

Understanding OSS availability requires reviewing industry availability metrics and the underlying technology.

Industry Common Availability Metrics

Availability is often expressed as annual downtime. Data center tiers (T1‑T4) have the following metrics:

T1: 99.671% availability, 28.8 hours annual downtime

T2: 99.741% availability, 22 hours annual downtime

T3: 99.982% availability, 1.6 hours annual downtime

T4: 99.995% availability, 0.4 hours annual downtime

The commonly cited “five nines” (99.999%) availability corresponds to about 5 minutes of annual downtime.
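The relationship between an availability percentage and annual downtime is simple arithmetic. The following is a minimal sketch, assuming a 365‑day year; the function name `annual_downtime_hours` is chosen here for illustration and does not come from the original article.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, implied by an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    tiers = [("T1", 99.671), ("T2", 99.741), ("T3", 99.982), ("T4", 99.995)]
    for tier, pct in tiers:
        print(f"{tier}: {annual_downtime_hours(pct):.1f} hours/year")
    # Five nines is easier to read in minutes than in hours.
    print(f"five nines: {annual_downtime_hours(99.999) * 60:.1f} minutes/year")
```

Computed this way, T2 comes out closer to 22.7 hours, which suggests the 22 hours listed above is a rounded figure; the other tiers match the list to one decimal place.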

OSS Uses a Stricter Metric

Object storage is a serverless API service, so annual downtime is a poor fit for measuring its availability. Instead, OSS calculates availability from the request error rate: failed requests divided by total requests.

5‑Minute Error‑Rate Calculation

ErrorRate_5min = FailedRequests_5min / TotalValidRequests_5min * 100%

Using a 5‑minute granularity aligns with typical machine‑failure recovery times and yields a more customer‑centric error rate.
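As a rough illustration of the formula above, the per‑window error rate could be computed as follows. This is a sketch, not OSS's implementation: the function name is hypothetical, and treating a window with no valid requests as error‑free is an assumption made here, since the article does not say how empty windows are handled.

```python
def error_rate_5min(failed_requests: int, total_valid_requests: int) -> float:
    """Error rate for one 5-minute window, returned as a fraction in [0, 1]."""
    if failed_requests < 0 or total_valid_requests < 0:
        raise ValueError("request counts must be non-negative")
    if failed_requests > total_valid_requests:
        raise ValueError("failed requests cannot exceed total valid requests")
    if total_valid_requests == 0:
        # Assumption: a window with no valid requests counts as error-free.
        return 0.0
    return failed_requests / total_valid_requests


if __name__ == "__main__":
    # 30 failures out of 120,000 valid requests in one window.
    print(f"{error_rate_5min(30, 120_000) * 100:.3f}%")
```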

Service Availability Based on 5‑Minute Error Rate

ServiceAvailability = (1 - Σ(ErrorRate_5min) / Total5minIntervals) * 100%

OSS bills monthly, so the service period is a calendar month (a 30‑day month contains 8640 five‑minute intervals). The monthly availability is therefore 1 minus the average 5‑minute error rate.
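The averaging step can be sketched in a few lines. The names below are illustrative assumptions, and error rates are passed as fractions (0.0 to 1.0) rather than percentages.

```python
from typing import Sequence

INTERVALS_PER_30_DAYS = 30 * 24 * 12  # 8640 five-minute intervals


def service_availability(error_rates: Sequence[float]) -> float:
    """Availability in percent: 1 minus the mean 5-minute error rate."""
    if not error_rates:
        raise ValueError("at least one 5-minute interval is required")
    return (1 - sum(error_rates) / len(error_rates)) * 100


if __name__ == "__main__":
    # A 30-day month in which exactly one interval failed completely.
    month = [1.0] + [0.0] * (INTERVALS_PER_30_DAYS - 1)
    print(f"{service_availability(month):.4f}%")
```

Because every interval carries equal weight, a single fully failed window lowers the monthly figure by 1/8640, regardless of how many requests arrived in that window.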

Model Comparison

Assume 26 minutes of annual downtime. The traditional annual‑downtime model yields 99.995% availability. Under OSS’s error‑rate model, the same 26 minutes spread evenly across the year is about 2.17 minutes per month. If every request fails during that 2.17‑minute window and traffic is otherwise uniform, the error rate of the affected 5‑minute interval is (2.17/5)*100%, and the monthly availability is:

1 - ((2.17 / 5) * 100%) / 8640 ≈ 99.995%

The two models give comparable results in this idealized case. Because OSS serves bandwidth‑intensive workloads, however, the availability measured under the error‑rate model may come out slightly lower than 99.995%.
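The comparison can be checked numerically. The sketch below assumes, as the text does, that the monthly share of the outage (26/12 minutes) falls inside a single 5‑minute interval and that requests arrive at a uniform rate; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60    # 525600
INTERVALS_PER_MONTH = 30 * 24 * 12  # 8640 five-minute intervals in 30 days


def downtime_model_availability(annual_downtime_min: float) -> float:
    """Traditional model: share of the year that is not downtime, in percent."""
    return (1 - annual_downtime_min / MINUTES_PER_YEAR) * 100


def error_rate_model_availability(monthly_outage_min: float) -> float:
    """Error-rate model for an outage confined to one 5-minute interval."""
    if not 0 <= monthly_outage_min <= 5:
        raise ValueError("this sketch only covers outages within one interval")
    # With uniform traffic, the failed fraction of requests in the interval
    # equals the failed fraction of time.
    interval_error_rate = monthly_outage_min / 5
    return (1 - interval_error_rate / INTERVALS_PER_MONTH) * 100


if __name__ == "__main__":
    print(f"downtime model:   {downtime_model_availability(26):.3f}%")
    print(f"error-rate model: {error_rate_model_availability(26 / 12):.3f}%")
```

Both results round to 99.995% at three decimal places under these assumptions.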

OSS Availability SLA Targets

After extensive technical refinement, OSS now offers 99.99% availability for standard‑type (local redundancy) storage and 99.995% for standard‑type (city‑level redundancy) storage, a ten‑fold improvement over previous values.
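"Ten‑fold" here refers to the permitted error rate rather than to the availability percentage itself. A small sketch, using the 99.95% and 99.995% figures quoted in this article and hypothetical function names, makes the ratio explicit.

```python
import math


def allowed_error_percent(sla_percent: float) -> float:
    """Share of requests (in percent) that may fail without breaching the SLA."""
    return 100 - sla_percent


def improvement_factor(old_sla_percent: float, new_sla_percent: float) -> float:
    """How many times smaller the allowed error rate is under the new SLA."""
    new_budget = allowed_error_percent(new_sla_percent)
    if new_budget <= 0:
        raise ValueError("a 100% SLA leaves no error budget to compare against")
    return allowed_error_percent(old_sla_percent) / new_budget


if __name__ == "__main__":
    factor = improvement_factor(99.95, 99.995)
    # Floating-point subtraction is inexact, so compare with a tolerance.
    print(math.isclose(factor, 10, rel_tol=1e-9))
```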

OSS Availability System Construction

OSS’s high‑availability system is built on multiple dimensions: architecture, IDC design, distributed systems, security, and management mechanisms.

Local Redundancy and City‑Level Redundancy Architecture

OSS provides two storage types: local redundancy (single AZ) and city‑level redundancy (three AZs). Both share the same logical modules: Nuwa consistency service, Pangu distributed file system, Youchao KV metadata, OSS service backend, and network load balancer.

City‑level redundancy distributes data copies across three AZs, providing disaster tolerance. If one data center suffers an outage, OSS continues to serve data with strong consistency, achieving a recovery time objective (RTO) and recovery point objective (RPO) of zero.

IDC Redundancy Design

Physical redundancy includes:

Multi‑AZ distance and latency design to meet strict latency requirements.

Power and cooling redundancy: dual power feeds, diesel generators, continuous cooling.

Network redundancy: external BGP multi‑ISP links, VPC redundant connections, and internal multi‑layer switching.

External network uses multi‑operator BGP and static bandwidth; internal network employs tiered switches ensuring continuity even if a device fails.

Distributed System Design

Nuwa Consistency Service

Nuwa, a core module of Alibaba Cloud’s Feitian (Apsara) system, provides consistency, distributed locks, and notifications. Compared with open‑source solutions like ZooKeeper or etcd, Nuwa offers superior performance, scalability, and operability.

Nuwa uses a two‑layer architecture. The front end consists of machines behind VIP load balancing that handle long‑lived client connections and mask back‑end switchovers; the back end consists of multiple Paxos groups implementing the consensus protocol.

Pangu Distributed File System

Pangu 2.0 is a distributed storage system developed in‑house, offering high performance, massive scale, and low cost. Its metadata layers (RootServer, NameSpaceServer, MetaServer) are redundantly designed, and the data layer (ChunkServer) supports “Non‑Stop Write” for rapid failover.

Youchao Distributed KV Metadata

Youchao KV stores metadata on top of Pangu, using partition groups with multiple replicas. A leader is elected via a consensus protocol; if the leader fails, a new leader is quickly chosen, ensuring high availability.

Object Service QoS

OSS’s service layer is stateless, allowing rapid failover. Multi‑tenant isolation and QoS monitoring are essential for guaranteeing tenant availability.
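The article does not say which mechanism OSS uses for tenant isolation. One common technique for this kind of per‑tenant throttling is a token bucket, sketched below under that assumption. All class names, rates, and capacities are hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TokenBucket:
    rate: float      # tokens refilled per second
    capacity: float  # maximum burst size
    tokens: float = field(init=False)
    last: float = 0.0  # timestamp of the previous call, in seconds

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, so a noisy tenant cannot drain another's quota."""

    def __init__(self, rate: float, capacity: float) -> None:
        self._rate = rate
        self._capacity = capacity
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant: str, now: float) -> bool:
        if tenant not in self._buckets:
            self._buckets[tenant] = TokenBucket(self._rate, self._capacity)
        return self._buckets[tenant].allow(now)


if __name__ == "__main__":
    limiter = TenantLimiter(rate=1.0, capacity=2.0)
    print([limiter.allow("tenant-a", now=0.0) for _ in range(3)])
    print(limiter.allow("tenant-b", now=0.0))
```

In this sketch tenant-a exhausts its burst of two requests and is throttled on the third, while tenant-b is unaffected because it draws from its own bucket.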

Network Load Balancing

To handle massive request volumes, OSS uses load balancers with VIP binding, integrating with front‑end clusters to achieve fast failover and high‑throughput access.
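The failover idea can be illustrated with a health‑aware round‑robin selector. This is a generic sketch rather than a description of Alibaba Cloud's load balancer: the class, its methods, and the backend names are all hypothetical, and real health checking is replaced by explicit `mark_down` and `mark_up` calls.

```python
from typing import Iterable, List, Set


class HealthAwareBalancer:
    """Round-robin over backends, skipping any that are marked unhealthy."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends: List[str] = list(backends)
        self._healthy: Set[str] = set(self._backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self) -> str:
        """Return the next healthy backend, or raise if none is available."""
        n = len(self._backends)
        for offset in range(n):
            index = (self._next + offset) % n
            candidate = self._backends[index]
            if candidate in self._healthy:
                self._next = (index + 1) % n
                return candidate
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    lb = HealthAwareBalancer(["fe-1", "fe-2", "fe-3"])
    print(lb.pick(), lb.pick())
    lb.mark_down("fe-3")  # simulate a front-end node failure
    print(lb.pick(), lb.pick())
```

Because the service layer is stateless, skipping a failed node requires no session migration: the next request simply lands on a healthy node.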

Security Protection

OSS offers HTTP/HTTPS access and must defend against DDoS and other attacks. Such attacks aim to degrade availability by congesting bandwidth or exhausting compute resources. Protection covers L3/L4 DDoS attacks as well as layer‑4 and layer‑7 CC (Challenge Collapsar) attacks.

Management Mechanisms

Inventory Management: Predict resource demand and provision on‑demand to ensure availability.

Water‑Level Management: Monitor utilization ("water levels") of capacity, bandwidth, and QPS against thresholds to drive dynamic scheduling.

Stability Culture: Establish stability standards across development, testing, and operations.

Double 11 Battle‑Testing: Repeated exposure to peak traffic during major sales events such as the Double 11 (November 11) shopping festival, used to refine the architecture.
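Of the mechanisms listed above, water‑level management lends itself to a short illustration. The sketch below flags any resource whose utilization reaches a threshold; the 80% default, the metric names, and the sample numbers are assumptions for illustration, not values from the article.

```python
from typing import Dict


def water_level_alerts(
    usage: Dict[str, float],
    limits: Dict[str, float],
    threshold: float = 0.8,  # hypothetical default: alert at 80% utilization
) -> Dict[str, float]:
    """Return the utilization of every metric at or above the threshold."""
    alerts: Dict[str, float] = {}
    for metric, limit in limits.items():
        if limit <= 0:
            raise ValueError(f"limit for {metric!r} must be positive")
        # Metrics missing from `usage` are treated as unused.
        utilization = usage.get(metric, 0.0) / limit
        if utilization >= threshold:
            alerts[metric] = utilization
    return alerts


if __name__ == "__main__":
    usage = {"capacity_tb": 850, "bandwidth_gbps": 40, "qps": 90_000}
    limits = {"capacity_tb": 1000, "bandwidth_gbps": 100, "qps": 100_000}
    for metric, level in water_level_alerts(usage, limits).items():
        print(f"{metric}: {level:.0%} of limit")
```

A scheduler could use such alerts to trigger capacity expansion or to steer new load toward clusters with lower water levels.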

Future Work

Although OSS has achieved a ten‑fold SLA improvement, future efforts will focus on handling abnormal spikes, super‑hot spots, and high‑frequency attacks to further enhance availability.

