How Baidu’s ARIES Powers Exabyte-Scale Cloud Storage for Baidu Netdisk

This article presents a comprehensive overview of Baidu’s ARIES storage platform, detailing its design philosophy, architecture, key concepts, and engineering challenges, and explains how it underpins Baidu Netdisk’s massive data‑plane storage with high availability, cost‑performance trade‑offs, and robust monitoring.

Baidu Intelligent Cloud Tech Hub

Nov 28, 2022

Background

The presentation introduces Baidu Canghai’s storage platform ARIES (A Reliable and Integrated Exabytes Storage) and its role as the data‑plane foundation for Baidu Netdisk.

Large‑Scale Data Storage Overview

Data growth trends are illustrated, followed by a three‑layer classification of data (basic, structural, and application layers) and a discussion of the scope of large‑scale storage, including modeling, access interfaces, distribution, replication, fault tolerance, backup, performance‑cost trade‑offs, and system operations.

Challenges

Challenges are divided into objective‑world factors (access behavior, distributed environment complexity, resource limits, cost‑performance‑durability trade‑offs) and organizational‑cultural factors (team expertise, demand priorities, organizational structure, culture).

ARIES Architecture

ARIES consists of four subsystems: resource‑management (Master, DataNode), user‑access (DataAgent, VolumeService, Allocator, StateService), repair/validation/cleanup (CheckService, TinkerService, Validator, Gardener), and tape‑storage (TapeService, TapeNode). The system follows a micro‑service, sub‑system design with high cohesion and low coupling.

Data Model

ARIES uses Slice as the basic immutable entity (default 4 MiB). A 128‑bit Slice ID encodes Volume ID (cluster ID + intra‑cluster ID) and Slice sequence (process version + local ID). Slices are encoded with Direct Erasure Coding (EC) and stored as Shard units on DataNodes. Volumes (tens of GB to 1 TiB) contain Volumelets, which map to physical storage on DataNodes. Table Spaces group Volumes with the same EC parameters and bind them to resource pools.
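
The ID layout described above can be sketched as a packed integer. This is a minimal illustration, not the ARIES encoding: the source only states that the Slice ID is 128 bits and names its four components, so the individual field widths (16/48/32/32), the class `SliceId`, and its method names are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Assumed field widths (hypothetical): the source only says the ID totals 128 bits.
FIELD_BITS = (
    ("cluster_id", 16),         # Volume ID, part 1
    ("volume_in_cluster", 48),  # Volume ID, part 2
    ("process_version", 32),    # Slice sequence, part 1
    ("local_id", 32),           # Slice sequence, part 2
)
assert sum(bits for _, bits in FIELD_BITS) == 128


@dataclass(frozen=True)
class SliceId:
    cluster_id: int
    volume_in_cluster: int
    process_version: int
    local_id: int

    def pack(self) -> int:
        """Concatenate the fields, most significant first, into one integer."""
        value = 0
        for name, bits in FIELD_BITS:
            field = getattr(self, name)
            if not 0 <= field < (1 << bits):
                raise ValueError(f"{name} does not fit in {bits} bits")
            value = (value << bits) | field
        return value

    @classmethod
    def unpack(cls, value: int) -> "SliceId":
        """Split a 128-bit integer back into its four fields."""
        if not 0 <= value < (1 << 128):
            raise ValueError("Slice ID must fit in 128 bits")
        fields = {}
        for name, bits in reversed(FIELD_BITS):
            fields[name] = value & ((1 << bits) - 1)
            value >>= bits
        return cls(**fields)

    def volume_id(self) -> int:
        """The upper 64 bits under the assumed layout: cluster ID + intra-cluster ID."""
        return (self.cluster_id << 48) | self.volume_in_cluster
```

Because the Volume ID occupies the high bits in this sketch, the owning Volume can be derived from a Slice ID with a shift and no lookup, which is one plausible reason to embed it in the ID.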

High Availability Design

Both the Put and Get paths are designed for high availability. A client can retry a failed API call against any DataAgent; the Allocator uses consistent hashing for redundancy; writes use a quorum-based single-phase commit (1PC); and reads fetch only the minimum number of shards needed for decoding, with backup requests covering slow or failed nodes. The design has no single-point bottleneck and tolerates failures at the rack, DC, and AZ levels, with conditional multi-AZ disaster recovery.
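
The write and read rules above can be sketched as follows. This is a simplified, sequential illustration under stated assumptions: the function names `quorum_put` and `get_min_shards`, the `send` and `fetch` callbacks, and the quorum value are hypothetical, and a real system would issue requests in parallel and send backup requests after a latency threshold instead of waiting for a failure.

```python
from typing import Callable, Dict, List, Optional


def quorum_put(shards: List[bytes],
               send: Callable[[int, bytes], bool],
               quorum: int) -> bool:
    """Single-phase write: send every shard once, succeed if enough nodes ack."""
    acks = 0
    for index, shard in enumerate(shards):
        try:
            if send(index, shard):
                acks += 1
        except ConnectionError:
            pass  # an unreachable node simply does not count toward the quorum
    return acks >= quorum


def get_min_shards(total: int,
                   needed: int,
                   fetch: Callable[[int], Optional[bytes]]) -> Dict[int, bytes]:
    """Read until `needed` shards are collected; later shards act as backups."""
    got: Dict[int, bytes] = {}
    for index in range(total):
        if len(got) == needed:
            break  # enough shards to decode, stop issuing requests
        data = fetch(index)
        if data is not None:
            got[index] = data
    if len(got) < needed:
        raise IOError("not enough shards available to decode the slice")
    return got
```

With six shards and a quorum of five, for example, a Put still succeeds when one DataNode is down, and a Get that needs four shards succeeds as long as any four of the six respond.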

Data Reliability Mechanisms

Reliability is ensured through real‑time end‑to‑end checks, background correctness/consistency/completeness audits, and cross‑system validation between PCS and ARIES. Automatic fault detection triggers data repair at the Volume level with Slice‑granularity prioritization. Additional mechanisms include Slice recycle bins, Master metadata backup, and DataNode metadata reconstruction.
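
One way to read "Volume-level repair with Slice-granularity prioritization" is that Volumes are queued for repair in order of their most endangered Slice. The sketch below assumes exactly that policy; the source does not describe the actual ordering rule, and `repair_order`, its input shape, and the `parity_shards` parameter are hypothetical.

```python
import heapq
from typing import Dict, List


def repair_order(lost_shards: Dict[str, List[int]], parity_shards: int) -> List[str]:
    """Return Volume IDs to repair, most endangered first.

    lost_shards maps a Volume ID to the number of lost shards for each of its
    Slices. A Volume's urgency is set by its worst Slice: the fewer further
    losses that Slice can survive, the earlier the Volume is repaired.
    """
    heap = []
    for volume_id, per_slice in lost_shards.items():
        worst = max(per_slice, default=0)
        if worst == 0:
            continue  # nothing to repair in this Volume
        remaining_tolerance = parity_shards - worst
        heapq.heappush(heap, (remaining_tolerance, volume_id))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Under this assumed policy, a Volume containing a Slice that has already lost all of its spare shards is repaired before a Volume whose Slices have each lost only one.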

Scalability and Resource Management

ARIES scales to exabyte clusters by managing metadata at Volume granularity, which keeps total metadata in the gigabyte range. It supports flexible EC configurations (for example, schemes with 1.2× storage overhead), dynamic balancing based on capacity utilization or replica-placement rules, and seamless addition of disks, machines, or clusters. Resource pools are mapped via Table Spaces, and the system handles diverse storage media (HDD, SMR, SSD, tape) across hierarchical physical layers (AZ → DC → rack → machine → disk).
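
A back-of-the-envelope calculation shows why Volume-granularity metadata stays small. The Volume and Slice sizes come from the article; the 1 KiB metadata record size and the function name `metadata_bytes` are assumptions made only to get concrete numbers.

```python
MIB = 2**20
GIB = 2**30
TIB = 2**40
EIB = 2**60

RECORD_BYTES = 1024  # assumed metadata size per tracked unit (hypothetical)


def metadata_bytes(capacity: int, unit_size: int, record_bytes: int = RECORD_BYTES) -> int:
    """Total metadata needed if every unit of `unit_size` gets one record."""
    return (capacity // unit_size) * record_bytes


# Tracking 1 EiB at Volume granularity (1 TiB Volumes): 2**20 records.
per_volume = metadata_bytes(EIB, TIB)

# Tracking the same capacity at Slice granularity (4 MiB Slices): 2**38 records.
per_slice = metadata_bytes(EIB, 4 * MIB)

print(per_volume // GIB, "GiB of metadata at Volume granularity")
print(per_slice // TIB, "TiB of metadata at Slice granularity")
```

With these assumed numbers, per-Volume tracking needs about 1 GiB for an exabyte of data, while per-Slice tracking would need 256 TiB, a factor of 2^18 more.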

Cost‑Performance Trade‑offs

ARIES saves space through low-overhead EC (raw-to-logical storage ratios as low as 1.2×) and compression. Hardware costs are kept down by using high-capacity, low-cost disks and tape libraries. Trade-offs are applied per workload: larger EC parameters reduce storage cost but increase latency; for hot small objects, higher replication improves performance; and tape offers cheap archival storage at the price of higher access latency.
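
The 1.2× figure follows from the standard erasure-coding overhead formula, (data shards + parity shards) / data shards. The sketch below applies it; the specific parameter pairs such as 10+2 are illustrative assumptions, since the article does not say which EC parameters ARIES uses.

```python
import math
from fractions import Fraction


def ec_overhead(data_shards: int, parity_shards: int) -> Fraction:
    """Raw bytes stored per logical byte for a (data + parity) EC scheme."""
    if data_shards <= 0 or parity_shards < 0:
        raise ValueError("need at least one data shard and no negative parity")
    return Fraction(data_shards + parity_shards, data_shards)


def replication_overhead(copies: int) -> Fraction:
    """Raw bytes stored per logical byte when keeping full copies."""
    return Fraction(copies)


def raw_bytes(logical_bytes: int, overhead: Fraction) -> int:
    """Raw capacity consumed, rounded up to a whole byte."""
    return math.ceil(logical_bytes * overhead)


# Illustrative comparison for 1 PiB of logical data.
PIB = 2**50
for label, overhead in [
    ("EC 10+2 (assumed parameters)", ec_overhead(10, 2)),
    ("EC 4+2 (assumed parameters)", ec_overhead(4, 2)),
    ("3 full replicas", replication_overhead(3)),
]:
    print(f"{label}: {float(overhead):.2f}x, {raw_bytes(PIB, overhead) / PIB:.2f} PiB raw")
```

The same formula shows the latency side of the trade-off: a 10+2 scheme must read ten shards from ten nodes to reconstruct a Slice, whereas a replicated object can be served from a single copy.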

Monitoring and Operations

Monitoring covers resource usage, service capacity, and reliability metrics, and Put/Get/Remove operations reach five-nines (99.999%) availability. Operational practices emphasize precise alerting, automated runbooks, pre-approved procedures, and a strong DevOps/SRE culture to ensure rapid incident response and continuous improvement.
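
To make the five-nines target concrete, the sketch below treats availability as the fraction of successful requests. That is an assumption: the article does not say whether the figure is measured per request or by uptime, and the function names here are hypothetical.

```python
from fractions import Fraction


def availability(succeeded: int, total: int) -> Fraction:
    """Fraction of requests that succeeded."""
    if total <= 0 or not 0 <= succeeded <= total:
        raise ValueError("need 0 <= succeeded <= total and total > 0")
    return Fraction(succeeded, total)


def error_budget(total: int, nines: int = 5) -> int:
    """Maximum failed requests allowed out of `total` at the given number of nines."""
    return total // 10**nines


def meets_target(succeeded: int, total: int, nines: int = 5) -> bool:
    """True if the failure rate is at most 10**-nines (integer arithmetic, no rounding)."""
    return (total - succeeded) * 10**nines <= total
```

At five nines, one billion Put requests leave a budget of 10,000 failures; a precise alert would fire when the observed failure count approaches that budget within the measurement window.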

Experience and Reflections

The team shares insights on demand management, long-term design foresight, treating fault handling as equal in importance to functionality, architectural evolution, ID design (expanding the Slice ID to 128 bits), testability, cross-team collaboration, formal verification, and the principle that every anomaly, however small, warrants thorough investigation.
