How Baidu’s CDS Uses Erasure Coding to Cut Storage Costs and I/O Amplification
This article explains Baidu Intelligent Cloud's block storage (CDS) architecture, comparing fault‑tolerance methods, detailing the challenges of large‑scale erasure‑coded storage, and describing Baidu's two‑layer append‑engine solution that reduces I/O amplification while keeping costs low.
1. Data Fault Tolerance Comparison
Data fault tolerance differs between single‑node and distributed systems. On a single node, RAID (typically RAID5) or software RAID is used. In distributed environments, the common approaches are multi‑replica (using consensus protocols like Paxos or Raft) and distributed erasure coding, which is essentially a distributed RAID.
Multi‑replica stores identical copies on different machines; with N replicas, up to N‑1 failures can be tolerated. Erasure coding splits data into K equal shards and computes M parity shards (e.g., with Reed‑Solomon codes). The K+M shards are distributed across machines, tolerating up to M failures. The storage overhead is (K+M)/K: for example, 1.375× with K=8 and M=3, versus 3× for three‑replica storage. The trade‑off is extra CPU for encoding/decoding and additional I/O overhead.
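The overhead and fault-tolerance arithmetic above can be sketched as follows; the K=8, M=3 parameters are an illustrative choice, not a documented CDS configuration:

```python
# Storage overhead comparison: N-replica vs. K+M erasure coding.

def replica_overhead(n: int) -> float:
    """N identical copies cost N times the logical data size."""
    return float(n)

def ec_overhead(k: int, m: int) -> float:
    """K data shards plus M parity shards cost (K+M)/K times the data size."""
    return (k + m) / k

def ec_fault_tolerance(m: int) -> int:
    """A Reed-Solomon(K, M) stripe survives the loss of any M shards."""
    return m

print(replica_overhead(3))   # 3.0x the logical size, tolerates 2 failures
print(ec_overhead(8, 3))     # 1.375x the logical size, tolerates 3 failures
```

Note that an 8+3 scheme is both cheaper than three replicas and tolerates one more simultaneous failure, which is why EC dominates at scale when the I/O pattern permits it.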
2. Large‑Scale Block Storage EC Challenges
Erasure coding makes in‑place modification expensive: a small update requires reading the affected data, recomputing the parity shards, and writing everything back, a read‑modify‑write cycle that amplifies I/O. Small I/O (e.g., 4 KB) is especially problematic for EC, which favors large, aligned writes.
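A rough I/O count makes the amplification concrete. This is a simplified model of the delta‑parity update path (read the old data chunk and each old parity chunk, write the new data chunk and each new parity chunk), not CDS's actual accounting:

```python
# I/O operations needed for one sub-stripe ("small") write.

def ec_small_write_ios(m: int) -> int:
    """Reed-Solomon(K, M) delta-parity update of a single data chunk."""
    reads = 1 + m    # old data chunk + M old parity chunks
    writes = 1 + m   # new data chunk + M new parity chunks
    return reads + writes

def replica_small_write_ios(n: int) -> int:
    """Replication needs no reads: just one write per replica."""
    return n

print(ec_small_write_ios(3))       # 8 I/Os for a single 4 KB update
print(replica_small_write_ios(3))  # 3 I/Os under 3-replica
```

Under this model a 4 KB overwrite costs roughly 8 disk operations with an M=3 code versus 3 with three replicas, which is exactly the gap the cache layer described below is designed to close.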
Block storage workloads contain many small writes, whereas object storage typically writes whole objects and so avoids EC‑related amplification. To handle small writes efficiently, Baidu uses a three‑replica cache layer for small I/O and applies EC directly to large I/O.
3. Baidu’s Implementation Solution
Baidu’s CDS builds an index layer that points to EC‑encoded data; modifications are performed by appending new EC‑encoded segments rather than updating in place. Each write creates a new slice, and the index is updated to reference the latest slice, forming an append‑only engine.
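A minimal sketch of such an append‑only engine follows; the class and field names are illustrative, not CDS's actual data structures:

```python
# Append-only slice engine: every overwrite appends a new slice and
# repoints the index entry; old slices become garbage for compaction.

class AppendEngine:
    def __init__(self):
        self.log = []    # append-only sequence of physical slices
        self.index = {}  # logical block -> position of latest slice

    def write(self, block: int, data: bytes) -> None:
        self.log.append(data)                  # never update in place
        self.index[block] = len(self.log) - 1  # index tracks the latest slice

    def read(self, block: int) -> bytes:
        return self.log[self.index[block]]

eng = AppendEngine()
eng.write(7, b"v1")
eng.write(7, b"v2")   # appends a new slice; the v1 slice is untouched
print(eng.read(7))    # b'v2'
print(len(eng.log))   # 2 -- the stale v1 slice awaits compaction
```

The key property is that EC‑encoded data is never modified in place, so the read‑modify‑write penalty never arises; the price is garbage slices, which is what the compaction strategy below exists to reclaim.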
The underlying EC system, Aries, provides the storage substrate. For small writes, data is first cached in a three‑replica layer; once it reaches a threshold (e.g., 1 GB), it is encoded with EC and stored. This hybrid approach avoids excessive I/O amplification for small writes while keeping large writes efficiently encoded.
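The cache‑then‑encode flow can be sketched as a threshold‑triggered flush. The 1 GB figure is from the article; the class and method names are hypothetical:

```python
# Replica-cache flush path: small writes accumulate in the 3-replica
# layer; once the batch reaches a threshold it is sealed, EC-encoded,
# and migrated to the EC layer as one large sequential write.

FLUSH_THRESHOLD = 1 << 30  # 1 GB, per the article

class ReplicaCache:
    def __init__(self):
        self.buffered = 0  # bytes currently held in the replica layer
        self.flushes = 0   # completed migrations to the EC layer

    def append(self, size_bytes: int) -> None:
        self.buffered += size_bytes
        if self.buffered >= FLUSH_THRESHOLD:
            self._flush_to_ec()

    def _flush_to_ec(self) -> None:
        # The real system would EC-encode the sealed batch and write
        # K+M shards; here we only count the migration.
        self.flushes += 1
        self.buffered = 0

cache = ReplicaCache()
for _ in range(256 * 1024):  # 256Ki writes of 4 KB = exactly 1 GB
    cache.append(4 * 1024)
print(cache.flushes)         # 1
```

Batching this way converts a stream of 4 KB random writes into a single large, aligned EC write, which is the access pattern erasure coding handles well.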
Both logical and physical layers use append‑only storage, simplifying space allocation and improving performance on SSD/HDD. Compaction is performed using a cost‑benefit algorithm that considers both hole ratio and segment age, selecting segments that balance space reclamation and write amplification.
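The cost‑benefit selection can be sketched with the classic log‑structured‑storage formula, which weighs a segment's hole ratio against its age; the article does not give CDS's exact formula, so this variant is for illustration:

```python
# Cost-benefit victim selection for compaction: prefer segments that
# are mostly holes or have been stable (cold) for a long time.

def cost_benefit(hole_ratio: float, age: float) -> float:
    live = 1.0 - hole_ratio  # fraction of the segment still referenced
    cost = 1.0 + live        # read the whole segment + rewrite live data
    return (hole_ratio * age) / cost

segments = {
    "hot-sparse": cost_benefit(hole_ratio=0.6, age=1.0),
    "cold-dense": cost_benefit(hole_ratio=0.3, age=10.0),
}
victim = max(segments, key=segments.get)
print(victim)  # cold-dense wins despite having fewer holes
```

The point of the age term is that cleaning a cold segment now reclaims space that will stay reclaimed, whereas a hot segment would quickly accumulate holes again, wasting the copy‑out write amplification.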
Overall, the two‑layer append architecture, size‑aware EC handling, and intelligent compaction strategy enable Baidu’s block storage to achieve low cost, low I/O amplification, and high performance at massive scale.
Baidu Intelligent Cloud Tech Hub
We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.