How Baidu’s Next‑Gen Metadata Engine Powers Trillion‑Object Object Storage
This article details Baidu's Cloud Storage (BOS) architecture, the challenges of its legacy metadata system, and the design of a new generation metadata engine that enables trillion‑object buckets, million‑QPS performance, hierarchical namespaces, and intelligent lifecycle management.
1. Object Storage BOS Overview
Baidu Cloud Storage (BOS) provides stable, secure, and efficient storage for any data type, supporting files up to 48.8 TB and offering a standard RESTful HTTP API. Its architecture consists of four layers: load balancing, Web service, storage, and hardware.
The top layer is a self-developed Layer 4 load balancer that runs in cluster mode with no single point of failure and handles attack protection and load distribution.
The Web Service layer is stateless and horizontally scalable, handling HTTP parsing, flow control, authentication, data chunking, and stream control.
The storage layer comprises two independent distributed systems: a NewSQL-based metadata store and a data store that supports erasure coding, ultra-low replica counts, and EB-scale capacity. The data store achieves write throughput of up to 200 MB/s, with a replica factor (raw bytes stored per logical byte) as low as 1.1.
The hardware layer supports SSD, HDD, and tape, offering six storage classes (standard‑multi‑AZ, standard, low‑frequency‑multi‑AZ, low‑frequency, cold, and archive), with cold storage being the industry’s first archive‑type storage that requires no pre‑retrieval.
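One way to read the replica factor of 1.1 quoted for the data store is as the storage overhead of an erasure-coded layout: with k data fragments and m parity fragments, each logical byte costs (k + m) / k raw bytes. The sketch below only works through that arithmetic. The 20+2 and 10+4 layouts are illustrative assumptions, not a description of the scheme BOS actually uses.

```python
from fractions import Fraction

def storage_overhead(data_fragments: int, parity_fragments: int) -> Fraction:
    """Raw bytes stored per logical byte for a k+m erasure-coded layout."""
    if data_fragments <= 0 or parity_fragments < 0:
        raise ValueError("need k > 0 and m >= 0")
    return Fraction(data_fragments + parity_fragments, data_fragments)

# Hypothetical layouts, chosen only to illustrate the arithmetic.
three_way_replication = Fraction(3, 1)
ec_20_2 = storage_overhead(20, 2)   # 22/20 = 1.1
ec_10_4 = storage_overhead(10, 4)   # 14/10 = 1.4

print(float(ec_20_2), float(ec_10_4), float(three_way_replication))
```

Under this reading, a factor of 1.1 needs a wide stripe with few parity fragments, which is why it is far cheaper than three-way replication.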
2. Metadata Architecture and Challenges
The previous metadata system hashed bucket IDs to a few DataNodes, leading to several issues:
Limited bucket capacity (hundreds of billions of objects at best).
Throughput capped at tens of thousands of QPS per bucket.
Severe data skew across nodes.
Performance trade‑off between high throughput and List‑objects latency.
Poor transaction support, requiring external middleware for operations like object rename.
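The capacity and skew problems listed above follow directly from placing data by bucket ID. The sketch below is a simplified illustration, not the legacy BOS code: it pins each bucket to a single node (the text says "a few"), and the node names and bucket sizes are made up.

```python
import hashlib
from collections import Counter

NODES = [f"datanode-{i}" for i in range(8)]  # hypothetical cluster

def node_for_bucket(bucket_id: str) -> str:
    """Legacy-style placement: the bucket ID alone decides the node."""
    digest = hashlib.sha256(bucket_id.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# Hypothetical object counts: one very large bucket and fifty small ones.
bucket_sizes = {"bucket-big": 1_000_000, **{f"bucket-{i}": 1_000 for i in range(50)}}

load = Counter()
for bucket_id, object_count in bucket_sizes.items():
    load[node_for_bucket(bucket_id)] += object_count

hot_node = node_for_bucket("bucket-big")
print(hot_node, load[hot_node], sum(load.values()))
```

Whatever node the large bucket hashes to ends up holding at least 95% of all entries in this example, and adding nodes does not help because the bucket cannot be spread further.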
To address these issues, a new metadata system was launched in early 2020. Its goals were trillion-object buckets, million-level QPS on latency-sensitive APIs, ease of operation, balanced data distribution, easy scaling, and database-like features such as transactions, secondary indexes, backup, and change data capture (CDC).
Key improvements include:
Finer‑grained data management using sub‑4 GB shards for easier migration and balancing.
Raft‑based replication and leader election for high reliability.
Master‑managed shard distribution with range partitioning for global ordering.
MVCC‑based transaction support enabling efficient object rename.
Robust backup, streaming export, and real‑time incremental sync supporting daily export of billions of records.
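To make the MVCC item above concrete, here is a minimal sketch of how a multi-version store can turn object rename into a single atomic transaction. It is a toy under stated assumptions: `MVCCStore` and its methods are hypothetical names, the counter stands in for a real timestamp service, and conflict detection between concurrent transactions is omitted.

```python
import itertools
from typing import Dict, List, Optional, Tuple

class MVCCStore:
    """Toy multi-version key-value store; a value of None is a tombstone."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Tuple[int, Optional[str]]]] = {}
        self._clock = itertools.count(1)   # stand-in for a timestamp service
        self.last_commit_ts = 0

    def commit(self, writes: Dict[str, Optional[str]]) -> int:
        """Apply all writes under one commit timestamp so they appear together."""
        ts = next(self._clock)
        for key, value in writes.items():
            self._versions.setdefault(key, []).append((ts, value))
        self.last_commit_ts = ts
        return ts

    def get(self, key: str, snapshot_ts: int) -> Optional[str]:
        """Return the newest version visible at snapshot_ts, or None."""
        visible = [v for ts, v in self._versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

    def rename(self, src: str, dst: str) -> int:
        """Rename as one transaction: tombstone src and write dst atomically."""
        value = self.get(src, self.last_commit_ts)
        if value is None:
            raise KeyError(src)
        return self.commit({src: None, dst: value})

store = MVCCStore()
t1 = store.commit({"logs/a.txt": "meta-v1"})
t2 = store.rename("logs/a.txt", "archive/a.txt")
print(store.get("logs/a.txt", t1), store.get("archive/a.txt", t2))
```

A reader at snapshot `t1` still sees the object under its old name, and a reader at `t2` sees it only under the new name. No snapshot observes both names or neither, which is the property the legacy system needed external middleware to provide.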
3. Solving Core Scalability Issues
Scalability hinges on three variables: per‑node data capacity, data distribution across nodes, and cluster size. Optimizations include:
RocksDB‑based engine with separated log and data I/O, AEP media for logs, aggressive compression, and low‑overhead compaction, allowing a single node to handle ~10 billion entries.
Composite partitioning (hash then range) and automatic shard split/merge for balanced distribution.
Heartbeat aggregation at the node level, reducing Master load in clusters of more than 1,000 nodes.
Distributed Time Service and client‑side routing cache to further lessen Master pressure.
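The composite partitioning and automatic split items above can be sketched as follows. This is an illustration under assumptions, not the BOS implementation: the bucket name is hashed to pick a partition, keys inside a partition are range-partitioned into sorted shards, and a shard splits in half when it exceeds a threshold. The partition count, the tiny threshold, and all names are hypothetical, and shard merge is left out.

```python
import bisect
import hashlib
from typing import Dict, List

NUM_HASH_PARTITIONS = 4      # hypothetical
MAX_KEYS_PER_SHARD = 4       # tiny threshold so the split is visible

def hash_partition(bucket: str) -> int:
    digest = hashlib.sha256(bucket.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_HASH_PARTITIONS

class RangePartition:
    """Sorted shards inside one hash partition; shard i holds keys >= starts[i]."""

    def __init__(self) -> None:
        self.starts: List[str] = [""]          # first shard covers the whole range
        self.shards: List[List[str]] = [[]]    # each shard keeps its keys sorted

    def locate(self, key: str) -> int:
        """Index of the shard whose range contains key."""
        return bisect.bisect_right(self.starts, key) - 1

    def put(self, key: str) -> None:
        idx = self.locate(key)
        shard = self.shards[idx]
        pos = bisect.bisect_left(shard, key)
        if pos == len(shard) or shard[pos] != key:
            shard.insert(pos, key)
        if len(shard) > MAX_KEYS_PER_SHARD:
            self._split(idx)

    def _split(self, idx: int) -> None:
        """Cut an oversized shard at its median key into two adjacent shards."""
        shard = self.shards[idx]
        mid = len(shard) // 2
        left, right = shard[:mid], shard[mid:]
        self.shards[idx] = left
        self.shards.insert(idx + 1, right)
        self.starts.insert(idx + 1, right[0])

partitions: Dict[int, RangePartition] = {
    p: RangePartition() for p in range(NUM_HASH_PARTITIONS)
}

def put_object(bucket: str, object_key: str) -> None:
    partitions[hash_partition(bucket)].put(f"{bucket}/{object_key}")

for i in range(10):
    put_object("photos", f"img-{i:03d}.jpg")

part = partitions[hash_partition("photos")]
listing = [k for shard in part.shards for k in shard]
print(len(part.shards), listing[:2])
```

Because shards rather than buckets are the unit of placement in this sketch, a large bucket becomes many small shards that can live on different nodes, while concatenating the shards in order still yields a globally sorted listing.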
Together, these measures enable a single cluster of more than 1,000 machines to support trillion-object buckets.
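A quick back-of-the-envelope check shows why these figures are consistent. The per-node capacity and cluster size come from the text; the replica count of 3 is an assumption (a common Raft group size) that the source does not state.

```python
ENTRIES_PER_NODE = 10_000_000_000   # ~10 billion entries per node, from the text
NODES = 1_000                       # cluster size, from the text
REPLICAS = 3                        # assumption: not stated in the source

raw_entries = ENTRIES_PER_NODE * NODES        # entries stored across all nodes
unique_entries = raw_entries // REPLICAS      # distinct entries after replication

print(f"{raw_entries:.1e} raw, {unique_entries:.1e} unique")
```

Even after dividing by the assumed replica count, the cluster holds more than 3 trillion distinct entries, so a trillion-object bucket fits within one cluster.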
4. Enabling Product Feature Upgrades
The new metadata system powers two major BOS enhancements:
Hierarchical Namespace: Moves from a flat namespace to a true directory tree, allowing folder moves in a single RPC with millisecond latency and supporting efficient lookups (≈2 ms) via batch queries and caching.
Intelligent Lifecycle: Introduces six storage tiers with automatic demotion based on access patterns. Offline processing merges massive access logs to compute last-access timestamps, enabling cost-effective tiering without per-request metadata writes.
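The reason a true directory tree makes folder moves cheap can be shown with a small sketch. It assumes a common design in which each entry is keyed by (parent ID, name), so a move rewrites one entry regardless of how many descendants the folder has. The `Namespace` class and its methods are hypothetical, and checks such as preventing a move into the folder's own subtree are omitted.

```python
from typing import Dict, Tuple

ROOT_ID = 0

class Namespace:
    """Toy directory tree: each entry maps (parent_id, name) to a child id."""

    def __init__(self) -> None:
        self.entries: Dict[Tuple[int, str], int] = {}
        self._next_id = 1

    def create(self, parent_id: int, name: str) -> int:
        if (parent_id, name) in self.entries:
            raise FileExistsError(name)
        child_id = self._next_id
        self._next_id += 1
        self.entries[(parent_id, name)] = child_id
        return child_id

    def resolve(self, path: str) -> int:
        """Walk the path one component at a time from the root."""
        node = ROOT_ID
        for part in filter(None, path.split("/")):
            node = self.entries[(node, part)]
        return node

    def move(self, src_parent: int, name: str, dst_parent: int, new_name: str) -> None:
        """Re-parent one entry; descendants keep their (parent_id, name) keys."""
        if (dst_parent, new_name) in self.entries:
            raise FileExistsError(new_name)
        self.entries[(dst_parent, new_name)] = self.entries.pop((src_parent, name))

ns = Namespace()
data = ns.create(ROOT_ID, "data")
raw = ns.create(data, "raw")
part_file = ns.create(raw, "part-0001.parquet")
archive = ns.create(ROOT_ID, "archive")

ns.move(data, "raw", archive, "raw-2020")
print(ns.resolve("/archive/raw-2020/part-0001.parquet") == part_file)
```

In a flat namespace the same move would have to rewrite the key of every object under the prefix. Here only one entry changes, which is what makes a single-RPC move plausible.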
These upgrades significantly improve performance, reduce storage costs, and expand BOS capabilities for big data, AI training, and archival workloads.
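The offline log merge behind Intelligent Lifecycle can be sketched as a reduce over access records followed by a rule lookup. This is a minimal illustration: the tier names come from the storage classes listed earlier, but the idle-day thresholds, the record format, and the function names are assumptions, not BOS policy.

```python
from typing import Dict, Iterable, Tuple

AccessRecord = Tuple[str, int]   # (object key, access time in epoch seconds)

def merge_last_access(*log_batches: Iterable[AccessRecord]) -> Dict[str, int]:
    """Reduce any number of access-log batches to one last-access time per key."""
    last_access: Dict[str, int] = {}
    for batch in log_batches:
        for key, ts in batch:
            if ts > last_access.get(key, -1):
                last_access[key] = ts
    return last_access

DAY = 86_400
# Hypothetical demotion rules: (idle-day threshold, target tier), coldest first.
RULES = [(180, "archive"), (90, "cold"), (30, "low-frequency")]

def target_tier(last_access_ts: int, now: int, current: str = "standard") -> str:
    """Pick the coldest tier whose idle threshold the object has reached."""
    idle_days = (now - last_access_ts) // DAY
    for threshold, tier in RULES:
        if idle_days >= threshold:
            return tier
    return current

now = 1_000 * DAY
batch_a = [("a.bin", now - 200 * DAY), ("b.bin", now - 40 * DAY)]
batch_b = [("a.bin", now - 100 * DAY), ("c.bin", now - 1 * DAY)]

merged = merge_last_access(batch_a, batch_b)
plan = {key: target_tier(ts, now) for key, ts in merged.items()}
print(plan)
```

Because the last-access times are derived from logs in batch, the read path never has to update metadata, which is the cost saving the text describes.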
Baidu Intelligent Cloud Tech Hub
