
How Baidu’s Next‑Gen Metadata Engine Powers Trillion‑Object Object Storage

This article describes the architecture of Baidu Object Storage (BOS), the limitations of its legacy metadata system, and the design of a new‑generation metadata engine that enables trillion‑object buckets, million‑QPS throughput, hierarchical namespaces, and intelligent lifecycle management.

Baidu Intelligent Cloud Tech Hub

1. Object Storage BOS Overview

Baidu Object Storage (BOS) provides stable, secure, and efficient storage for any data type. It supports objects up to 48.8 TB and exposes a standard RESTful HTTP API. Its architecture consists of four layers: load balancing, Web service, storage, and hardware.

The top layer is an in‑house Layer 4 load balancer that runs in cluster mode with no single point of failure and handles attack protection and load distribution.

The Web Service layer is stateless and horizontally scalable, handling HTTP parsing, flow control, authentication, data chunking, and stream control.

The storage layer comprises two independent distributed systems: a NewSQL‑based metadata store and a data store. The data store supports erasure coding and EB‑scale capacity, with write throughput of up to 200 MB/s and a storage redundancy factor as low as 1.1.
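One way to read the 1.1 figure is as raw bytes stored per logical byte. For an erasure code with k data fragments and m parity fragments, that ratio is (k + m) / k. The sketch below only illustrates the arithmetic; the function name and the layouts are assumptions, and the source does not say which erasure‑coding parameters BOS actually uses.

```python
from fractions import Fraction


def redundancy_ratio(data_fragments: int, parity_fragments: int) -> Fraction:
    """Raw bytes stored per logical byte for a (k data + m parity) layout."""
    if data_fragments <= 0 or parity_fragments < 0:
        raise ValueError("need k > 0 and m >= 0")
    return Fraction(data_fragments + parity_fragments, data_fragments)


# Hypothetical layouts for comparison, not BOS's real configuration.
# (1, 2) models classic three-way replication as one data copy plus two extras.
for k, m in [(1, 2), (8, 4), (20, 2)]:
    ratio = redundancy_ratio(k, m)
    print(f"{k}+{m}: {float(ratio):.2f}x raw storage, survives {m} lost fragments")
```

Under these assumptions, a wide layout such as 20+2 reaches a ratio of 1.1, while three‑way replication costs 3.0.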

The hardware layer supports SSD, HDD, and tape, and offers six storage classes: standard‑multi‑AZ, standard, low‑frequency‑multi‑AZ, low‑frequency, cold, and archive. Cold storage is described as the industry’s first archive‑type storage class that can be read without a prior retrieval step.

2. Metadata Architecture and Challenges

The previous metadata system hashed bucket IDs to a few DataNodes, leading to several issues:

Limited bucket capacity (hundreds of billions of objects at best).

Throughput capped at tens of thousands of QPS per bucket.

Severe data skew across nodes.

Performance trade‑off between high throughput and List‑objects latency.

Poor transaction support, requiring external middleware for operations like object rename.
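The skew and capacity limits listed above follow directly from placing data by bucket ID. The toy model below is a sketch under simplifying assumptions: each bucket maps to exactly one node (the legacy system used a few nodes per bucket), and the function names, node count, and bucket sizes are invented for illustration.

```python
import hashlib
from collections import Counter


def node_for_bucket(bucket_id: str, num_nodes: int) -> int:
    """Pick a node from a stable hash of the bucket ID (assumed scheme)."""
    digest = hashlib.sha256(bucket_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def objects_per_node(bucket_sizes: dict[str, int], num_nodes: int) -> Counter:
    """Total object count that lands on each node."""
    load = Counter({node: 0 for node in range(num_nodes)})
    for bucket_id, object_count in bucket_sizes.items():
        load[node_for_bucket(bucket_id, num_nodes)] += object_count
    return load


# Hypothetical workload: one very large bucket and a thousand small ones.
NUM_NODES = 8
bucket_sizes = {"huge-bucket": 1_000_000_000}
bucket_sizes.update({f"small-{i}": 1_000 for i in range(1_000)})

load = objects_per_node(bucket_sizes, NUM_NODES)
hot_node = node_for_bucket("huge-bucket", NUM_NODES)
for node in range(NUM_NODES):
    marker = "  <- holds huge-bucket" if node == hot_node else ""
    print(f"node {node}: {load[node]:>13,} objects{marker}")
```

However evenly the small buckets spread, the node that owns the large bucket carries almost all of the data, and that bucket can never grow beyond what its node can hold.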

To address these issues, a new metadata system was launched in early 2020. Its goals were trillion‑object buckets, million‑QPS throughput with low API latency, operational friendliness, balanced data distribution, easy scaling, and database‑like features such as transactions, secondary indexes, backup, and change data capture (CDC).

Key improvements include:

Finer‑grained data management using sub‑4 GB shards for easier migration and balancing.

Raft‑based replication and leader election for high reliability.

Master‑managed shard distribution with range partitioning for global ordering.

MVCC‑based transaction support enabling efficient object rename.

Robust backup, streaming export, and real‑time incremental sync supporting daily export of billions of records.
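To make the MVCC point concrete, the sketch below shows how a rename can be expressed as one transaction that writes the new key and a tombstone for the old key at a single commit timestamp. It is a single‑process illustration, not the BOS implementation: `MVCCStore`, `rename`, and the key names are hypothetical, and a real system would also need cross‑shard commit, locking, and garbage collection of old versions.

```python
import itertools


class MVCCStore:
    """Toy multi-version store: every key keeps a list of (commit_ts, value)."""

    def __init__(self):
        self._versions = {}              # key -> [(commit_ts, value)]; None = tombstone
        self._clock = itertools.count(1)

    def now(self):
        """Hand out a strictly increasing timestamp."""
        return next(self._clock)

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def commit(self, start_ts, writes):
        """Apply all writes at one commit timestamp, or fail on a conflict."""
        for key in writes:
            versions = self._versions.get(key, [])
            if versions and versions[-1][0] > start_ts:
                raise RuntimeError(f"write conflict on {key!r}")
        commit_ts = self.now()
        for key, value in writes.items():
            self._versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts


def rename(store, old_key, new_key):
    """Move metadata from old_key to new_key in a single transaction."""
    start_ts = store.now()
    meta = store.read(old_key, start_ts)
    if meta is None:
        raise KeyError(old_key)
    if store.read(new_key, start_ts) is not None:
        raise FileExistsError(new_key)
    return store.commit(start_ts, {new_key: meta, old_key: None})


store = MVCCStore()
before = store.commit(store.now(), {"bucket/a.txt": {"size": 42}})
after = rename(store, "bucket/a.txt", "bucket/b.txt")
print(store.read("bucket/a.txt", before))  # a snapshot taken before the rename
print(store.read("bucket/a.txt", after))   # the old name after the rename
print(store.read("bucket/b.txt", after))   # the new name after the rename
```

Readers holding an older snapshot still see the object under its old name, and no reader ever sees both names or neither, which is the property that previously required external middleware.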

3. Solving Core Scalability Issues

Scalability hinges on three variables: per‑node data capacity, data distribution across nodes, and cluster size. Optimizations include:

A RocksDB‑based engine with separate I/O paths for logs and data, AEP (Intel Optane persistent memory) media for logs, aggressive compression, and low‑overhead compaction, allowing a single node to handle roughly 10 billion entries.

Composite partitioning (hash then range) and automatic shard split/merge for balanced distribution.

Heartbeat aggregation at the node level, which reduces Master load in clusters of more than 1,000 nodes.

Distributed Time Service and client‑side routing cache to further lessen Master pressure.

These measures enable a single cluster of over 1,000 machines to support trillion‑object buckets.
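The composite partitioning and automatic split ideas from the list above can be sketched as follows. This is one plausible reading, stated as an assumption: a hash of the bucket name selects a partition group, and inside the group the keys are range‑partitioned into shards that split once they exceed a size limit. The class names and the tiny key‑count threshold are hypothetical (the article's shards are bounded by size, under 4 GB), and shard merge and placement on nodes are omitted.

```python
import bisect
import hashlib


class RangeGroup:
    """Range-partitioned shards for one hash slot; shards split when too large."""

    def __init__(self, max_keys_per_shard):
        self.max_keys = max_keys_per_shard
        self.starts = [""]   # sorted lower bounds; shard i covers [starts[i], starts[i+1])
        self.shards = [[]]   # each shard keeps its keys sorted

    def _index(self, key):
        """Index of the shard whose range contains key."""
        return bisect.bisect_right(self.starts, key) - 1

    def put(self, key):
        i = self._index(key)
        shard = self.shards[i]
        pos = bisect.bisect_left(shard, key)
        if pos < len(shard) and shard[pos] == key:
            return                      # key already present
        shard.insert(pos, key)
        if len(shard) > self.max_keys:
            self._split(i)

    def _split(self, i):
        """Cut shard i at its median key into two adjacent shards."""
        shard = self.shards[i]
        mid = len(shard) // 2
        self.shards[i:i + 1] = [shard[:mid], shard[mid:]]
        self.starts.insert(i + 1, shard[mid])

    def scan(self, prefix):
        """Ordered listing: start at the shard containing prefix, walk forward."""
        out = []
        for shard in self.shards[self._index(prefix):]:
            for key in shard[bisect.bisect_left(shard, prefix):]:
                if not key.startswith(prefix):
                    return out
                out.append(key)
        return out


class CompositeTable:
    """Hash on bucket name picks a group; range partitioning orders keys inside it."""

    def __init__(self, hash_slots, max_keys_per_shard):
        self.groups = [RangeGroup(max_keys_per_shard) for _ in range(hash_slots)]

    def _group(self, bucket):
        digest = hashlib.sha256(bucket.encode("utf-8")).digest()
        return self.groups[int.from_bytes(digest[:8], "big") % len(self.groups)]

    def put(self, bucket, object_key):
        self._group(bucket).put(f"{bucket}/{object_key}")

    def list_objects(self, bucket, prefix=""):
        return self._group(bucket).scan(f"{bucket}/{prefix}")


table = CompositeTable(hash_slots=4, max_keys_per_shard=4)
for i in range(10):
    table.put("photos", f"img/{i:03d}.jpg")
group = table._group("photos")
print(len(group.shards), "shards with sizes", [len(s) for s in group.shards])
print(table.list_objects("photos", "img/00")[:3])
```

Because the shard, not the bucket, is the unit that would be placed on a node, a large bucket can spread over many machines while its keys stay globally ordered for listing.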

4. Enabling Product Feature Upgrades

The new metadata system powers two major BOS enhancements:

Hierarchical Namespace: Moves from a flat namespace to a true directory tree, allowing folder moves in a single RPC with millisecond latency and supporting efficient lookups (≈2 ms) via batch queries and caching.

Intelligent Lifecycle: Introduces six storage tiers with automatic demotion based on access patterns. Offline processing merges massive access logs to compute last‑access timestamps, enabling cost‑effective tiering without per‑request metadata writes.
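For the hierarchical namespace, the reason a folder move can be a single small operation is easiest to see in a toy model. The sketch below assumes directory entries are stored as (parent ID, name) pairs pointing to a child ID; the article does not describe the actual schema, and `Namespace` and its methods are hypothetical.

```python
import itertools

ROOT_ID = 0


class Namespace:
    """Toy directory tree stored as (parent_id, name) -> inode_id entries."""

    def __init__(self):
        self._entries = {}
        self._ids = itertools.count(ROOT_ID + 1)

    def _resolve(self, path):
        """Walk the path one component at a time; KeyError if a part is missing."""
        node = ROOT_ID
        for name in filter(None, path.split("/")):
            node = self._entries[(node, name)]
        return node

    def create(self, path):
        """Create the path and any missing parents; return the final inode ID."""
        parent = ROOT_ID
        for name in filter(None, path.split("/")):
            key = (parent, name)
            if key not in self._entries:
                self._entries[key] = next(self._ids)
            parent = self._entries[key]
        return parent

    def move(self, src, dst_parent):
        """Re-parent src under dst_parent by rewriting exactly one entry.

        This sketch does not guard against moving a directory into its
        own subtree.
        """
        src_parent_path, _, name = src.rstrip("/").rpartition("/")
        old_key = (self._resolve(src_parent_path), name)
        new_key = (self._resolve(dst_parent), name)
        if new_key in self._entries:
            raise FileExistsError(f"{dst_parent}/{name}")
        self._entries[new_key] = self._entries.pop(old_key)

    def exists(self, path):
        try:
            self._resolve(path)
            return True
        except KeyError:
            return False


ns = Namespace()
ns.create("/data/raw/2024/a.csv")
ns.create("/data/raw/2024/b.csv")
ns.create("/archive")
entries_before = len(ns._entries)

ns.move("/data/raw", "/archive")
print(ns.exists("/archive/raw/2024/a.csv"), ns.exists("/data/raw"))
print(entries_before == len(ns._entries))
```

In a flat namespace the same move would rewrite every key under the prefix. The cost here shifts to path resolution, which walks one level per lookup; that per‑level hop is the kind of work that batch queries and caching can reduce.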

These upgrades significantly improve performance, reduce storage costs, and expand BOS capabilities for big data, AI training, and archival workloads.
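The offline step described for Intelligent Lifecycle can be sketched as a merge over access‑log batches followed by a rule lookup. Everything specific here is an assumption for illustration: the log record shape, the idle‑day thresholds, and the function names are hypothetical, and only four of the six tiers are modeled.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical demotion rules, coldest first: (minimum idle days, target tier).
DEMOTION_RULES = [(180, "archive"), (90, "cold"), (30, "low-frequency")]


def last_access_times(log_batches):
    """Merge batches of (object_key, accessed_at) into the newest time per key."""
    latest = {}
    for batch in log_batches:
        for object_key, accessed_at in batch:
            if object_key not in latest or accessed_at > latest[object_key]:
                latest[object_key] = accessed_at
    return latest


def target_tier(last_access, now):
    """Pick the coldest tier whose idle threshold has been reached."""
    idle_days = (now - last_access).days
    for min_idle_days, tier in DEMOTION_RULES:
        if idle_days >= min_idle_days:
            return tier
    return "standard"


now = datetime(2024, 7, 1, tzinfo=timezone.utc)
log_batches = [
    [("a.bin", now - timedelta(days=200)), ("b.bin", now - timedelta(days=45))],
    [("a.bin", now - timedelta(days=2)), ("c.bin", now - timedelta(days=100))],
]

# Objects that never appear in the logs are not handled in this sketch.
latest = last_access_times(log_batches)
plan = {key: target_tier(ts, now) for key, ts in latest.items()}
for key in sorted(plan):
    print(key, "->", plan[key])
```

Because the last‑access time is derived from logs in batch, a read request never has to update the object's metadata, which is the saving the article points to.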

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
