How Mantle Redefined Cloud Object Storage Metadata for Billion‑File Scale
This article recounts how Baidu's storage team tackled the performance and scalability limits of traditional object storage by redesigning metadata handling with the Mantle and MantleX architectures, introducing a centralized IndexNode, strong consistency, delta‑record writes, and a seamless single‑node to distributed transition for massive file systems.
Background
In the era of AI and big data, object storage faces three core challenges compared with HDFS: high directory-listing cost, expensive rename operations, and poor performance on metadata-intensive workloads caused by the loss of metadata locality.
Industry Exploration
Three generations of distributed namespace designs were examined: single‑node metadata (HDFS), subtree partitioning (CephFS), and directory‑level partitioning (Tectonic). Each improves scalability at the expense of locality, leading to severe performance penalties in cloud object storage.
Mantle Core Architecture Evolution
The team introduced a centralized IndexNode that stores all directory metadata, paired with a scalable MetaStore built on TafDB. Synchronous two‑phase‑commit updates guarantee strong consistency, allowing read‑only operations to be offloaded to MetaStore while keeping path resolution a single RPC.
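Because the IndexNode holds all directory metadata in one place, resolving a path never requires hopping between shards. A minimal sketch of this idea follows; the names (`IndexNode`, `DirEntry`, `resolve_path`) are illustrative, not Mantle's actual API.

```python
from dataclasses import dataclass

@dataclass
class DirEntry:
    dir_id: int     # directory id; used as the parent id for child lookups
    parent_id: int

class IndexNode:
    """Toy centralized directory index: every entry lives in one process,
    so a full path resolves in what the client sees as a single RPC."""

    def __init__(self):
        self.entries = {}   # (parent_id, name) -> DirEntry; 0 is the root
        self.next_id = 1

    def mkdir(self, parent_id: int, name: str) -> DirEntry:
        entry = DirEntry(dir_id=self.next_id, parent_id=parent_id)
        self.next_id += 1
        self.entries[(parent_id, name)] = entry
        return entry

    def resolve_path(self, path: str) -> int:
        """Walk every component in-process instead of one RPC per level."""
        cur = 0  # root directory id
        for comp in filter(None, path.split("/")):
            entry = self.entries.get((cur, comp))
            if entry is None:
                raise FileNotFoundError(path)
            cur = entry.dir_id
        return cur

idx = IndexNode()
a = idx.mkdir(0, "a")
b = idx.mkdir(a.dir_id, "b")
print(idx.resolve_path("/a/b"))  # -> b.dir_id
```

In the real system the resolved directory id would then key a metadata read that can be served by MetaStore replicas, keeping the IndexNode on the critical path only for the lookup itself.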
Key Design Decisions
Reject subtree partitioning due to hotspot management complexity and incompatibility with existing TafDB.
Accept the trade‑off that directory modifications (<10% of operations) incur a 2PC cost to gain fast path lookups.
Optimization 1 – IndexNode Performance
A hybrid storage model caches the stable top‑level directory paths in memory (TopDirPathCache) while keeping the rest in a PID+Name key‑value store. An in‑memory MVCC layer retains recent versions for five seconds, reducing read amplification. Follower reads distribute lookup load across Raft replicas.
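The two-tier lookup can be sketched as follows. This is a simplified illustration assuming the article's `TopDirPathCache` name; the MVCC bookkeeping and miss path are stand-ins, not Mantle's implementation.

```python
import time

MVCC_RETENTION_S = 5.0  # recent versions kept this long for snapshot reads

class HybridDirStore:
    def __init__(self):
        self.top_dir_path_cache = {}  # full path -> dir_id (stable top levels)
        self.kv = {}                  # (parent_id, name) -> dir_id
        self.versions = []            # (timestamp, key, old_value) version log

    def put(self, parent_id, name, dir_id):
        key = (parent_id, name)
        # Record the prior version, then drop anything past the 5 s window.
        self.versions.append((time.monotonic(), key, self.kv.get(key)))
        self.kv[key] = dir_id
        cutoff = time.monotonic() - MVCC_RETENTION_S
        self.versions = [v for v in self.versions if v[0] >= cutoff]

    def lookup(self, path):
        # Fast path: stable top-level directories served straight from memory.
        if path in self.top_dir_path_cache:
            return self.top_dir_path_cache[path]
        # Slow path: walk (parent_id, name) keys in the key-value store.
        cur = 0
        for comp in filter(None, path.split("/")):
            cur = self.kv[(cur, comp)]
        return cur

store = HybridDirStore()
store.put(0, "data", 1)
store.top_dir_path_cache["/data"] = 1
store.put(1, "logs", 2)
print(store.lookup("/data/logs"))  # -> 2
```

Caching only the stable top levels keeps the hot prefix of most paths memory-resident without having to invalidate the cache on every deep-tree mutation.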
Optimization 2 – Delta Record Mechanism
To eliminate write‑conflict bottlenecks, Mantle replaces in‑place updates of directory attributes with append‑only delta records (e.g., +1, -1). A background thread merges deltas into the base record, turning thousands of conflicting 2PC writes into lock‑free appends and boosting concurrent mkdir/rmdir throughput by over 100×.
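The core trick is that writers never read-modify-write the base record, so they never conflict. A minimal sketch of the pattern, with illustrative names (`DeltaCounter`, `merge`) rather than Mantle's actual interfaces:

```python
import threading
from collections import defaultdict

class DeltaCounter:
    """Append-only deltas (+1/-1) on a directory's child count,
    folded into the base record by a background merger."""

    def __init__(self):
        self.base = defaultdict(int)     # dir_id -> merged child count
        self.deltas = defaultdict(list)  # dir_id -> pending delta records
        self.lock = threading.Lock()     # guards the append, not a 2PC

    def apply(self, dir_id, delta):
        # No read of the base record: concurrent mkdir/rmdir just append.
        with self.lock:
            self.deltas[dir_id].append(delta)

    def merge(self):
        # Background thread periodically folds deltas into the base record.
        with self.lock:
            for dir_id, pending in self.deltas.items():
                self.base[dir_id] += sum(pending)
            self.deltas.clear()

    def read(self, dir_id):
        # Readers see base plus any not-yet-merged deltas.
        with self.lock:
            return self.base[dir_id] + sum(self.deltas[dir_id])

c = DeltaCounter()
threads = [threading.Thread(target=c.apply, args=(1, +1)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
c.merge()
print(c.read(1))  # -> 100
```

In the real system each delta is a durable record in the store rather than an in-memory list entry, but the shape is the same: contention moves from a single hot row to an append stream that merges asynchronously.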
MantleX – Scale‑Adaptive Architecture
For workloads where total metadata fits on a single node, MantleX stores both Index and Meta tables in the same physical tablet, allowing 1PC transactions and leveraging TafDB coprocessors to collapse multi‑step operations into a single RPC. When the dataset grows, the tablet can be split, seamlessly transitioning to the distributed Mantle mode without downtime.
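The commit-path choice can be sketched as a simple co-location check; the tablet routing below is invented for illustration and is not TafDB's API.

```python
class Tablet:
    def __init__(self, tablet_id: int):
        self.tablet_id = tablet_id

def commit(index_tablet: Tablet, meta_tablet: Tablet) -> str:
    """Pick the commit protocol based on where the two tables live."""
    if index_tablet.tablet_id == meta_tablet.tablet_id:
        # Single-node mode: Index and Meta rows are co-located,
        # so the mutation commits as one local (1PC) transaction.
        return "1PC"
    # After a tablet split the rows land on different tablets and the
    # same operation falls back to a coordinated two-phase commit.
    return "2PC"

shared = Tablet(7)
print(commit(shared, shared))        # -> 1PC
print(commit(Tablet(7), Tablet(8)))  # -> 2PC
```

The point of the design is that this switch is transparent to clients: the same logical operation degrades from 1PC to 2PC only when growth forces a split.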
Future Directions
Replace RocksDB with a lightweight hash engine for pure directory lookups.
Introduce a DSL sandbox for safe, hot‑updatable directory logic.
Scale out IndexNode horizontally and continue refining delta‑record merging.
Acknowledgments
The authors thank the Baidu Canghai Storage team, collaborators from USTC and Tsinghua, and product‑market colleagues for their support in turning this research into a production‑grade storage solution.