Design and Challenges of TafDB: A Scalable Metadata Storage Engine for Cloud Data Lakes
TafDB, Baidu’s Spanner‑like distributed transaction database built on RocksDB and Multi‑Raft, provides a virtually unlimited metadata layer for cloud data lakes. By unifying hierarchical and flat namespaces, minimizing cross‑shard transaction overhead, managing LSM‑tree garbage collection, and employing a distributed clock, it delivers trillion‑scale metadata capacity and tens of millions of QPS at low latency.
Massive data growth imposes extremely high requirements on the scalability of data lake storage. The metadata layer, as one of the core components of cloud storage, directly determines the extensibility of the whole system.
This article, the second in a series about data lakes, reveals the secrets of the metadata storage foundation and explains how to design a "virtually unlimited" capacity storage layer.
With the rapid development of mobile Internet, IoT, and AI computing, global data volume is expected to increase from 33 ZB in 2018 to 175 ZB by 2025, challenging the scalability of cloud storage systems.
Cloud storage consists of a data plane (user data) and a metadata plane (meta information). As user data and access volume grow, the number of metadata entries and QPS increase, making the scalability of the metadata plane a bottleneck for the entire system.
TafDB, the unified metadata foundation of Baidu Canghai·Storage, supports Baidu Intelligent Cloud Object Storage (BOS) and File System (CFS), providing trillion‑scale metadata capacity and tens of millions of QPS, meeting the scalability and performance demands of massive data lakes.
1. Evolution of Metadata Plane Technology
Metadata for object and file storage is essentially a Namespace, which can be hierarchical or flat.
File storage requires a hierarchical Namespace to support directory‑tree semantics, while object storage traditionally uses a flat Namespace because objects lack folder concepts. Recently, object storage has begun to support hierarchical Namespace to be compatible with HDFS semantics.
2. Hierarchical Namespace Evolution
Hierarchical Namespaces maintain file attributes and the directory tree, and must support operations such as create, lookup, delete, and rename.
Typical architectures include:
Single‑machine architecture: all directory trees reside in memory on one machine, offering low latency but limited to ~1 billion files (e.g., HDFS).
Sub‑tree partitioning: splits the directory tree into sub‑trees deployed on different meta nodes. This can cause hotspots and makes cross‑sub‑tree rename difficult (e.g., HDFS Federation, CephFS, IndexFS).
Distributed‑transaction database: a metadata semantics layer translates namespace operations into database transactions. Each inode maps to a row in a distributed DB, enabling unlimited file counts (e.g., Facebook Tectonic).
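The "each inode is a row" approach can be sketched as transactions over two logical tables: a dentry table keyed by (parent id, name) and an inode table keyed by inode id. The toy in-memory store and its transaction API below are illustrative stand-ins for a real distributed transaction database, not any specific system's schema.

```python
# Sketch: namespace operations translated into database transactions.
# A dentry row maps (parent_id, name) -> child inode id; an inode row
# maps inode id -> attributes. The KVStore/transaction API is hypothetical.
import itertools
from contextlib import contextmanager

class KVStore:
    """Toy transactional key-value store standing in for the real DB."""
    def __init__(self):
        self.data = {}

    @contextmanager
    def transaction(self):
        staged = {}
        yield _Txn(self.data, staged)
        self.data.update(staged)  # "commit": writes apply only on success

class _Txn:
    def __init__(self, base, staged):
        self.base, self.staged = base, staged
    def get(self, key):
        return self.staged.get(key, self.base.get(key))
    def put(self, key, value):
        self.staged[key] = value

_ids = itertools.count(2)  # inode 1 reserved for the root directory

def create_file(db, parent_id, name, attrs):
    """Insert a dentry row and an inode row in one transaction."""
    inode_id = next(_ids)
    with db.transaction() as txn:
        if txn.get(("dentry", parent_id, name)) is not None:
            raise FileExistsError(name)  # aborting skips the commit
        txn.put(("dentry", parent_id, name), inode_id)  # dentry row
        txn.put(("inode", inode_id), attrs)             # inode row
    return inode_id

def lookup(db, parent_id, name):
    """Resolve a name under a directory to its inode attributes."""
    with db.transaction() as txn:
        inode_id = txn.get(("dentry", parent_id, name))
        return None if inode_id is None else txn.get(("inode", inode_id))
```

Because every inode is just a row, the file count is bounded by the database's capacity rather than by any single node's memory.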
3. Flat Namespace Evolution
Flat Namespaces store the list of block locations for each object. Early solutions used database middleware, which suffered from poor scalability and weak cross‑database transaction support. Modern solutions rely on distributed transaction databases (e.g., AWS DynamoDB, Google Spanner) to overcome these limitations.
Using distributed transaction databases solves the scalability problem and supports both hierarchical and flat namespaces at trillion‑scale.
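A flat namespace over a sorted key space can be sketched as follows: each object maps to one row keyed by (bucket, object key), whose value records the object's block locations, and listing under a "directory" prefix is just a key-range scan. The in-memory sorted map below is a hypothetical stand-in for the distributed database.

```python
# Sketch of a flat namespace: one row per object, keyed by
# (bucket, object_key); prefix listing is a range scan over sorted keys.
from bisect import bisect_left, insort

class FlatNamespace:
    def __init__(self):
        self._keys = []     # sorted (bucket, object_key) keys
        self._rows = {}     # key -> list of block locations

    def put_object(self, bucket, key, blocks):
        k = (bucket, key)
        if k not in self._rows:
            insort(self._keys, k)   # keep keys globally ordered
        self._rows[k] = blocks

    def list_prefix(self, bucket, prefix):
        """Scan keys in the bucket starting at `prefix` until it no longer matches."""
        out = []
        i = bisect_left(self._keys, (bucket, prefix))
        while i < len(self._keys):
            b, key = self._keys[i]
            if b != bucket or not key.startswith(prefix):
                break
            out.append(key)
            i += 1
        return out
```

Global key ordering is what lets a distributed transaction database serve both flat listing (range scans) and trillion-scale capacity (range partitioning) at once.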
4. Technology Selection for the Metadata Foundation
After evaluating existing distributed transaction databases, we found that only Spanner’s architecture met our requirements for high performance, strong consistency, and automatic scaling. Consequently, Baidu Canghai·Storage decided to develop a Spanner‑like distributed transaction database as its own metadata foundation.
5. Baidu Intelligent Cloud Metadata Foundation – TafDB
TafDB is a distributed database designed for metadata scenarios, powering BOS and CFS with trillion‑scale metadata and tens of millions of QPS.
5.1 System Architecture
TafDB is built on RocksDB for single‑node storage and uses a Multi‑Raft protocol for replica consistency. Its components are:
BE (Backend): stores data in Tablets; each Tablet’s replicas form a Raft group for high availability.
Master: manages metadata such as partitioning, capacity, and balancing, also Raft‑based.
Proxy: stateless front‑end for SQL parsing, transaction coordination, and query planning.
TimeService: provides a global monotonic clock (being replaced by a new distributed clock solution).
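How these components fit together can be illustrated with a routing sketch: the stateless Proxy caches a range-partition map from the Master and forwards each key to the Raft leader of the Tablet owning that key range. The structure and names below are illustrative, not TafDB's actual interfaces.

```python
# Hypothetical sketch of Proxy-side routing: a range-partition map
# (cached from the Master) maps each key to its Tablet's Raft leader (a BE).
from bisect import bisect_right

class ShardMap:
    """Range partition map: sorted split points -> tablet leader address."""
    def __init__(self, splits, leaders):
        # splits[i] is the inclusive lower bound of tablet i's key range
        self.splits, self.leaders = splits, leaders

    def route(self, key):
        # Find the last split point <= key; that tablet owns the key.
        idx = bisect_right(self.splits, key) - 1
        return self.leaders[idx]
```

Because the Proxy holds no durable state, it can be scaled horizontally, and a stale map is simply refreshed from the Master on a routing miss.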
5.2 System Features
Full‑featured: global ordering, distributed transactions, secondary indexes, distributed queries, backup, CDC.
High performance: metadata read/write performance >2× that of open‑source alternatives.
Strong scalability: supports trillion‑scale metadata and exabyte‑scale storage per cluster.
6. Engineering Challenges
Building a feature‑complete distributed transaction database with high performance and scalability presents several challenges.
6.1 Reducing Distributed Transaction Overhead While Preserving ACID
Metadata operations often involve cross‑shard transactions, which incur costly two‑phase commits (2PC). TafDB mitigates this by:
Hierarchical Namespace: a custom split strategy keeps the metadata of all entries under the same parent directory on a single shard, allowing most operations to be single‑shard.
Flat Namespace: asynchronous secondary index writes reduce the need for cross‑shard commits.
These optimizations convert most 2PC transactions into single‑phase commits, eliminating most cross‑shard overhead.
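The co-location idea behind the hierarchical-namespace split strategy can be sketched with key encoding: prefixing each dentry key with its parent directory's id keeps all entries of one directory contiguous in the sorted key space, so creates, lookups, and listings under a directory usually touch one shard. The encoding below is illustrative, not TafDB's actual format.

```python
# Sketch: parent-id-prefixed keys co-locate a directory's entries, so
# range partitioning rarely splits a single directory across shards.
import struct

def encode_dentry_key(parent_id: int, name: str) -> bytes:
    # Big-endian parent id sorts all children of a directory together;
    # the name breaks ties within the directory.
    return struct.pack(">Q", parent_id) + name.encode("utf-8")

def shard_of(key: bytes, split_points: list) -> int:
    """Return the index of the range shard whose key range contains `key`."""
    shard = 0
    for i, split in enumerate(split_points):
        if key >= split:
            shard = i + 1
    return shard
```

With split points chosen on directory boundaries, an operation like create-under-directory reads and writes keys on one shard and commits in a single phase instead of via 2PC.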
6.2 Maintaining High‑Performance Writes While Preserving Range Query Performance
TafDB uses RocksDB (LSM‑tree) where deletions are represented as tombstones, leading to garbage data that can degrade range query performance.
Solutions include:
Scale‑up: multi‑level, feature‑aware GC that adapts strategies per shard and reduces unnecessary scans.
Scale‑out: distribute garbage across multiple RocksDB instances and shards, migrating hot shards when needed.
Flow control & feedback: monitor request latency and garbage volume, applying back‑pressure and targeted recovery.
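The feedback loop described above can be sketched as a per-shard policy: garbage statistics and scan latency together decide between doing nothing, compacting the shard, or also applying write back-pressure. The thresholds and structure here are hypothetical, not TafDB's actual tuning.

```python
# Illustrative sketch: per-shard stats drive targeted GC and back-pressure.
from dataclasses import dataclass

@dataclass
class ShardStats:
    live_keys: int
    tombstones: int
    p99_scan_ms: float

def gc_action(stats, tombstone_ratio_limit=0.3, latency_limit_ms=50.0):
    """Pick an action for one shard based on its garbage ratio and latency."""
    total = stats.live_keys + stats.tombstones
    ratio = stats.tombstones / total if total else 0.0
    if ratio > tombstone_ratio_limit and stats.p99_scan_ms > latency_limit_ms:
        return "throttle_writes_and_compact"   # back-pressure + recovery
    if ratio > tombstone_ratio_limit:
        return "compact"                       # targeted GC on this shard
    return "none"
```

Keeping the decision per-shard is what makes the GC "feature-aware": a write-heavy shard full of tombstones gets compacted early, while a read-mostly shard avoids unnecessary compaction I/O.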
6.3 Eliminating Single Points in the Data Flow for Extreme Scalability and Availability
Traditional global timestamp services (TSO) become bottlenecks at high QPS. TafDB adopts a distributed clock (TafDB Clock) where each storage node maintains a local clock; single‑shard transactions use the local clock, while cross‑shard transactions achieve causal ordering via lightweight broadcast.
This design removes the central clock bottleneck without adding significant overhead.
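One generic way to get causal ordering from per-node local clocks is a hybrid logical clock: each message carries the sender's timestamp, and the receiver advances its own clock past it. The sketch below illustrates that idea only; it is a textbook HLC, not TafDB Clock's actual protocol.

```python
# Minimal hybrid logical clock (HLC) sketch: local events use the local
# clock; merging a remote timestamp on cross-shard messages preserves
# causal order even when nodes' physical clocks disagree.
import time

class HybridClock:
    def __init__(self, now_ms=None):
        self._now_ms = now_ms or (lambda: int(time.time() * 1000))
        self.wall, self.logical = 0, 0

    def now(self):
        """Timestamp a local event (e.g. a single-shard transaction)."""
        pt = self._now_ms()
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def update(self, remote):
        """Merge a timestamp received with a cross-shard message."""
        pt = self._now_ms()
        old = self.wall
        self.wall = max(old, remote[0], pt)
        if self.wall == old and self.wall == remote[0]:
            self.logical = max(self.logical, remote[1]) + 1
        elif self.wall == old:
            self.logical += 1
        elif self.wall == remote[0]:
            self.logical = remote[1] + 1
        else:
            self.logical = 0
        return (self.wall, self.logical)
```

Because timestamps only need to be exchanged on cross-shard transactions, single-shard transactions pay no coordination cost, which is the property that removes the central TSO from the hot path.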
7. TafDB Application Results
The combined design and optimizations deliver a fully‑featured, high‑performance, and highly scalable metadata storage system that unifies Baidu Canghai·Storage’s metadata foundation.
7.1 File System Namespace Application
Baidu Canghai·File Storage (CFS) built on TafDB achieves linear scalability with write latency of 2 ms and read latency in the sub‑100 µs range, supporting both traditional workloads and AI‑driven massive‑file scenarios.
7.2 Object Storage Namespace Application
Baidu Canghai·Object Storage (BOS) using TafDB expands single‑bucket capacity from hundreds of billions to trillions of objects, reduces small‑file latency by 42 %, and greatly improves user experience for image upload/download.
Baidu Geek Talk
Follow us to discover more Baidu tech insights.