How Baidu’s TafDB Achieves Trillion‑Scale Metadata Storage with Near‑Zero Latency

This article explores the design and engineering of Baidu’s TafDB, a distributed metadata database that powers cloud object and file storage. It covers the architecture, namespace evolution, transaction optimizations, garbage collection strategies, and clock mechanisms that enable trillion‑scale metadata and tens of millions of QPS.

Baidu Intelligent Cloud Tech Hub

Sep 23, 2022

1. Metadata Plane Technology Evolution

Object storage and file storage metadata planes are essentially namespaces, divided into hierarchical and flat namespaces. Hierarchical namespaces support directory‑tree semantics for file systems, while flat namespaces store object block location lists for object storage.

2. Hierarchical Namespace Evolution

Hierarchical namespaces maintain file attributes and directory structures, supporting operations such as create, lookup, delete, and rename. Industry solutions include:

Single‑node architecture: the entire directory tree resides in memory on one node. Latency is low, but capacity is limited to roughly 1 billion files (e.g., HDFS).

Sub‑tree partitioning: the directory tree is split into sub‑trees placed on different metadata nodes. This approach suffers from hotspots and restricts renames that cross sub‑trees (e.g., HDFS Federation, CephFS, IndexFS).

Distributed‑transaction database: a semantic layer translates namespace operations into database transactions, and each inode maps to a database row, so capacity scales with the underlying database (e.g., Facebook Tectonic).
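
As an illustration of the third approach, the sketch below maps each inode to one row keyed by (parent_id, name) and resolves a path with one row read per component. The class name, the key layout, and ROOT_ID are assumptions made for this example; the article does not give TafDB's or Tectonic's actual schema.

```python
from typing import Dict, Optional, Tuple

ROOT_ID = 1  # hypothetical inode id of "/"


class NamespaceTable:
    """Toy stand-in for a database table with primary key (parent_id, name)."""

    def __init__(self) -> None:
        self.rows: Dict[Tuple[int, str], dict] = {}
        self.next_id = ROOT_ID + 1

    def create(self, parent_id: int, name: str, is_dir: bool) -> int:
        """Insert one row for the new inode and return its id."""
        key = (parent_id, name)
        if key in self.rows:
            raise FileExistsError(name)
        inode_id = self.next_id
        self.next_id += 1
        self.rows[key] = {"id": inode_id, "is_dir": is_dir}
        return inode_id

    def lookup(self, path: str) -> Optional[int]:
        """Resolve a path one component at a time (one row read each)."""
        current = ROOT_ID
        for name in filter(None, path.split("/")):
            row = self.rows.get((current, name))
            if row is None:
                return None
            current = row["id"]
        return current

    def rename(self, src_parent: int, src_name: str,
               dst_parent: int, dst_name: str) -> None:
        """Move one row; a real database would wrap both steps in a transaction."""
        if (dst_parent, dst_name) in self.rows:
            raise FileExistsError(dst_name)
        row = self.rows.pop((src_parent, src_name))
        self.rows[(dst_parent, dst_name)] = row
```

Because children refer to their parent by inode id rather than by path, renaming a directory rewrites a single row and leaves every descendant row untouched.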

3. Flat Namespace Evolution

Flat namespaces store lists of block locations for each object. Early designs used database middleware, which limited scalability and cross‑database transaction support. Modern solutions rely on distributed‑transaction databases such as AWS DynamoDB and Google Spanner.

4. Metadata Plane Technology Selection

An evaluation of distributed‑transaction databases showed that Spanner’s architecture best meets the requirements for high performance, strong consistency, and scalability. The other candidates (Calvin, FoundationDB) either lack needed features or introduce additional latency.

5. Baidu Cloud’s Metadata Store TafDB

TafDB is a distributed database designed for metadata workloads, powering Baidu Intelligent Cloud Object Storage (BOS) and File Storage (CFS) with trillion‑scale metadata and tens of millions of QPS.

5.1 System Architecture

TafDB builds on RocksDB for local storage and uses a Multi‑Raft protocol for replica consistency. Components:

BE: stores data in tablets; tablets form Raft groups for high availability.

Master: manages cluster metadata such as partitioning and capacity; it is also replicated with Raft.

Proxy: stateless front‑end handling SQL parsing, transaction coordination, and query planning.

TimeService: provides a globally monotonic clock; it is gradually being replaced by a distributed clock solution.
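
To make the division of labor concrete, here is a minimal sketch of how a stateless Proxy might route a key to a tablet using a partition map obtained from the Master. Range partitioning by sorted split keys is an assumption borrowed from Spanner-style systems; the PartitionMap class and the tablet names are hypothetical and not part of TafDB's published interface.

```python
import bisect
from typing import List


class PartitionMap:
    """Hypothetical range-partition table: tablet i owns keys in
    [split_keys[i-1], split_keys[i]), with open ends at both extremes."""

    def __init__(self, split_keys: List[str], tablets: List[str]) -> None:
        if split_keys != sorted(split_keys):
            raise ValueError("split keys must be sorted")
        if len(tablets) != len(split_keys) + 1:
            raise ValueError("need exactly one more tablet than split keys")
        self.split_keys = split_keys
        self.tablets = tablets

    def tablet_for(self, key: str) -> str:
        """Binary-search the split keys to find the owning tablet."""
        return self.tablets[bisect.bisect_right(self.split_keys, key)]
```

Under this assumption the Proxy needs only a cached copy of the map to route requests, which is consistent with its being described as stateless.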

5.2 System Features

Full‑featured: global ordering, distributed transactions, secondary indexes, distributed queries, backup, CDC.

High performance: 2× faster than open‑source alternatives for metadata read/write.

Extreme scalability: supports trillion‑scale metadata and exabyte‑scale clusters.

6. Engineering Challenges

Building a feature‑complete distributed transaction database with high performance and scalability presents three main challenges.

6.1 Reducing Distributed Transaction Overhead while Preserving ACID

Cross‑shard transactions incur costly two‑phase commit. TafDB mitigates this by:

Hierarchical namespace: custom split strategy keeps directory metadata on the same shard; business schema adjustments ensure operations stay within a single shard.

Flat namespace: asynchronous secondary‑index writes move most writes to a single shard, eliminating the need for cross‑shard commits.

These optimizations convert most two‑phase commits to one‑phase commits.
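
The effect of the split strategy can be sketched as a commit-protocol choice based on how many shards a transaction touches. The example assumes rows are routed by parent directory id and that a directory's attribute record is stored under the directory's own id, so that creating a file and updating its parent's attributes land on one shard. Both the routing function and the "/attr" record name are hypothetical illustrations, not TafDB's documented schema.

```python
from typing import Iterable, Tuple

NUM_SHARDS = 4  # hypothetical cluster size


def shard_of(parent_id: int) -> int:
    """Route a row by its parent directory id so siblings share a shard."""
    return parent_id % NUM_SHARDS


def commit_protocol(write_keys: Iterable[Tuple[int, str]]) -> str:
    """Return "1PC" when every written row lives on one shard, else "2PC"."""
    shards = {shard_of(parent_id) for parent_id, _ in write_keys}
    return "1PC" if len(shards) <= 1 else "2PC"
```

For example, commit_protocol([(7, "a.txt"), (7, "/attr")]) touches a single shard and can commit in one phase, whereas a rename between directories 7 and 8 spans two shards and still requires two-phase commit.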

6.2 Maintaining High Write Performance while Supporting Range Queries

RocksDB’s LSM‑tree uses tombstones for deletions, leading to garbage data that degrades range‑query performance. TafDB addresses this by:

Scale‑up: multi‑level, feature‑aware GC that tailors strategies per shard and reduces scanning overhead.

Scale‑out: disperses garbage across multiple shards and RocksDB instances, migrating hot shards when needed.

Flow control & feedback: throttles high‑cost list requests and triggers targeted GC when scan latency spikes.
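
The tombstone problem and the feedback idea can be illustrated with a toy shard. In this sketch a delete only writes a marker, a list request must step over those markers, and a compaction is triggered when too much of a scan was wasted. The wasted-scan ratio stands in for the scan-latency signal mentioned above, and the Shard class and its threshold are assumptions for illustration only, not RocksDB or TafDB APIs.

```python
from typing import Dict, List, Optional

TOMBSTONE = None  # marker value standing in for an LSM-tree tombstone


class Shard:
    def __init__(self, gc_ratio: float = 0.5) -> None:
        self.entries: Dict[str, Optional[str]] = {}
        self.gc_ratio = gc_ratio  # hypothetical trigger threshold
        self.gc_runs = 0

    def put(self, key: str, value: str) -> None:
        self.entries[key] = value

    def delete(self, key: str) -> None:
        # Deletion writes a marker; the entry is not physically removed.
        self.entries[key] = TOMBSTONE

    def list_prefix(self, prefix: str, limit: int) -> List[str]:
        """Range scan that must skip tombstones to collect live keys."""
        live: List[str] = []
        scanned = 0
        for key in sorted(self.entries):
            if not key.startswith(prefix):
                continue
            scanned += 1
            if self.entries[key] is not TOMBSTONE:
                live.append(key)
                if len(live) == limit:
                    break
        wasted = scanned - len(live)
        # Feedback: if most of the scan hit garbage, clean this shard up.
        if scanned and wasted / scanned > self.gc_ratio:
            self.compact()
        return live

    def compact(self) -> None:
        """Targeted GC: physically drop tombstoned entries."""
        self.entries = {k: v for k, v in self.entries.items()
                        if v is not TOMBSTONE}
        self.gc_runs += 1
```

After the compaction the same list request scans only live entries, which is the behavior the targeted GC aims to restore.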

6.3 Eliminating Single Points in Data Flow for Extreme Scalability and Availability

Traditional global timestamp services (TSO) become bottlenecks at high QPS. TafDB adopts a distributed clock (TafDB Clock) where each node provides a local clock; cross‑shard transactions use broadcast to maintain causal order, removing the central clock bottleneck.
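
The article does not describe the TafDB Clock algorithm in detail, so the following sketch substitutes a plain Lamport-style logical clock to show the general idea: single-shard commits use only the local clock, while a cross-shard commit takes the maximum proposed timestamp and broadcasts it so that later commits on every participant are ordered after it. NodeClock and both commit functions are hypothetical names.

```python
from typing import List


class NodeClock:
    """Per-node logical clock (Lamport-style), not TafDB's actual clock."""

    def __init__(self) -> None:
        self.now = 0

    def tick(self) -> int:
        """Advance the local clock and return a fresh timestamp."""
        self.now += 1
        return self.now

    def observe(self, ts: int) -> None:
        """Never let the local clock fall behind a timestamp it has seen."""
        self.now = max(self.now, ts)


def commit_single_shard(node: NodeClock) -> int:
    # No central service involved: the local clock is enough.
    return node.tick()


def commit_cross_shard(participants: List[NodeClock]) -> int:
    # Each participant proposes a timestamp; the commit uses the largest.
    commit_ts = max(node.tick() for node in participants)
    # Broadcast the result so later commits on any participant are larger.
    for node in participants:
        node.observe(commit_ts)
    return commit_ts
```

In this sketch no request ever contacts a central timestamp service, and only transactions that already span shards pay the cost of the extra broadcast.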

7. Impact and Applications

TafDB delivers a unified, high‑performance, and highly scalable metadata platform for Baidu’s storage products.

7.1 File System Namespace (CFS)

Provides linear scalability with 2 ms write latency and sub‑100 µs read latency, supporting both traditional workloads and AI‑driven massive‑file scenarios.

7.2 Object Storage Namespace (BOS)

Increases single‑bucket capacity from hundreds of billions to trillions of objects and reduces small‑file latency by 42%, which greatly improves the image upload and download experience.
