
How Baidu’s BaikalDB Redefined Distributed Storage for Massive Ad Platforms

This article analyzes the evolution of Baidu's advertising data storage, detailing the business-driven requirements, the design and development of the BaikalDB distributed database, its architecture across storage, compute and scheduling layers, key features such as Raft replication and multi‑index support, and the lessons learned for building cloud‑native, high‑performance databases.

Baidu Geek Talk

Aug 4, 2021

1. Business Requirements for Data Storage

Commercial advertising systems need a storage layer that is highly reliable, cost‑effective, and delivers millisecond‑level read/write performance while supporting diverse workloads such as OLTP transactions, OLAP analytics, high‑QPS queries, KV lookups, and fuzzy searches.

To meet these demands, traditional stacks would require a combination of MySQL, Redis, an OLAP warehouse like Palo/Doris, and a full‑text engine such as Elasticsearch, leading to complex operations and high resource duplication.

Key Storage Demands

Transactional (OLTP) workloads for ad delivery and bidding.

Analytical (OLAP) workloads for performance reporting.

High‑QPS key‑value queries (e.g., account structures, permissions).

Exact‑match KV lookups for keyword‑id mapping.

Fuzzy searches for material lists.

Desired System Qualities

Stability and high availability.

Strong data consistency.

Low total cost of ownership.

Fast, millisecond‑scale response times.

2. BaikalDB Development History

The advertising inventory (ad library) originally ran on a single‑machine MySQL deployment, then migrated to a sharded MySQL setup with up to 33 partitions, each with 1 master and 11 replicas, storing tens of terabytes and serving billions of page views (PVs) per day. As data grew, each further split became increasingly costly and disruptive.

Inspired by cloud‑native databases such as Aurora, PolarDB, and TiDB, Baidu evaluated three paths in 2017:

Deeply customize MySQL on a distributed file system (e.g., Aurora‑style).

Adopt an external distributed database (e.g., TiDB, CockroachDB), an option that was still immature at the time.

Build a new HTAP (Hybrid Transactional/Analytical Processing) system from scratch.

Given the team’s expertise (C++ developers, SQL‑proxy experience, RocksDB knowledge) and the availability of Baidu’s internal RPC (brpc) and consensus (braft) frameworks, the decision was to create BaikalDB, a MySQL‑compatible, cloud‑native distributed database.

Core Goals of BaikalDB

Flexible cloud deployment (container‑friendly, linear scaling).

One‑stop storage‑compute capabilities (OLTP, OLAP, full‑text, high‑performance KV).

MySQL protocol compatibility for low learning curve.

3. Key Design and Practices

3.1 Storage Layer

BaikalDB uses RocksDB as the underlying disk‑based KV engine. Data is partitioned into Regions, the smallest unit of management. Each table is split into multiple Regions, which are distributed across nodes. BaikalDB adopts range‑based sharding, which simplifies Region splits and helps avoid hotspots.
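To make range‑based sharding concrete, here is a minimal sketch of how a key could be routed to its Region. The `Region` and `RegionRouter` names, the start‑key layout, and the store addresses are assumptions made for illustration and are not taken from BaikalDB's code.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Region:
    region_id: int
    start_key: bytes  # inclusive lower bound; b"" means "from the beginning"
    node: str         # hypothetical store address


class RegionRouter:
    """Maps a key to the Region whose key range covers it."""

    def __init__(self, regions):
        # Keep Regions sorted by start key so a lookup is a binary search.
        self.regions = sorted(regions, key=lambda r: r.start_key)
        self.starts = [r.start_key for r in self.regions]

    def locate(self, key: bytes) -> Region:
        # bisect_right gives the position just past the last start_key <= key.
        pos = bisect.bisect_right(self.starts, key) - 1
        return self.regions[pos]


# Three Regions covering [b"", b"m"), [b"m", b"t"), and [b"t", +inf).
router = RegionRouter([
    Region(1, b"", "store-a"),
    Region(2, b"m", "store-b"),
    Region(3, b"t", "store-c"),
])
```

Because each Region owns a contiguous key range, splitting one only requires choosing a new boundary key inside that range; the other Regions are unaffected.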

Key‑value mapping includes:

Primary index: stores the full row (protobuf‑encoded) keyed by region_id, index_id, and the clustered primary key.

Local secondary index: resides on the same node as the primary data, enabling fast lookups without distributed transactions.

Global secondary index: a separate table that requires distributed transaction support; added later after Raft‑based replication was stable.

Full‑text index: inverted list keyed by region_id, index_id, and token, with values pointing to primary keys.
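The key layouts described above can be sketched as byte strings. This is only an illustration: the 8‑byte big‑endian field widths, the function names, and the way the primary key is appended to secondary keys are assumptions, not BaikalDB's actual encoding.

```python
import struct


def encode_primary_key(region_id: int, index_id: int, pk: int) -> bytes:
    # Big-endian fixed-width unsigned integers compare bytewise in numeric
    # order, so rows sort by (region_id, index_id, pk) in the KV engine.
    return struct.pack(">QQQ", region_id, index_id, pk)


def encode_secondary_key(region_id: int, index_id: int,
                         indexed_value: bytes, pk: int) -> bytes:
    # Appending the primary key keeps entries distinct when several rows
    # share the same indexed value. A production encoder would also need an
    # order-preserving, self-delimiting encoding for variable-length values.
    prefix = struct.pack(">QQ", region_id, index_id)
    return prefix + indexed_value + struct.pack(">Q", pk)


def encode_fulltext_key(region_id: int, index_id: int, token: str) -> bytes:
    # One key per token; the stored value would list matching primary keys.
    return struct.pack(">QQ", region_id, index_id) + token.encode("utf-8")
```

With this kind of layout, all entries of one index inside one Region share a common prefix, so an index scan becomes a prefix range scan over the KV engine.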

Replication is achieved via a three‑node Raft group per Region, ensuring strong consistency. Leader election, log replication, and membership changes are handled by Baidu’s braft library. Reads and writes are directed to the Raft leader for strict consistency.
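The majority rule behind Raft replication can be shown with a small sketch. It does not use braft's API; the function name and the list of per‑peer match indexes are hypothetical, and the check is simplified (real Raft also requires the entry at the commit index to belong to the leader's current term).

```python
from typing import Sequence


def majority_commit_index(match_index: Sequence[int]) -> int:
    """Return the highest log index stored on a majority of peers.

    match_index holds, for every peer in the Raft group (leader included),
    the highest log index known to be replicated on that peer.
    """
    ordered = sorted(match_index, reverse=True)
    quorum = len(match_index) // 2 + 1
    # The quorum-th largest value is present on at least `quorum` peers.
    return ordered[quorum - 1]
```

For a three‑node group with match indexes 7, 5, and 3, the quorum is two peers, so entries up to index 5 are committed even though one follower lags behind. This is why a three‑replica Region can tolerate the loss of one node.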

3.2 Compute Layer

SQL statements are parsed and transformed into a distributed execution plan using a volcano‑style operator model (open/next/close). The planner combines rule‑based (RBO) and cost‑based (CBO) optimization, leveraging statistics and a cost model to choose the lowest‑cost plan. Operators can push down filters to BaikalStore, reducing data transfer.
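As a rough illustration of the volcano model and filter pushdown, the sketch below builds two equivalent plans from operators that expose open/next/close. The class names, the in‑memory row format, and the `emitted` counter are assumptions for this example; they do not mirror BaikalDB's operators.

```python
class Scan:
    """Leaf operator; an optional pushed-down filter runs at the source."""

    def __init__(self, rows, pushed_filter=None):
        self.rows = rows
        self.pushed_filter = pushed_filter
        self.emitted = 0  # rows handed to the parent operator

    def open(self):
        self._it = iter(self.rows)

    def next(self):
        for row in self._it:
            if self.pushed_filter is None or self.pushed_filter(row):
                self.emitted += 1
                return row
        return None  # end of stream

    def close(self):
        self._it = None


class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def open(self):
        self.child.open()

    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

    def close(self):
        self.child.close()


class Project:
    def __init__(self, child, columns):
        self.child, self.columns = child, columns

    def open(self):
        self.child.open()

    def next(self):
        row = self.child.next()
        return None if row is None else {c: row[c] for c in self.columns}

    def close(self):
        self.child.close()


def run(root):
    """Pull rows from the root operator until the stream is exhausted."""
    root.open()
    out = []
    while (row := root.next()) is not None:
        out.append(row)
    root.close()
    return out


ROWS = [{"id": 1, "budget": 50}, {"id": 2, "budget": 500}, {"id": 3, "budget": 900}]


def big_budget(row):
    return row["budget"] > 100


# Plan A filters above the scan; plan B pushes the filter into the scan.
scan_a = Scan(ROWS)
result_a = run(Project(Filter(scan_a, big_budget), ["id"]))
scan_b = Scan(ROWS, pushed_filter=big_budget)
result_b = run(Project(scan_b, ["id"]))
```

Both plans return the same rows, but the scan in plan B hands fewer rows to its parent. In a distributed setting that difference corresponds to rows that never leave the storage node.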

While not a full MPP engine, BaikalDB performs partial aggregation on storage nodes and final aggregation on a BaikalDB node, which is sufficient for OLTP‑centric workloads with limited result sizes.
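The two‑phase aggregation described here can be sketched as follows, assuming each storage node returns a per‑group (sum, count) pair and the compute node merges them. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Runs on a storage node: reduce local rows to per-group [sum, count]."""
    acc = defaultdict(lambda: [0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return dict(acc)


def final_aggregate(partials):
    """Runs on the compute node: merge partial states, then derive AVG."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            total[key][0] += s
            total[key][1] += c
    return {k: {"sum": s, "count": c, "avg": s / c} for k, (s, c) in total.items()}


# Hypothetical (campaign, clicks) rows held by two storage nodes.
store_a = [("campaign_1", 10), ("campaign_2", 4)]
store_b = [("campaign_1", 30)]
report = final_aggregate([partial_aggregate(store_a), partial_aggregate(store_b)])
```

AVG is computed from the merged sum and count rather than by averaging per‑node averages, which would give wrong results when nodes hold different numbers of rows per group.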

3.3 Scheduling Layer

BaikalMeta acts as the master scheduler. Stores send periodic heartbeats; Meta evaluates leader distribution and replica (peer) balance, then issues rebalance commands. Leader balancing evens out read/write load, while peer balancing spreads replicas across machines and zones to improve fault tolerance.
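A simplified picture of leader balancing is a greedy loop that moves leaders from the busiest store to the idlest one until the counts differ by at most one. The function below is an assumption‑laden sketch, not BaikalMeta's algorithm: it ignores the constraint that a leader can only move to a store that already holds a replica of that Region, and the store names are made up.

```python
def plan_leader_transfers(leader_counts):
    """Return a list of (from_store, to_store) leader moves.

    leader_counts maps a store name to the number of Region leaders it
    reported in its latest heartbeat.
    """
    counts = dict(leader_counts)  # work on a copy
    moves = []
    while True:
        busiest = max(counts, key=counts.get)
        idlest = min(counts, key=counts.get)
        if counts[busiest] - counts[idlest] <= 1:
            break  # already as even as whole leaders allow
        counts[busiest] -= 1
        counts[idlest] += 1
        moves.append((busiest, idlest))
    return moves


moves = plan_leader_transfers({"store-a": 6, "store-b": 2, "store-c": 1})
```

Peer balancing could follow the same pattern with replica counts in place of leader counts, plus a check that replicas of one Region land on different machines or zones.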

A Region split is triggered when the Region's size exceeds a threshold; the new Regions are created via range‑based splitting and are automatically scheduled for balanced placement.
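One way to picture a size‑triggered split is to pick the key at which roughly half of the Region's bytes lie on each side. The function name, the per‑key size list, and the midpoint policy are assumptions for illustration; the source does not say how BaikalDB chooses its split key.

```python
def choose_split_key(keys_with_sizes, threshold_bytes):
    """Return the first key of the new right-hand Region, or None.

    keys_with_sizes is a list of (key, size_in_bytes) sorted by key.
    """
    total = sum(size for _, size in keys_with_sizes)
    if total <= threshold_bytes:
        return None  # Region is still small enough
    running = 0
    for i, (_, size) in enumerate(keys_with_sizes):
        running += size
        # Split after the entry that reaches the halfway point, provided
        # there is at least one key left for the right-hand Region.
        if running >= total / 2 and i + 1 < len(keys_with_sizes):
            return keys_with_sizes[i + 1][0]
    return None  # a single oversized key cannot be split by range


region = [(b"a", 40), (b"f", 40), (b"m", 40), (b"t", 40)]
split_key = choose_split_key(region, threshold_bytes=100)
```

In this example the Region holds 160 bytes against a threshold of 100, so it splits at `b"m"`, leaving 80 bytes on each side.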

4. Summary

Starting from a monolithic MySQL deployment, Baidu iteratively built BaikalDB to unify the storage systems behind its advertising platform, achieving PB‑scale capacity, high availability, and multi‑model query support with a relatively small engineering effort. The design rests on three pillars: storage (RocksDB + Region + Raft), compute (SQL + RBO/CBO), and scheduling (Meta‑driven balancing). Together they provide a practical reference for building cloud‑native, HTAP‑capable databases.
