How Baidu’s HTAP Table Storage Achieves Massive IO Gains and Faster Development
Baidu’s Search Content Storage team built an HTAP table storage system and a serverless compute‑scheduling architecture that separate OLTP and OLAP workloads, delivering up to 200 GB/s of peak I/O, cutting storage cost by roughly 75 %, and enabling SQL‑style task development with native FaaS functions.
The article describes Baidu’s HTAP (Hybrid Transactional/Analytical Processing) table storage system and its compute‑scheduling framework, built to satisfy massive OLAP workloads on petabyte‑scale data.
Project Background and Goals
Provide ultra‑high I/O access for petabyte‑scale storage (target 34 GB/s average, 200 GB/s peak).
Enable rapid development and deployment of data‑processing tasks.
Key Challenges
Storage‑I/O bottleneck in the original OLTP‑oriented Table engine.
Mismatch between OLTP‑focused storage layout and OLAP query patterns.
Need for a task‑development model that avoids large intermediate data footprints.
Overall Architecture
The solution separates compute from storage into two independent engines:
Neptune – an OLTP storage engine that continues to serve low‑latency transactional workloads.
Saturn – an OLAP storage engine designed with a serverless model for high‑throughput analytical queries.
Data synchronization between Neptune and Saturn is performed via hard‑link files, allowing fast, low‑cost versioned updates.
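As a rough illustration of why hard links make versioned synchronization cheap, the sketch below publishes a new version of a table by linking its immutable data files instead of copying them (the directory layout, file suffix, and function name are illustrative assumptions, not Neptune's or Saturn's actual interface):

    import os

    def publish_version(src_dir, dest_root, version):
        # Hard links add directory entries that point at the same on-disk blocks,
        # so publishing a new version costs metadata only, not a full data copy.
        dest_dir = os.path.join(dest_root, "v%d" % version)
        os.makedirs(dest_dir, exist_ok=True)
        for name in os.listdir(src_dir):
            if name.endswith(".sst"):  # immutable sorted files, never rewritten in place
                os.link(os.path.join(src_dir, name), os.path.join(dest_dir, name))
        return dest_dir

    # e.g. publish_version("/data/neptune/table_x", "/data/saturn/table_x", 42)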
Neptune (OLTP Engine)
Neptune supports four primitive operations—write, delete, read, and scan—each routed through a RegionMapper that abstracts partitioning.
Partition Types
Index partitions reduce random‑I/O amplification by keeping key‑to‑region mappings.
Data partitions contain multiple locality groups; each group stores rows with the same locality.
Write flow: Row data is serialized into RawData, a region index entry is generated, and both are committed atomically.
Delete flow: The key’s partition is resolved via the index, then the corresponding data entry is removed.
Read flow: The index is consulted; if the key exists, the matching data partition is fetched.
Scan flow: The client specifies target partitions and column families; the RegionMapper selects the appropriate physical handles for scanning (see the sketch after this list).
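A toy model of the four primitives routed through a RegionMapper, with index partitions holding key‑to‑partition mappings and data partitions holding the rows (the class, method names, and in‑memory dictionaries are illustrative; the real engine persists RawData and index entries durably and atomically):

    class RegionMapper:
        def __init__(self, num_partitions=4):
            self.index = {}                                   # index partitions: key -> data partition id
            self.partitions = [dict() for _ in range(num_partitions)]  # data partitions

        def _route(self, key):
            return hash(key) % len(self.partitions)

        def write(self, key, raw_data):
            pid = self._route(key)
            self.partitions[pid][key] = raw_data              # toy RawData: a dict of column -> value
            self.index[key] = pid                             # index entry, committed with the data

        def delete(self, key):
            pid = self.index.pop(key, None)                   # resolve the key's partition via the index
            if pid is not None:
                self.partitions[pid].pop(key, None)

        def read(self, key):
            pid = self.index.get(key)                         # consult the index first
            return None if pid is None else self.partitions[pid].get(key)

        def scan(self, partition_ids, columns):
            for pid in partition_ids:                         # caller names target partitions and column families
                for key, row in self.partitions[pid].items():
                    yield key, {c: row.get(c) for c in columns}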
Saturn (OLAP Engine)
Saturn is built on three layers:
Pluggable File System – supports Baidu’s AFS, local FS, or future file systems.
Table abstraction – a Table consists of multiple slices; each slice can follow either a hash-order or a global-order model.
SDK – provides Seek and Scan APIs that behave like typical column-store interfaces (a toy illustration follows below).
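To show what Seek and Scan over slices look like in the global-order model, here is a self-contained toy (a real slice is a sorted on-disk file rather than an in-memory list; all names here are illustrative):

    import bisect

    class Slice:
        # Toy global-order slice: rows are kept sorted by key, so Seek is a binary search.
        def __init__(self, rows):
            self.rows = sorted(rows, key=lambda kv: kv[0])   # list of (key, row) pairs
            self.keys = [k for k, _ in self.rows]

        def seek(self, key):
            i = bisect.bisect_left(self.keys, key)           # first entry with key >= requested key
            return self.rows[i] if i < len(self.rows) else None

        def scan(self, start, end):
            i = bisect.bisect_left(self.keys, start)
            while i < len(self.rows) and self.keys[i] < end:
                yield self.rows[i]
                i += 1

    class Table:
        # A Table is a collection of slices; a real global-order Table would merge
        # slices by key, while this toy simply visits each slice in turn.
        def __init__(self, slices):
            self.slices = slices

        def scan(self, start, end):
            for s in self.slices:
                yield from s.scan(start, end)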
Data ingestion supports two modes:
Full-copy build: a complete dump of a Table replaces each slice with a new version.
Incremental merge: snapshots of changed data are transformed into Saturn-specific SST files and ingested via an Ingest operation, while major compaction is deferred to avoid interference (see the sketch below).
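A simplified sketch of the two ingestion modes (the function names, paths, and versioning scheme are assumptions for illustration only):

    import os
    import shutil

    def full_copy_build(table_dir, dump_dir, version):
        # Full-copy build: a complete dump of the Table becomes a new version of every slice.
        dest = os.path.join(table_dir, "v%d" % version)
        shutil.copytree(dump_dir, dest)               # in practice a distributed build job, not a local copy
        return dest

    def incremental_merge(table_dir, snapshot_sst_files, version):
        # Incremental merge: changed data arrives as SST files and is ingested as-is;
        # major compaction is deferred so ingestion does not interfere with running queries.
        dest = os.path.join(table_dir, "v%d" % version, "incoming")
        os.makedirs(dest, exist_ok=True)
        for sst in snapshot_sst_files:
            os.link(sst, os.path.join(dest, os.path.basename(sst)))  # ingest by linking, no rewrite
        return dest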
Storage Engine Optimizations
Data-row partitioning: rows are partitioned by a user-defined key, reducing I/O amplification during row-level filtering (similar to ClickHouse and Iceberg).
Incremental data filtering: compaction timing and filter logic are tuned so that scans only touch recently changed files, cutting unnecessary I/O.
Dynamic column layout: during compaction, frequently accessed columns are colocated in the same physical storage block, minimizing column-wise I/O amplification for analytical queries (see the sketch after this list).
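Two of these optimizations can be sketched in a few lines; the partition count, hash routing, and access-count threshold below are assumptions, not the engine's actual policy:

    NUM_PARTITIONS = 64

    def partition_of(user_key):
        # Rows are laid out by a user-defined key, so a filter on that key
        # only needs to read 1/NUM_PARTITIONS of the table instead of all of it.
        return hash(user_key) % NUM_PARTITIONS

    def plan_column_blocks(columns, access_counts, hot_threshold=1000):
        # During compaction, colocate frequently read columns in one physical block
        # so analytical scans of hot columns touch fewer bytes.
        hot = [c for c in columns if access_counts.get(c, 0) >= hot_threshold]
        cold = [c for c in columns if access_counts.get(c, 0) < hot_threshold]
        return {"hot_block": hot, "cold_block": cold}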
Compute and Scheduling Architecture
A custom query language, KQL (SQL‑like), allows users to embed native FaaS functions directly in queries. An example KQL script:
function classify = {  # define a Python FaaS function
    def classify(cbytes, ids):
        # cbytes packs one type id per byte; return True if any of them is in ids
        unique_ids = set(ids)
        classify = int.from_bytes(cbytes, byteorder='little', signed=False)
        while classify != 0:
            tmp = classify & 0xFF
            if tmp in unique_ids:
                return True
            classify = classify >> 8
        return False
}
declare ids = [2, 8];
declare ts_end = function@gettimeofday_us();
declare ts_beg = @ts_end - 24 * 3600 * 1000000;
select * from my_table region in timeliness
    where timestamp between @ts_beg and @ts_end
    filter by function@classify(@cf0:types, @ids)
    convert by json outlet by row;
--multi_output=true;
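Run standalone, the embedded function behaves as shown below; the sample bytes are invented, whereas in the query the input comes from the @cf0:types column:

    def classify(cbytes, ids):
        # Same logic as the FaaS function above.
        unique_ids = set(ids)
        value = int.from_bytes(cbytes, byteorder='little', signed=False)
        while value != 0:
            if (value & 0xFF) in unique_ids:
                return True
            value >>= 8
        return False

    print(classify(bytes([3, 8, 17]), [2, 8]))   # True: one of the packed type ids is 8
    print(classify(bytes([3, 5, 17]), [2, 8]))   # False: no packed type id matches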
The execution pipeline consists of three layers:
Parsing layer – converts KQL into an executable plan and stores it.
Scheduling layer – dispatches plans to execution containers, monitors status, and handles retries.
Execution containers – either Baidu’s proprietary Pioneer platform or an EMR‑style MapReduce cluster; containers run the user‑defined FaaS functions.
Technical and Economic Metrics
Storage cost per PB: -74.6 %
I/O throughput: +495 %
I/O amplification (actual / effective): -70 %
Data‑access latency (days per round): +64.2 %
Development efficiency (person‑days): +83 %
Main Innovations
Compute‑storage separation using low‑cost AFS for OLAP data.
Serverless OLAP design that adapts to intermittent, high‑throughput workloads.
Row‑level partitioning to cut I/O amplification.
Incremental data filtering to avoid full scans of unchanged data.
Dynamic column layout that colocates hot columns during compaction.
KQL‑based SQL task definition with native FaaS support.
The combined architecture delivers substantial gains in I/O capacity, cost efficiency, and developer productivity for large‑scale storage‑driven analytics.