How Baidu’s HTAP Table Storage Achieves Massive IO Gains and Faster Development
Baidu’s Search Content Storage team built an HTAP table storage system and a serverless compute‑scheduling architecture that separate OLTP and OLAP workloads, delivering up to 200 GB/s of peak I/O, cutting storage cost by roughly 75 %, and enabling SQL‑style task development with native FaaS functions.
The article describes Baidu’s HTAP (Hybrid Transactional/Analytical Processing) table storage system and its compute‑scheduling framework, built to satisfy massive OLAP workloads on petabyte‑scale data.
Project Background and Goals
Provide ultra‑high I/O access for petabyte‑scale storage (target 34 GB/s average, 200 GB/s peak).
Enable rapid development and deployment of data‑processing tasks.
Key Challenges
Storage‑I/O bottleneck in the original OLTP‑oriented Table engine.
Mismatch between OLTP‑focused storage layout and OLAP query patterns.
Need for a task‑development model that avoids large intermediate data footprints.
Overall Architecture
The solution separates compute from storage into two independent engines:
Neptune – an OLTP storage engine that continues to serve low‑latency transactional workloads.
Saturn – an OLAP storage engine designed with a serverless model for high‑throughput analytical queries.
Data synchronization between Neptune and Saturn is performed via hard‑link files, allowing fast, low‑cost versioned updates.
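As a rough illustration of why hard links make versioned synchronization cheap, the sketch below publishes a new version of a table by linking its immutable data files instead of copying them (the directory layout, file suffix, and function name are illustrative assumptions, not Neptune's or Saturn's actual interface):

    import os

    def publish_version(src_dir, dest_root, version):
        # Hard links add directory entries that point at the same on-disk blocks,
        # so publishing a new version costs metadata only, not a full data copy.
        dest_dir = os.path.join(dest_root, "v%d" % version)
        os.makedirs(dest_dir, exist_ok=True)
        for name in os.listdir(src_dir):
            if name.endswith(".sst"):  # immutable sorted files, never rewritten in place
                os.link(os.path.join(src_dir, name), os.path.join(dest_dir, name))
        return dest_dir

    # e.g. publish_version("/data/neptune/table_x", "/data/saturn/table_x", 42)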
Neptune (OLTP Engine)
Neptune supports four primitive operations—write, delete, read, and scan—each routed through a RegionMapper that abstracts partitioning.
Partition Types
Index partitions reduce random‑I/O amplification by keeping key‑to‑region mappings.
Data partitions contain multiple locality groups; each group stores rows with the same locality.
Write flow: Row data is serialized into RawData, a region index entry is generated, and both are committed atomically.
Delete flow: The key’s partition is resolved via the index, then the corresponding data entry is removed.
Read flow: The index is consulted; if the key exists, the matching data partition is fetched.
Scan flow: The client specifies target partitions and column families; the RegionMapper selects the appropriate physical handles for scanning (see the sketch after this list).
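A toy model of the four primitives routed through a RegionMapper, with index partitions holding key‑to‑partition mappings and data partitions holding the rows (the class, method names, and in‑memory dictionaries are illustrative; the real engine persists RawData and index entries durably and atomically):

    class RegionMapper:
        def __init__(self, num_partitions=4):
            self.index = {}                                   # index partitions: key -> data partition id
            self.partitions = [dict() for _ in range(num_partitions)]  # data partitions

        def _route(self, key):
            return hash(key) % len(self.partitions)

        def write(self, key, raw_data):
            pid = self._route(key)
            self.partitions[pid][key] = raw_data              # toy RawData: a dict of column -> value
            self.index[key] = pid                             # index entry, committed with the data

        def delete(self, key):
            pid = self.index.pop(key, None)                   # resolve the key's partition via the index
            if pid is not None:
                self.partitions[pid].pop(key, None)

        def read(self, key):
            pid = self.index.get(key)                         # consult the index first
            return None if pid is None else self.partitions[pid].get(key)

        def scan(self, partition_ids, columns):
            for pid in partition_ids:                         # caller names target partitions and column families
                for key, row in self.partitions[pid].items():
                    yield key, {c: row.get(c) for c in columns}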
Saturn (OLAP Engine)
Saturn is built on three layers:
Pluggable File System – supports Baidu’s AFS, local FS, or future file systems.
Table abstraction – a Table consists of multiple slices; each slice can follow either a hash-order or a global-order model.
SDK – provides Seek and Scan APIs that behave like typical column-store interfaces (a toy illustration follows below).
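To show what Seek and Scan over slices look like in the global-order model, here is a self-contained toy (a real slice is a sorted on-disk file rather than an in-memory list; all names here are illustrative):

    import bisect

    class Slice:
        # Toy global-order slice: rows are kept sorted by key, so Seek is a binary search.
        def __init__(self, rows):
            self.rows = sorted(rows, key=lambda kv: kv[0])   # list of (key, row) pairs
            self.keys = [k for k, _ in self.rows]

        def seek(self, key):
            i = bisect.bisect_left(self.keys, key)           # first entry with key >= requested key
            return self.rows[i] if i < len(self.rows) else None

        def scan(self, start, end):
            i = bisect.bisect_left(self.keys, start)
            while i < len(self.rows) and self.keys[i] < end:
                yield self.rows[i]
                i += 1

    class Table:
        # A Table is a collection of slices; a real global-order Table would merge
        # slices by key, while this toy simply visits each slice in turn.
        def __init__(self, slices):
            self.slices = slices

        def scan(self, start, end):
            for s in self.slices:
                yield from s.scan(start, end)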
Data ingestion supports two modes:
Full-copy build: a complete dump of a Table replaces each slice with a new version.
Incremental merge: snapshots of changed data are transformed into Saturn-specific SST files and ingested via an Ingest operation, while major compaction is deferred to avoid interference (see the sketch below).
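A simplified sketch of the two ingestion modes (the function names, paths, and versioning scheme are assumptions for illustration only):

    import os
    import shutil

    def full_copy_build(table_dir, dump_dir, version):
        # Full-copy build: a complete dump of the Table becomes a new version of every slice.
        dest = os.path.join(table_dir, "v%d" % version)
        shutil.copytree(dump_dir, dest)               # in practice a distributed build job, not a local copy
        return dest

    def incremental_merge(table_dir, snapshot_sst_files, version):
        # Incremental merge: changed data arrives as SST files and is ingested as-is;
        # major compaction is deferred so ingestion does not interfere with running queries.
        dest = os.path.join(table_dir, "v%d" % version, "incoming")
        os.makedirs(dest, exist_ok=True)
        for sst in snapshot_sst_files:
            os.link(sst, os.path.join(dest, os.path.basename(sst)))  # ingest by linking, no rewrite
        return dest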
Storage Engine Optimizations
Data-row partitioning: rows are partitioned by a user-defined key, reducing I/O amplification during row-level filtering (similar to ClickHouse and Iceberg).
Incremental data filtering: compaction timing and filter logic are tuned so that scans only touch recently changed files, cutting unnecessary I/O.
Dynamic column layout: during compaction, frequently accessed columns are colocated in the same physical storage block, minimizing column-wise I/O amplification for analytical queries (see the sketch after this list).
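Two of these optimizations can be sketched in a few lines; the partition count, hash routing, and access-count threshold below are assumptions, not the engine's actual policy:

    NUM_PARTITIONS = 64

    def partition_of(user_key):
        # Rows are laid out by a user-defined key, so a filter on that key
        # only needs to read 1/NUM_PARTITIONS of the table instead of all of it.
        return hash(user_key) % NUM_PARTITIONS

    def plan_column_blocks(columns, access_counts, hot_threshold=1000):
        # During compaction, colocate frequently read columns in one physical block
        # so analytical scans of hot columns touch fewer bytes.
        hot = [c for c in columns if access_counts.get(c, 0) >= hot_threshold]
        cold = [c for c in columns if access_counts.get(c, 0) < hot_threshold]
        return {"hot_block": hot, "cold_block": cold}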
Compute and Scheduling Architecture
A custom query language, KQL (SQL‑like), allows users to embed native FaaS functions directly in queries. An example KQL script:
function classify = {  # define a Python FaaS function
    def classify(cbytes, ids):
        # cbytes packs one type id per byte; return True if any of them is in ids
        unique_ids = set(ids)
        classify = int.from_bytes(cbytes, byteorder='little', signed=False)
        while classify != 0:
            tmp = classify & 0xFF
            if tmp in unique_ids:
                return True
            classify = classify >> 8
        return False
}
declare ids = [2, 8];
declare ts_end = function@gettimeofday_us();
declare ts_beg = @ts_end - 24 * 3600 * 1000000;
select * from my_table region in timeliness
    where timestamp between @ts_beg and @ts_end
    filter by function@classify(@cf0:types, @ids)
    convert by json outlet by row;
--multi_output=true;
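Run standalone, the embedded function behaves as shown below; the sample bytes are invented, whereas in the query the input comes from the @cf0:types column:

    def classify(cbytes, ids):
        # Same logic as the FaaS function above.
        unique_ids = set(ids)
        value = int.from_bytes(cbytes, byteorder='little', signed=False)
        while value != 0:
            if (value & 0xFF) in unique_ids:
                return True
            value >>= 8
        return False

    print(classify(bytes([3, 8, 17]), [2, 8]))   # True: one of the packed type ids is 8
    print(classify(bytes([3, 5, 17]), [2, 8]))   # False: no packed type id matches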
The execution pipeline consists of three layers:
Parsing layer – converts KQL into an executable plan and stores it.
Scheduling layer – dispatches plans to execution containers, monitors status, and handles retries.
Execution containers – either Baidu’s proprietary Pioneer platform or an EMR‑style MapReduce cluster; containers run the user‑defined FaaS functions.
Technical and Economic Metrics
Storage cost per PB: -74.6 %
I/O throughput: +495 %
I/O amplification (actual / effective): -70 %
Data‑access latency (days per round): +64.2 %
Development efficiency (person‑days): +83 %
Main Innovations
Compute‑storage separation using low‑cost AFS for OLAP data.
Serverless OLAP design that adapts to intermittent, high‑throughput workloads.
Row‑level partitioning to cut I/O amplification.
Incremental data filtering to avoid full scans of unchanged data.
Dynamic column layout that colocates hot columns during compaction.
KQL‑based SQL task definition with native FaaS support.
The combined architecture delivers substantial gains in I/O capacity, cost efficiency, and developer productivity for large‑scale storage‑driven analytics.