How Baidu Built an HTAP Table Storage System to Tackle Massive Data Analytics
This article examines the HTAP table storage system built by Baidu Search's content storage team, detailing the challenges of supporting massive OLAP workloads on an OLTP‑oriented backend, the architectural split into Neptune and Saturn, storage‑engine optimizations such as row partitioning and dynamic columns, and a SQL‑like KQL framework for compute and scheduling.
1. Business Background
Baidu Search's content storage team is responsible for online read/write (OLTP) and offline high‑throughput computation (OLAP) of various data types such as webpages, images, and relationship graphs.
The original storage layer relied on Baidu's self‑developed table store, which is optimized for OLTP. Growing demands from large‑scale data analysis, AI model training, and full‑web analytics have increased reliance on OLAP, exposing severe resource, throughput, and latency challenges when using an OLTP‑centric architecture.
2. Requirement Analysis
The core problems of supporting OLAP workflows on an OLTP system are:
Severe I/O amplification caused by row‑level and column‑level filtering.
Compute‑storage coupling that leads to resource redundancy on storage nodes and storage‑space bloat.
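To make the first problem concrete, here is a rough back‑of‑the‑envelope calculation in Python; the column and row counts are illustrative assumptions, not measured Baidu figures.

# Illustrative numbers only: an OLAP scan that needs 2 of 40 columns and
# 5% of rows, running against a row-oriented OLTP layout with no partition
# pruning, ends up reading roughly 400x the bytes it actually uses.
useful_fraction = (2 / 40) * 0.05      # fraction of stored bytes the query needs
io_amplification = 1 / useful_fraction
print(io_amplification)                # 400.0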
2.1 Architecture Design
The design follows mainstream HTAP principles by separating OLTP and OLAP into two independent storage systems.
OLTP system – Neptune: Multi‑Raft distributed cluster using local SSD/HDD combined with Baidu Distributed File System (AFS) as the storage medium.
OLAP system – Saturn: Serverless design that loads on demand, matching the intermittent nature of OLAP workflows.
Data synchronization: Files are linked via hard links, enabling low‑cost, fast version replacement and fitting Saturn’s serverless model.
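A minimal sketch of what hard‑link based version publication from Neptune to Saturn could look like; the directory layout and the publish_version name are assumptions for illustration, not the system's actual interface.

import os

def publish_version(neptune_dir, saturn_dir, version, filenames):
    """Expose a set of immutable data files produced by Neptune to Saturn
    by hard-linking them into a new version directory; the bytes are not
    copied, so the "sync" is cheap and near-instant."""
    version_dir = os.path.join(saturn_dir, f"v{version}")
    os.makedirs(version_dir, exist_ok=True)
    for name in filenames:
        src = os.path.join(neptune_dir, name)
        dst = os.path.join(version_dir, name)
        os.link(src, dst)              # hard link: same inode, zero extra data
    # Atomically flip the "current" pointer so Saturn readers switch versions.
    tmp = os.path.join(saturn_dir, "current.tmp")
    os.symlink(version_dir, tmp)
    os.replace(tmp, os.path.join(saturn_dir, "current"))
    return version_dir

Because only directory entries change, an old version can also be dropped cheaply once no query still references it.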
By routing OLTP and OLAP workflows to separate systems, the compute‑storage coupling issue is eliminated.
2.2 Storage Space Amplification Solution
Moving OLAP data files to low‑cost AFS reduces storage‑node pressure and mitigates space bloat.
2.3 Storage Engine Optimizations
2.3.1 Data Row Partitioning
Row partitioning, common in modern OLAP engines (e.g., ClickHouse, Iceberg), dramatically reduces I/O amplification during row‑level filtering.
Write: The request contains a partition key; the engine locates the appropriate Region‑Index and Region context, writes data, and records the operation in the WAL.
Read: An additional partition‑index lookup is required; the index is fully cached in memory because it is tiny, minimizing latency.
Scan: Partition metadata allows the scanner to skip large blocks of irrelevant rows, cutting I/O.
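A simplified Python sketch of these write/read/scan paths with partition‑key routing; the fixed region count, hash choice, and class layout are illustrative assumptions rather than the engine's real structures.

import hashlib
from collections import defaultdict

NUM_REGIONS = 64  # assumed fixed partition count for illustration

def region_of(partition_key: str) -> int:
    """Map a partition key to a region via a stable hash."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "little") % NUM_REGIONS

class PartitionedEngine:
    def __init__(self):
        self.regions = defaultdict(dict)   # region id -> in-memory rows
        self.wal = []                      # write-ahead log of (region, key, value)

    def write(self, partition_key, row_key, value):
        rid = region_of(partition_key)
        self.wal.append((rid, row_key, value))   # log first for durability
        self.regions[rid][row_key] = value

    def read(self, partition_key, row_key):
        # One extra region-index lookup, then a point read inside that region.
        return self.regions[region_of(partition_key)].get(row_key)

    def scan(self, partition_key):
        # Only the matching region is touched; all other regions are skipped.
        return self.regions[region_of(partition_key)].items()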
2.3.2 Incremental Data Filtering
For recent‑change queries (e.g., last few hours), the engine adjusts compaction timing and filter flow to limit the set of data files accessed, thereby reducing I/O amplification and speeding up the workload.
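A hedged sketch of file‑level pruning for a recent‑change query, assuming each data file carries min/max timestamp metadata (the dict layout used here is hypothetical).

import time

def files_for_incremental_scan(file_metas, window_seconds):
    """Keep only data files whose max timestamp falls inside the recent
    window, so an incremental query never opens older, already-compacted
    files. file_metas is an iterable of dicts with 'path', 'min_ts' and
    'max_ts' in microseconds."""
    now_us = int(time.time() * 1_000_000)
    begin_us = now_us - window_seconds * 1_000_000
    return [m["path"] for m in file_metas if m["max_ts"] >= begin_us]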
2.3.3 Dynamic Column Structure
To address column‑level I/O amplification, a dynamic column layout is introduced. Frequently accessed columns are grouped into the same physical locality group during compaction, while less‑used columns are placed elsewhere, reducing the amount of data read for column‑specific queries.
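A toy sketch of how columns might be regrouped into locality groups at compaction time based on recent access statistics; the hot_ratio threshold and function name are assumptions, not the engine's actual policy.

def plan_locality_groups(access_counts, hot_ratio=0.2):
    """Split columns into a 'hot' locality group (frequently read together)
    and a 'cold' group, so column-projection queries touch fewer bytes.
    access_counts maps column name -> read count from recent query stats."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    cutoff = max(1, int(len(ranked) * hot_ratio))
    return {"hot": ranked[:cutoff], "cold": ranked[cutoff:]}

# Example: plan_locality_groups({"title": 900, "url": 850, "raw_html": 12})
# -> {'hot': ['title'], 'cold': ['url', 'raw_html']}
# During compaction, each group would then be written to its own physical file.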
3. Compute and Scheduling
3.1 SQL‑like and FaaS‑enabled Queries
A custom query language, KQL, was built. KQL resembles SQL syntax but adds native support for defining and invoking Python FaaS functions and external FaaS packages.
function classify = {  # define a Python FaaS function
    def classify(cbytes, ids):
        unique_ids = set(ids)
        classify = int.from_bytes(cbytes, byteorder='little', signed=False)
        while classify != 0:
            tmp = classify & 0xFF
            if tmp in unique_ids:
                return True
            classify = classify >> 8
        return False
}
declare ids = [2, 8];
declare ts_end = function@gettimeofday_us(); # native time function
declare ts_beg = @ts_end - 24 * 3600 * 1000000; # 24‑hour window
select * from my_table region in timeliness
where timestamp between @ts_beg and @ts_end
filter by function@classify(@cf0:types, @ids)
convert by json outlet by row;
desc:
    --multi_output=true;
3.2 Task Generation and Scheduling
KQL statements are parsed into jobs composed of multiple tasks. Task generation optimizes the query plan; task scheduling can run on Map‑Reduce, Baidu’s Pioneer cluster, or other compute frameworks. Execution containers manage data dependencies and load a plug‑in compute framework that can invoke user‑defined FaaS functions.
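A rough sketch of what the parsed output of a KQL statement might look like as a job of tasks handed to a scheduler backend; the dataclass fields and backend names are illustrative assumptions, not the real plan format.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Task:
    """One unit of work: scan a region and apply any FaaS filters."""
    region: int
    faas_filters: List[Callable] = field(default_factory=list)

@dataclass
class Job:
    """A compiled KQL statement: a set of tasks plus the target compute backend."""
    query: str
    tasks: List[Task]
    backend: str = "mapreduce"   # or "pioneer", or another plug-in framework

def generate_job(query: str, regions: List[int], faas_filters=None) -> Job:
    # One task per region keeps tasks independent, so any backend can run them in parallel.
    return Job(query=query,
               tasks=[Task(region=r, faas_filters=faas_filters or []) for r in regions])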
4. Conclusion
The HTAP table storage system is already deployed in full‑web offline acceleration, AI model training data management, image storage, and other online/offline scenarios, handling more than 15 PB of data and delivering a performance improvement of over 50%. With the advent of large‑model workloads, further optimizations are planned to achieve higher throughput and lower latency.