How Baidu’s Data Lake Acceleration 2.0 Supercharges Big Data and AI Workloads
Baidu's Data Lake Acceleration 2.0 replaces HDFS with a scalable object‑storage foundation and introduces a hierarchical Namespace 2.0, a high‑throughput streaming engine, RapidFS caching, and a fully HDFS‑compatible BOS‑HDFS layer — delivering up to 70% higher throughput and dramatically lower costs for big data and AI pipelines.
Background
The concept of a data lake has evolved for over a decade, with most organizations moving from traditional HDFS storage to object storage for better scalability.
Limitations of HDFS
Coupled compute and storage resources lead to waste whenever capacity planning for one side does not match the other.
Scale ceiling: a single HDFS NameNode struggles beyond 1 billion files, while modern AI workloads require tens to hundreds of billions.
Heavy operational burden requiring specialized HDFS expertise.
Advantages of Object Storage
Compute‑storage separation enables independent scaling and greater elasticity.
Higher scalability, cloud‑native design, multi‑tiered cost‑effective storage, and minimal operational overhead.
Challenges with Object Storage
Performance can degrade due to added authentication, protocol translation, and extra gateway and front‑end/back‑end hops that increase latency. Compatibility with the existing Hadoop ecosystem (MapReduce, Spark, Hive) also poses difficulties.
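One concrete example of the semantic gap: flat object stores have no true directories, so an operation HDFS treats as a single metadata update — renaming a directory — becomes one copy plus one delete per object. The toy in‑memory model below is a sketch of that cost, not any real object‑store API.

```python
# Sketch: why a directory "rename" is expensive on a flat object store.
# There are no real directories, so renaming a prefix means copying and
# deleting every object under it. (Toy in-memory model; real stores
# expose this as per-object copy + delete requests.)

def rename_prefix(store: dict, old: str, new: str) -> int:
    """Rename every object under prefix `old` to `new`; return ops performed."""
    ops = 0
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store.pop(key)  # one copy + one delete
        ops += 2
    return ops

store = {f"logs/2024/part-{i}": b"data" for i in range(3)}
ops = rename_prefix(store, "logs/2024/", "archive/2024/")
# three objects -> six requests, where HDFS would do one metadata update
```

This is exactly the class of overhead that an HDFS‑compatible metadata layer on top of object storage is meant to hide.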
Data Lake Acceleration 2.0 Highlights
Namespace 2.0: Adaptive hierarchical namespace that balances scale and performance.
Streaming Storage Engine: Optimized for big data, delivering >70% higher single‑stream throughput.
RapidFS Managed Service: Provides end‑to‑end acceleration for data caching and write acceleration.
BOS‑HDFS: Fully HDFS‑API‑compatible layer enabling seamless migration without code changes.
Namespace Evolution
Three generations are described:
First‑gen: flat, in‑memory directory tree (e.g., classic HDFS) – high performance but limited to ~1 billion entries.
Second‑gen: distributed database‑backed (e.g., Facebook Tectonic) – linear scalability but higher latency due to RPC and two‑phase commits.
Third‑gen (Baidu): a unified single‑node/distributed architecture built on TafDB, offering microsecond latency at small scale and seamless horizontal scaling at large scale.
Implementation Details
In the single‑node mode, inode and directory metadata are co‑located on the same storage nodes, eliminating cross‑node transactions. When the file count exceeds the 1 billion threshold, the metadata database shards automatically, transitioning to a distributed layout without impacting applications.
To improve throughput, Baidu pushes file‑semantic operations down to the database layer, achieving up to 100,000 transactions per second (TPS) per bucket.
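The adaptive single‑node/distributed behavior described above can be pictured with a small sketch: metadata lives in one co‑located store (no cross‑node transactions) until a file‑count threshold is crossed, then entries are hash‑sharded transparently to callers. The 1‑billion threshold mirrors the article; the shard count and class names are purely illustrative, not TafDB internals.

```python
# Sketch of an adaptive metadata namespace: single-node mode until a
# file-count threshold, then hash-sharded. Threshold mirrors the article;
# the shard count and structure are illustrative assumptions.

SHARD_THRESHOLD = 1_000_000_000
NUM_SHARDS = 16

class AdaptiveNamespace:
    def __init__(self):
        self.file_count = 0
        self.shards = [dict()]              # single-node mode: one shard

    def _shard_for(self, path: str) -> dict:
        if len(self.shards) == 1:           # co-located: no routing needed
            return self.shards[0]
        return self.shards[hash(path) % len(self.shards)]

    def create(self, path: str, inode: dict) -> None:
        if self.file_count >= SHARD_THRESHOLD and len(self.shards) == 1:
            self._split()                   # transparent to applications
        self._shard_for(path)[path] = inode
        self.file_count += 1

    def _split(self) -> None:
        """Re-distribute existing entries across NUM_SHARDS hash shards."""
        old = self.shards[0]
        self.shards = [dict() for _ in range(NUM_SHARDS)]
        for path, inode in old.items():
            self._shard_for(path)[path] = inode

    def lookup(self, path: str):
        return self._shard_for(path).get(path)
```

The key property the sketch shows is that `create` and `lookup` keep the same signature before and after the split, which is what lets the transition happen without impacting applications.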
Engine Optimizations for Big Data & AI
The storage engine now lays data out in larger contiguous blocks, so big sequential reads avoid hotspots and exploit HDD sequential throughput, achieving ~300 MB/s per stream (≈70% improvement over native HDFS).
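The read‑side benefit of large blocks can be sketched in a few lines: one large sequential request replaces many small ones, which is what keeps an HDD streaming instead of seeking. The 4 MiB block size below is an illustrative choice, not the engine's actual block size.

```python
# Sketch: reading a stream in large sequential blocks. Fewer, bigger
# requests keep an HDD streaming rather than seeking. The 4 MiB block
# size is illustrative, not the engine's actual value.

import io

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB

def sequential_read(f) -> tuple[bytes, int]:
    """Read the whole stream in large blocks; return (data, request count)."""
    chunks, requests = [], 0
    while chunk := f.read(BLOCK_SIZE):
        chunks.append(chunk)
        requests += 1
    return b"".join(chunks), requests

data = b"x" * (10 * 1024 * 1024)             # 10 MiB payload
payload, requests = sequential_read(io.BytesIO(data))
# 10 MiB in 4 MiB blocks -> 3 requests (4 + 4 + 2 MiB)
```

With a small block size (say 64 KiB) the same payload would take 160 requests, so each of which is a seek opportunity on spinning disks.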
RapidFS Caching Benefits
For random‑read‑intensive workloads, caching large blocks in RapidFS yields >3× performance gains.
In multimodal training, intelligent pre‑warming to RapidFS raises GPU utilization above 98%.
Model distribution for inference sees >50% latency reduction and minute‑level deployment across thousands of GPUs.
Checkpoint persistence for large‑model training is accelerated, cutting write time by 80%.
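The pre‑warming idea behind several of these numbers can be shown with a minimal cache sketch: bulk‑load the working set before the workload starts so random reads hit memory instead of object storage. The dict‑backed cache and store below are stand‑ins for RapidFS and BOS, not their APIs.

```python
# Sketch: pre-warming a local cache (stand-in for RapidFS) so that a
# random-read workload hits the cache instead of the backing object
# store (stand-in for BOS). Names and structure are illustrative.

class WarmedCache:
    def __init__(self, backing_store: dict):
        self.backing = backing_store
        self.cache: dict = {}
        self.hits = self.misses = 0

    def prewarm(self, keys) -> None:
        """Bulk-load the listed objects before the workload starts."""
        for k in keys:
            self.cache[k] = self.backing[k]

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1                     # slow path: fetch from the store
        self.cache[key] = self.backing[key]
        return self.cache[key]

store = {f"sample-{i}": bytes([i]) for i in range(8)}
cache = WarmedCache(store)
cache.prewarm(store)                         # warm the full training set
reads = [cache.get(f"sample-{i}") for i in range(8)]
# every read is a cache hit; the GPU never waits on object storage
```

In a real training pipeline the pre‑warm step would overlap with the previous epoch, which is how data stalls are kept off the GPU's critical path.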
HDFS Compatibility via BOS‑HDFS
BOS‑HDFS offers 100% HDFS API compatibility, including atomic rename, vectored I/O, and append/truncate, plus additional features such as intelligent tiering, quotas, audit logging, and server‑side encryption. It also supports seamless migration from Kerberos + Ranger to token‑based authentication.
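Because the API surface is unchanged, "migration without code changes" reduces to rewriting filesystem URIs: application code keeps calling the same file operations while only the scheme and authority change. The sketch below shows that rewrite; the `bos://` scheme and bucket name are illustrative assumptions, not documented values.

```python
# Sketch: with an HDFS-compatible layer, migration can be a URI swap.
# Application code keeps the same filesystem calls; only the scheme and
# authority change. The `bos://` scheme and bucket name here are
# illustrative assumptions, not documented configuration values.

from urllib.parse import urlparse, urlunparse

def migrate_uri(uri: str, new_scheme: str, new_authority: str) -> str:
    """Rewrite an hdfs://namenode/... path onto a new filesystem root."""
    parts = urlparse(uri)
    return urlunparse(parts._replace(scheme=new_scheme, netloc=new_authority))

old = "hdfs://namenode:8020/warehouse/events/dt=2024-01-01"
new = migrate_uri(old, "bos", "my-bucket")
# the path component /warehouse/events/dt=2024-01-01 is untouched
```

In a Hadoop deployment the equivalent change typically lives in configuration (the default filesystem setting) rather than in job code, which is what makes the cutover transparent.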
Typical Customer Scenario
A leading domestic AI customer:
Reduced data preprocessing costs by 60% using BOS as the lake storage.
Accelerated model training with PFS + BOS (hot data on PFS, cold data on BOS).
Cut inference model distribution latency by over 50% with RapidFS.
Conclusion
The integrated stack—BOS object storage, RapidFS, CFS, and PFS—provides end‑to‑end solutions for data processing, model development, training, and inference in cloud‑native environments.
Baidu Intelligent Cloud Tech Hub
We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.