How Baidu’s Data Lake Acceleration 2.0 Supercharges Big Data and AI Workloads
Baidu's Data Lake Acceleration 2.0 replaces HDFS with a scalable object‑storage foundation and introduces a hierarchical Namespace 2.0, a high‑throughput streaming engine, RapidFS caching, and a fully HDFS‑compatible BOS‑HDFS layer — delivering up to 70% higher throughput and dramatically lower costs for big data and AI pipelines.
Background
The concept of a data lake has evolved for over a decade, with most organizations moving from traditional HDFS storage to object storage for better scalability.
Limitations of HDFS
Coupled compute and storage resources lead to waste whenever capacity planning for one side does not match the other.
Scale ceiling: a single HDFS NameNode struggles beyond 1 billion files, while modern AI workloads require tens to hundreds of billions.
Heavy operational burden requiring specialized HDFS expertise.
Advantages of Object Storage
Compute‑storage separation enables independent scaling and greater elasticity.
Higher scalability, cloud‑native design, multi‑tiered cost‑effective storage, and minimal operational overhead.
Challenges with Object Storage
Performance can degrade due to added authentication, protocol translation, and extra gateway and front‑end/back‑end hops that increase latency. Compatibility with the existing Hadoop ecosystem (MapReduce, Spark, Hive) also poses difficulties.
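One concrete example of the semantic gap: flat object stores have no true directories, so an operation HDFS treats as a single metadata update — renaming a directory — becomes one copy plus one delete per object. The toy in‑memory model below is a sketch of that cost, not any real object‑store API.

```python
# Sketch: why a directory "rename" is expensive on a flat object store.
# There are no real directories, so renaming a prefix means copying and
# deleting every object under it. (Toy in-memory model; real stores
# expose this as per-object copy + delete requests.)

def rename_prefix(store: dict, old: str, new: str) -> int:
    """Rename every object under prefix `old` to `new`; return ops performed."""
    ops = 0
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store.pop(key)  # one copy + one delete
        ops += 2
    return ops

store = {f"logs/2024/part-{i}": b"data" for i in range(3)}
ops = rename_prefix(store, "logs/2024/", "archive/2024/")
# three objects -> six requests, where HDFS would do one metadata update
```

This is exactly the class of overhead that an HDFS‑compatible metadata layer on top of object storage is meant to hide.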
Data Lake Acceleration 2.0 Highlights
Namespace 2.0: Adaptive hierarchical namespace that balances scale and performance.
Streaming Storage Engine: Optimized for big data, delivering >70% higher single‑stream throughput.
RapidFS Managed Service: Provides end‑to‑end acceleration for data caching and write acceleration.
BOS‑HDFS: Fully HDFS‑API‑compatible layer enabling seamless migration without code changes.
Namespace Evolution
Three generations are described:
First‑gen: flat, in‑memory directory tree (e.g., classic HDFS) – high performance but limited to ~1 billion entries.
Second‑gen: distributed database‑backed (e.g., Facebook Tectonic) – linear scalability but higher latency due to RPC and two‑phase commits.
Third‑gen (Baidu): a unified single‑node/distributed architecture built on TafDB, offering microsecond latency at small scale and seamless horizontal scaling at large scale.
Implementation Details
In the single‑node mode, inode and directory metadata are co‑located on the same storage nodes, eliminating cross‑node transactions. When the file count exceeds the 1 billion threshold, the metadata database shards automatically, transitioning to a distributed layout without impacting applications.
To improve throughput, Baidu pushes file‑semantic operations down to the database layer, achieving up to 100,000 transactions per second (TPS) per bucket.
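The adaptive single‑node/distributed behavior described above can be pictured with a small sketch: metadata lives in one co‑located store (no cross‑node transactions) until a file‑count threshold is crossed, then entries are hash‑sharded transparently to callers. The 1‑billion threshold mirrors the article; the shard count and class names are purely illustrative, not TafDB internals.

```python
# Sketch of an adaptive metadata namespace: single-node mode until a
# file-count threshold, then hash-sharded. Threshold mirrors the article;
# the shard count and structure are illustrative assumptions.

SHARD_THRESHOLD = 1_000_000_000
NUM_SHARDS = 16

class AdaptiveNamespace:
    def __init__(self):
        self.file_count = 0
        self.shards = [dict()]              # single-node mode: one shard

    def _shard_for(self, path: str) -> dict:
        if len(self.shards) == 1:           # co-located: no routing needed
            return self.shards[0]
        return self.shards[hash(path) % len(self.shards)]

    def create(self, path: str, inode: dict) -> None:
        if self.file_count >= SHARD_THRESHOLD and len(self.shards) == 1:
            self._split()                   # transparent to applications
        self._shard_for(path)[path] = inode
        self.file_count += 1

    def _split(self) -> None:
        """Re-distribute existing entries across NUM_SHARDS hash shards."""
        old = self.shards[0]
        self.shards = [dict() for _ in range(NUM_SHARDS)]
        for path, inode in old.items():
            self._shard_for(path)[path] = inode

    def lookup(self, path: str):
        return self._shard_for(path).get(path)
```

The key property the sketch shows is that `create` and `lookup` keep the same signature before and after the split, which is what lets the transition happen without impacting applications.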
Engine Optimizations for Big Data & AI
The storage engine now lays data out in larger contiguous blocks, so big sequential reads avoid hotspots and exploit HDD sequential throughput, achieving ~300 MB/s per stream (≈70% improvement over native HDFS).
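The read‑side benefit of large blocks can be sketched in a few lines: one large sequential request replaces many small ones, which is what keeps an HDD streaming instead of seeking. The 4 MiB block size below is an illustrative choice, not the engine's actual block size.

```python
# Sketch: reading a stream in large sequential blocks. Fewer, bigger
# requests keep an HDD streaming rather than seeking. The 4 MiB block
# size is illustrative, not the engine's actual value.

import io

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB

def sequential_read(f) -> tuple[bytes, int]:
    """Read the whole stream in large blocks; return (data, request count)."""
    chunks, requests = [], 0
    while chunk := f.read(BLOCK_SIZE):
        chunks.append(chunk)
        requests += 1
    return b"".join(chunks), requests

data = b"x" * (10 * 1024 * 1024)             # 10 MiB payload
payload, requests = sequential_read(io.BytesIO(data))
# 10 MiB in 4 MiB blocks -> 3 requests (4 + 4 + 2 MiB)
```

With a small block size (say 64 KiB) the same payload would take 160 requests, so each of which is a seek opportunity on spinning disks.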
RapidFS Caching Benefits
For random‑read‑intensive workloads, caching large blocks in RapidFS yields >3× performance gains.
In multimodal training, intelligent pre‑warming to RapidFS raises GPU utilization above 98%.
Model distribution for inference sees >50% latency reduction and minute‑level deployment across thousands of GPUs.
Checkpoint persistence for large‑model training is accelerated, cutting write time by 80%.
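The pre‑warming idea behind several of these numbers can be shown with a minimal cache sketch: bulk‑load the working set before the workload starts so random reads hit memory instead of object storage. The dict‑backed cache and store below are stand‑ins for RapidFS and BOS, not their APIs.

```python
# Sketch: pre-warming a local cache (stand-in for RapidFS) so that a
# random-read workload hits the cache instead of the backing object
# store (stand-in for BOS). Names and structure are illustrative.

class WarmedCache:
    def __init__(self, backing_store: dict):
        self.backing = backing_store
        self.cache: dict = {}
        self.hits = self.misses = 0

    def prewarm(self, keys) -> None:
        """Bulk-load the listed objects before the workload starts."""
        for k in keys:
            self.cache[k] = self.backing[k]

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1                     # slow path: fetch from the store
        self.cache[key] = self.backing[key]
        return self.cache[key]

store = {f"sample-{i}": bytes([i]) for i in range(8)}
cache = WarmedCache(store)
cache.prewarm(store)                         # warm the full training set
reads = [cache.get(f"sample-{i}") for i in range(8)]
# every read is a cache hit; the GPU never waits on object storage
```

In a real training pipeline the pre‑warm step would overlap with the previous epoch, which is how data stalls are kept off the GPU's critical path.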
HDFS Compatibility via BOS‑HDFS
BOS‑HDFS offers 100% HDFS API compatibility, including atomic rename, vectored I/O, and append/truncate, plus additional features such as intelligent tiering, quotas, audit logging, and server‑side encryption. It also supports seamless migration from Kerberos + Ranger to token‑based authentication.
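Because the API surface is unchanged, "migration without code changes" reduces to rewriting filesystem URIs: application code keeps calling the same file operations while only the scheme and authority change. The sketch below shows that rewrite; the `bos://` scheme and bucket name are illustrative assumptions, not documented values.

```python
# Sketch: with an HDFS-compatible layer, migration can be a URI swap.
# Application code keeps the same filesystem calls; only the scheme and
# authority change. The `bos://` scheme and bucket name here are
# illustrative assumptions, not documented configuration values.

from urllib.parse import urlparse, urlunparse

def migrate_uri(uri: str, new_scheme: str, new_authority: str) -> str:
    """Rewrite an hdfs://namenode/... path onto a new filesystem root."""
    parts = urlparse(uri)
    return urlunparse(parts._replace(scheme=new_scheme, netloc=new_authority))

old = "hdfs://namenode:8020/warehouse/events/dt=2024-01-01"
new = migrate_uri(old, "bos", "my-bucket")
# the path component /warehouse/events/dt=2024-01-01 is untouched
```

In a Hadoop deployment the equivalent change typically lives in configuration (the default filesystem setting) rather than in job code, which is what makes the cutover transparent.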
Typical Customer Scenario
A leading domestic AI customer:
Reduced data preprocessing costs by 60% using BOS as the lake storage.
Accelerated model training with PFS + BOS (hot data on PFS, cold data on BOS).
Cut inference model distribution latency by over 50% with RapidFS.
Conclusion
The integrated stack—BOS object storage, RapidFS, CFS, and PFS—provides end‑to‑end solutions for data processing, model development, training, and inference in cloud‑native environments.
Baidu Intelligent Cloud Tech Hub
We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.