
Why Object Storage Is Replacing HDFS for Modern Data Lakes: Baidu’s 2.0 Acceleration

Data lakes have evolved from HDFS to object storage, addressing resource inefficiency, scalability limits, and operational burdens; Baidu’s Data Lake Storage Acceleration 2.0 introduces hierarchical Namespace 2.0, a streaming storage engine, RapidFS caching, and a fully HDFS‑compatible BOS‑HDFS layer to boost performance and support massive AI workloads.

Baidu Geek Talk

Background: From HDFS to Object Storage

Since its inception in 2012, the concept of a data lake has been interpreted differently by various companies, but a clear trend has emerged: the storage foundation is shifting from traditional HDFS to object storage.

Limitations of Traditional HDFS‑Based Big Data Stacks

Resource coupling – compute and storage are tightly bound, making capacity planning for future workloads difficult and often leading to wasted resources.

Scalability ceiling – a single HDFS NameNode can manage up to about 1 billion files, far below the hundreds of billions of files required by large‑model training.

Operational overhead – managing multi‑petabyte HDFS clusters demands specialized expertise and heavy maintenance effort.

Advantages of Object Storage for Data Lakes

Compute‑storage separation enables independent scaling of resources, providing greater elasticity.

Higher scalability and cloud‑native characteristics such as low‑maintenance multi‑tier storage reduce total cost.

Remaining Challenges When Replacing HDFS

Performance degradation, compatibility with existing Hadoop ecosystems, and increased latency due to authentication, protocol translation, and network hops are the primary obstacles.

Baidu’s Data Lake Storage Acceleration 2.0

Namespace 2.0 – a hierarchical namespace that adapts storage architecture based on scale, achieving a balance between size and performance.

Streaming Storage Engine – optimized for big‑data and AI workloads, delivering more than 70% higher single‑stream throughput compared with HDFS.

RapidFS – a hosted caching product that accelerates data reads/writes, supports up to 100 k TPS per bucket, and improves random‑read performance by over 3×.

BOS‑HDFS – a new version of the compatibility layer over BOS (Baidu Object Storage) offering 100% HDFS API compatibility, enabling seamless migration of Hadoop, Spark, Hive, and other upstream components without code changes.

Architecture Details: Single‑Node vs Distributed Namespace

Both architectures are built on Baidu’s self‑developed distributed metadata store (TafDB). In the single‑node mode, inode and directory metadata reside on the same storage node, eliminating cross‑node RPCs and preserving low latency. When the file count reaches the 1 billion threshold, the system automatically splits the metadata tables across multiple nodes, transitioning to a distributed mode without affecting upper‑layer applications.
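The threshold-driven mode switch can be pictured with a minimal sketch. Everything here is an illustrative assumption, not TafDB's actual design: the class names, the configurable threshold, and the hash-by-top-directory sharding scheme.

```python
# Sketch: a namespace that starts on a single metadata node and shards
# across nodes once the file count crosses a threshold. All names and the
# sharding scheme are illustrative assumptions, not TafDB internals.

SPLIT_THRESHOLD = 1_000_000_000  # ~1 billion files, per the article


class Namespace:
    def __init__(self, nodes, split_threshold=SPLIT_THRESHOLD):
        self.nodes = nodes              # available metadata nodes
        self.split_threshold = split_threshold
        self.file_count = 0
        self.distributed = False        # start in single-node mode

    def node_for(self, path: str):
        """Route a metadata operation to its owning node."""
        if not self.distributed:
            return self.nodes[0]        # single node: no cross-node RPCs
        # Distributed mode: shard by top-level directory (illustrative).
        top = path.strip("/").split("/")[0]
        return self.nodes[hash(top) % len(self.nodes)]

    def create_file(self, path: str):
        self.file_count += 1
        if not self.distributed and self.file_count >= self.split_threshold:
            self.distributed = True     # transparent to upper-layer apps
        return self.node_for(path)
```

The key property the sketch preserves is that callers never change: `node_for` keeps the same signature before and after the split, which is what "without affecting upper-layer applications" amounts to.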

To improve throughput, file‑semantic operations are pushed down to the distributed database layer, reducing communication with higher‑level components and achieving up to 100 k TPS per bucket.
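The pushdown idea can be sketched as follows: a directory rename executed as one round-trip applied atomically inside the metadata store, instead of one client RPC per child entry. The in-memory "store" and all names are illustrative assumptions.

```python
class MetadataStore:
    """Toy key-value metadata store; keys are full file paths."""

    def __init__(self):
        self.entries = {}
        self.rpc_count = 0  # counts client <-> store round-trips

    def put(self, path):
        self.rpc_count += 1
        self.entries[path] = True

    def rename_dir(self, src, dst):
        """Pushed-down rename: a single round-trip, applied entirely
        inside the store rather than entry-by-entry from the client."""
        self.rpc_count += 1
        prefix = src.rstrip("/") + "/"
        moved = {dst.rstrip("/") + "/" + p[len(prefix):]: v
                 for p, v in self.entries.items() if p.startswith(prefix)}
        self.entries = {p: v for p, v in self.entries.items()
                        if not p.startswith(prefix)}
        self.entries.update(moved)
```

With a million files under a directory, the difference between one round-trip and a million is where the throughput headroom comes from.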

Performance Optimizations for Big Data and AI

The storage engine now creates larger blocks and places them sequentially, leveraging HDD sequential read throughput and avoiding hotspots. Single‑stream throughput reaches ~300 MB/s, a >70% improvement over native HDFS.
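A back-of-the-envelope model shows why block size matters on HDDs: each block boundary can cost a seek. The 150 MB/s bandwidth and 8 ms seek figures below are generic HDD assumptions for illustration, not Baidu's measurements.

```python
# Toy cost model: reading N MB split into fixed-size blocks, where every
# block boundary pays one seek. Bandwidth and seek time are assumed
# generic HDD figures, not measured values.

SEQ_BW_MBPS = 150.0  # sustained sequential read bandwidth (assumption)
SEEK_MS = 8.0        # average seek + rotational latency (assumption)


def read_time_s(total_mb: float, block_mb: float) -> float:
    """Seconds to read `total_mb` in `block_mb` blocks, one seek each."""
    n_blocks = total_mb / block_mb
    return total_mb / SEQ_BW_MBPS + n_blocks * SEEK_MS / 1000.0


small_blocks = read_time_s(1024, 4)    # seek-dominated
large_blocks = read_time_s(1024, 256)  # bandwidth-bound
```

Under these assumptions, reading 1 GB in 4 MB blocks pays 256 seeks while 256 MB blocks pay only 4, so the large-block layout stays close to raw sequential bandwidth.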

RapidFS caching dramatically speeds up random‑read‑intensive workloads, providing >3× performance gains, and in multimodal training scenarios it raises GPU utilization to over 98% through intelligent pre‑warming.
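The pre-warming idea reduces to populating a cache ahead of the training loop so random reads hit memory instead of object storage. The sketch below is a toy model; the class, its API, and the caller-supplied key set (a real system would predict it) are all assumptions.

```python
class PrewarmCache:
    """Toy read cache over a slow backing store (stands in for
    object storage). All names are illustrative assumptions."""

    def __init__(self, backing):
        self.backing = backing  # dict of key -> bytes
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def prewarm(self, keys):
        """Bulk-load keys before training starts. Intelligent pre-warming
        would predict this set; here the caller supplies it."""
        for k in keys:
            self.cache[k] = self.backing[k]

    def read(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1  # slow path: fetch from the backing store
        self.cache[key] = self.backing[key]
        return self.cache[key]
```

If the training loop's random reads all land in the pre-warmed set, the GPU never stalls on an object-storage round-trip, which is the mechanism behind the utilization claim.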

For large‑model checkpoint persistence, data is first written to RapidFS’s distributed cache and then asynchronously flushed to object storage, cutting checkpoint write time by 80%.
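That write-behind pattern can be sketched with a background flusher: the trainer's save call returns as soon as the data lands in the cache, while a worker thread uploads to object storage. The queue-based design and every name here are assumptions, not RapidFS internals.

```python
import queue
import threading


class CheckpointWriter:
    """Write-behind checkpointing: fast cache write, async upload.
    `cache` stands in for RapidFS, `object_store` for BOS."""

    def __init__(self, object_store: dict):
        self.cache = {}
        self.object_store = object_store
        self.pending = queue.Queue()
        self.worker = threading.Thread(target=self._flush_loop, daemon=True)
        self.worker.start()

    def save(self, name: str, data: bytes):
        self.cache[name] = data  # training resumes once this returns
        self.pending.put(name)   # upload happens in the background

    def _flush_loop(self):
        while True:
            name = self.pending.get()
            self.object_store[name] = self.cache[name]
            self.pending.task_done()

    def drain(self):
        """Block until all queued checkpoints reach object storage."""
        self.pending.join()
```

The latency the trainer sees is the cache write, not the object-storage upload, which is where the 80% reduction in checkpoint write time comes from.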

Compatibility, Security, and Additional Features

BOS‑HDFS supports atomic rename, vectored I/O, file append/truncate, and other POSIX semantics, and integrates with Kerberos, Ranger, and temporary token mechanisms for seamless authentication.

Additional capabilities include intelligent data tiering, sub‑directory quotas, audit logging, and server‑side encryption.
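Of these, sub-directory quotas are the easiest to picture: a write must fit under the byte limit of every directory on its path. This is a toy model; real enforcement presumably happens in the metadata layer, and all names are assumptions.

```python
class QuotaTree:
    """Toy sub-directory quota check: a write must fit under every
    quota from the file's parent directory up to the root."""

    def __init__(self):
        self.limits = {}  # dir path -> max bytes
        self.usage = {}   # dir path -> current bytes

    def set_quota(self, directory: str, max_bytes: int):
        self.limits[directory] = max_bytes

    @staticmethod
    def ancestors(path: str):
        parts = path.strip("/").split("/")[:-1]  # drop the file name
        return ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]

    def write(self, path: str, nbytes: int):
        dirs = self.ancestors(path)
        for d in dirs:  # check every quota before committing anything
            if d in self.limits and self.usage.get(d, 0) + nbytes > self.limits[d]:
                raise PermissionError(f"quota exceeded under {d}")
        for d in dirs:
            self.usage[d] = self.usage.get(d, 0) + nbytes
```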

Typical Use Cases

Data preprocessing: using BOS as the lake storage reduces costs by 60% compared with self‑built HDFS.

Model training: hot data is served from PFS (Baidu's Parallel File Storage), cold data from BOS, achieving high throughput.

Model inference: RapidFS accelerates model distribution, cutting end‑to‑end latency by more than 50%.

Conclusion

By combining object‑storage‑based scalability with a flexible hierarchical namespace and high‑performance caching, Baidu’s Data Lake Storage Acceleration 2.0 provides an end‑to‑end solution for big‑data and AI workloads, offering seamless HDFS compatibility, reduced operational burden, and significant performance gains.

Tags: Big Data · AI · Data Lake · Object Storage · Baidu · Storage Acceleration · HDFS Compatibility
Written by

Baidu Geek Talk

Follow us to discover more Baidu tech insights.
