
Accelerating Data Lake Storage for Big Data and AI: Baidu's Solutions

Baidu's Data Lake Storage Acceleration 2.0 replaces traditional HDFS with a scalable object-storage foundation, introducing a scale-adaptive hierarchical namespace, a high-throughput streaming storage engine, RapidFS caching, and fully HDFS-compatible BOS-HDFS APIs, delivering over 70% higher throughput, lower costs, and seamless migration for big-data and AI workloads.

Baidu Tech Salon

This article, based on a presentation at the Baidu Cloud Summit 2024 Cloud-Native Forum, introduces the progress of Baidu Canghai's storage team in accelerating data-lake workloads.

Data lake concepts have existed for more than a decade, and the storage foundation is shifting from traditional HDFS to object storage.

Traditional big‑data compute frameworks (MapReduce, Spark, Hive) are built on HDFS, which suffers from three main drawbacks:

Resource coupling: compute and storage resources are mixed, making capacity planning difficult and leading to waste.

Scale limitation: a single HDFS NameNode can handle up to about 1 billion files, while modern AI training can require tens or hundreds of billions of files.

Operational burden: HDFS requires experienced engineers to maintain petabyte‑scale clusters.

Object storage addresses these issues by providing:

Separation of compute and storage, allowing independent scaling and greater elasticity.

Better scalability, cloud‑native design, low‑maintenance multi‑tier storage, and lower cost.

However, replacing HDFS with object storage introduces new challenges:

Performance degradation due to additional authentication, protocol translation, and increased latency between compute and storage nodes, resulting in lower throughput.

Compatibility with existing HDFS‑based ecosystems (MapReduce, Spark, Hive) because of different access protocols and authentication mechanisms.

To better accelerate upstream big‑data and AI workloads, Baidu released Data Lake Storage Acceleration 2.0, which includes:

Namespace 2.0 – a scale‑adaptive hierarchical namespace that balances scale and performance.

A streaming storage engine optimized for big data, delivering >70% higher single‑stream throughput compared to HDFS.

RapidFS – a managed caching product that provides data caching and write acceleration.

BOS‑HDFS – a new version fully compatible with HDFS APIs, enabling seamless migration of existing workloads.

The presentation then details the evolution of hierarchical namespaces:

First generation: single‑machine, in‑memory directory tree (e.g., classic HDFS), high performance but limited scalability.

Second generation: distributed‑database‑backed (e.g., Facebook Tectonic), linear scalability but higher latency due to RPC and two‑phase commits.

Third generation (Baidu’s solution): a unified single‑machine/distributed architecture built on Baidu’s TafDB metadata store, offering scale‑adaptive performance. It operates as a single‑machine namespace at small scale (microsecond latency) and smoothly transitions to a distributed namespace when the file count reaches the billion‑level threshold.

Key implementation details:

In single‑machine mode, inode and directory metadata are co‑located on the same storage node, eliminating cross‑node transactions.

When the bucket reaches the scale threshold, TafDB automatically shards tables, converting single‑node transactions to cross‑node transactions without affecting the upper‑layer APIs.

File‑semantic operations are pushed down to the distributed database layer, achieving up to 100,000 TPS per bucket.
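The scale-adaptive behavior described above can be sketched as follows. This is a toy model, not TafDB itself: the class name, the hashing scheme, and the (deliberately tiny) threshold are all illustrative. Metadata lives on a single node until the file count crosses the threshold, then is resharded by parent-directory hash, transparently to callers.

```python
# Conceptual sketch (not Baidu's implementation) of a scale-adaptive
# namespace: single-node metadata until a threshold, then transparent
# resharding across nodes keyed by the parent-directory path.
import hashlib

SHARD_THRESHOLD = 4  # tiny for demo; real systems shard at billions of files

class AdaptiveNamespace:
    def __init__(self, num_shards=4):
        self.num_shards = num_shards
        self.sharded = False
        self.nodes = [{} for _ in range(num_shards)]  # only node 0 used pre-shard

    def _node_for(self, parent):
        if not self.sharded:
            return self.nodes[0]  # single-node mode: no cross-node transactions
        h = int(hashlib.md5(parent.encode()).hexdigest(), 16)
        return self.nodes[h % self.num_shards]

    def _total_files(self):
        return sum(len(n) for n in self.nodes)

    def create(self, path):
        parent = path.rsplit("/", 1)[0] or "/"
        self._node_for(parent)[path] = {"parent": parent}
        if not self.sharded and self._total_files() > SHARD_THRESHOLD:
            self._reshard()

    def _reshard(self):
        # Transparent to callers: redistribute entries, flip the flag.
        entries = self.nodes[0]
        self.nodes[0] = {}
        self.sharded = True
        for path, meta in entries.items():
            self._node_for(meta["parent"])[path] = meta

    def exists(self, path):
        parent = path.rsplit("/", 1)[0] or "/"
        return path in self._node_for(parent)

ns = AdaptiveNamespace()
for i in range(10):
    ns.create(f"/data/part-{i}")
print(ns.sharded)                  # True: threshold crossed, now distributed
print(ns.exists("/data/part-7"))   # True: lookups are unchanged for callers
```

The key property, mirrored here, is that the API surface is identical in both modes; only the transaction scope underneath changes.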

Object‑storage backend optimizations for big‑data and AI include:

Larger block sizes with sequential placement to exploit HDD sequential read throughput, achieving ~300 MB/s per stream (70%+ improvement over native HDFS).

RapidFS caching for latency‑sensitive workloads, delivering >3× performance gains and up to 98% GPU utilization in multimodal training.

Accelerated checkpoint persistence (an 80% reduction in time) by writing first to RapidFS, then asynchronously to object storage.
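The checkpoint path above follows a write-back pattern that can be sketched like this. The class name is hypothetical and the plain dicts stand in for RapidFS and BOS; the point is only that the trainer blocks on the fast-tier write while durable persistence happens in the background.

```python
# Minimal write-back sketch (assumed names, not the RapidFS API): the
# caller blocks only on the fast cache write; persistence to object
# storage runs on a background thread.
import queue
import threading

class WriteBackCheckpointer:
    def __init__(self, cache, object_store):
        self.cache = cache                # fast tier (stands in for RapidFS)
        self.object_store = object_store  # durable tier (stands in for BOS)
        self._pending = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def save(self, name, data):
        self.cache[name] = data   # fast synchronous write
        self._pending.put(name)   # durable write is deferred
        # training resumes here without waiting on object storage

    def _flush_loop(self):
        while True:
            name = self._pending.get()
            self.object_store[name] = self.cache[name]  # slow durable write
            self._pending.task_done()

    def drain(self):
        # Block until every queued checkpoint is durable.
        self._pending.join()

cache, store = {}, {}
ckpt = WriteBackCheckpointer(cache, store)
ckpt.save("step-1000", b"model-weights")
ckpt.drain()
print("step-1000" in store)  # True: persisted asynchronously
```

A real implementation would also handle flush failures and bound the amount of un-persisted data; both are omitted here for brevity.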

BOS‑HDFS provides 100% HDFS API compatibility, atomic rename, vectored I/O, append/truncate, and integrates Kerberos + Ranger token authentication, enabling zero‑code‑change migration of Hadoop ecosystems to Baidu Cloud.
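Atomic rename is one reason the hierarchical namespace matters for HDFS compatibility. The contrast can be illustrated with simplified code (not Baidu's implementation): on a flat object store a directory rename touches every key, while a directory-aware namespace retargets a single entry.

```python
# Illustrative contrast (simplified) between rename on a flat object
# store and on a hierarchical namespace such as the one BOS-HDFS exposes.

def rename_flat(store, src_prefix, dst_prefix):
    # Flat key space: "directories" are just shared key prefixes, so a
    # rename is O(number of objects) and not atomic — a concurrent
    # reader can observe the tree half-moved.
    for key in [k for k in store if k.startswith(src_prefix)]:
        store[dst_prefix + key[len(src_prefix):]] = store.pop(key)

def rename_dir(parent, old, new):
    # Hierarchical namespace: one metadata mutation retargets the whole
    # subtree, so the rename is O(1) and atomic.
    parent[new] = parent.pop(old)

flat = {"logs/a": b"1", "logs/b": b"2"}
rename_flat(flat, "logs/", "archive/")
print(sorted(flat))  # ['archive/a', 'archive/b']

root = {"logs": {"a": b"1", "b": b"2"}}
rename_dir(root, "logs", "archive")
print(sorted(root))  # ['archive']
```

This is why Hive and Spark jobs, which rely on atomic rename for commit protocols, need a namespace layer rather than raw object-store keys.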

Use‑case highlights:

Customer H reduced data preprocessing costs by 60% by using BOS as the data‑lake foundation instead of self‑managed HDFS.

During model training, hot data was loaded from PFS while cold data remained in BOS, leveraging the acceleration stack.

Model inference benefited from RapidFS distribution, cutting end‑to‑end latency by over 50%.
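The hot/cold split in the training use case can be sketched as a read-through tier. The names and in-memory dicts here are stand-ins for the fast tier (PFS/RapidFS) and for BOS: a miss falls back to object storage and promotes the object so subsequent reads hit the fast tier.

```python
# Read-through tier sketch (assumed names): hot data served from a fast
# tier, misses fetched from the object store and promoted.

class TieredReader:
    def __init__(self, fast_tier, object_store):
        self.fast = fast_tier     # stands in for PFS/RapidFS
        self.cold = object_store  # stands in for BOS
        self.hits = self.misses = 0

    def read(self, key):
        if key in self.fast:
            self.hits += 1
            return self.fast[key]
        self.misses += 1
        data = self.cold[key]  # slow path: fetch from object storage
        self.fast[key] = data  # promote: the next read is a cache hit
        return data

bos = {"sample-1": b"cold"}
reader = TieredReader({}, bos)
reader.read("sample-1")            # miss: served from BOS, then promoted
reader.read("sample-1")            # hit: served from the fast tier
print(reader.hits, reader.misses)  # 1 1
```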

The presentation concludes with a panoramic view of Baidu Canghai’s support for large‑model scenarios, showing how object storage (BOS), RapidFS, CFS, and PFS form a comprehensive end‑to‑end solution for data processing, model development, training, and inference.

Big Data · AI · Data Lake · Object Storage · BOS-HDFS · RapidFS
Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
