Why Data Lake Storage Acceleration Is the New Standard in Cloud‑Native AI
This article examines the evolution of data lake storage acceleration, compares representative solutions, and explains how metadata, read/write, and end‑to‑end optimizations enable scalable, cost‑effective AI and big‑data workloads in cloud‑native environments.
1 Data Lake Storage Becomes the De Facto Standard in the Cloud‑Native Era
In the cloud‑native era, the recommended architecture is a data lake that unifies raw data on a single storage foundation and exposes open interfaces to compute engines and applications, preserving a Single Source of Truth while addressing scalability, elasticity, and cost.
Near‑infinite scalability: Object‑storage‑based data lakes use flat metadata structures that scale horizontally to billions of objects, ideal for AI’s massive small‑file workloads.
Flexible resource elasticity: Object storage offers pay‑as‑you‑go scaling and on‑demand capacity, leveraging large resource pools.
Extreme storage cost efficiency: Erasure coding and tiered storage classes (standard, infrequent access, cold, archive) dramatically cut the cost per logical byte compared with full replication; a quick cost comparison follows below.
These advantages apply to both AI and big‑data scenarios, and newer table formats such as Hudi and Iceberg are designed around object‑storage characteristics.
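To make the cost point concrete, here is a minimal sketch comparing the raw‑storage overhead of n‑way replication with a typical erasure‑coding layout. The (k, m) parameters are illustrative, not any specific provider's configuration.

```python
# Storage overhead: replication vs. erasure coding (illustrative parameters).

def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte under n-way replication."""
    return float(replicas)

def erasure_coding_overhead(k: int, m: int) -> float:
    """Raw bytes per logical byte under EC with k data + m parity shards."""
    return (k + m) / k

if __name__ == "__main__":
    print(f"3-replica: {replication_overhead(3):.2f}x raw storage")   # 3.00x
    print(f"EC(8+4):   {erasure_coding_overhead(8, 4):.2f}x raw storage")   # 1.50x
    print(f"EC(12+4):  {erasure_coding_overhead(12, 4):.2f}x raw storage")  # 1.33x
```

Halving or better the raw footprint relative to 3‑replica layouts is the main lever behind the "extreme cost efficiency" claim, before tiering is even applied.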
2 Why Is Additional Data Lake Storage Acceleration Still Needed?
Even with object storage, training speed can suffer due to metadata LIST operations, high‑frequency small‑file HEAD/READ requests, and bandwidth throttling across overlay/underlay networks.
On the object‑storage side, the flat namespace makes LIST scans costly, every HTTP‑based HEAD/READ traverses a long request path, and erasure coding can amplify small‑file read overhead.
Metadata performance: LIST must scan the entire key range under a prefix, increasing traversal time (see the LIST sketch after this list).
Small I/O performance: Each LIST/HEAD/READ passes through load balancers and web services, adding latency and potential read amplification.
Bandwidth limits: Compute often resides in an overlay network, requiring traversal to the underlay storage network, incurring additional latency and throttling.
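As a concrete illustration of the LIST cost, the sketch below walks a "directory" on S3‑compatible object storage with boto3. Because the namespace is flat, listing a prefix means paginating over every key under it, one HTTP round trip per page; the bucket and prefix names are placeholders.

```python
# Sketch: emulating a recursive directory listing on flat object storage.
# Each page is a separate HTTP round trip (up to 1000 keys per page), so a
# prefix holding millions of small files costs thousands of sequential
# LIST calls before training can even enumerate its inputs.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

pages = keys = 0
for page in paginator.paginate(Bucket="training-data", Prefix="imagenet/train/"):
    pages += 1
    keys += len(page.get("Contents", []))

print(f"{keys} objects required {pages} LIST round trips")
```

This is exactly the traversal that a near‑compute hierarchical metadata service (Section 4.1) replaces with in‑memory directory lookups.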
3 The Birth and Development of Data Lake Storage Acceleration
Early high‑performance computing (HPC) workloads moved from NAS to parallel file systems (GPFS, Lustre, BeeGFS) that offered striped, MPI‑I/O‑based parallel reads/writes on HDDs and later SSDs.
3.1 Parallel File Systems
Parallel file systems excelled in performance but incurred high storage costs for data‑intensive workloads.
3.2 Balancing Cost: Parallel File Systems + Object Storage
Two stages emerged: initially, object storage served as cold backup for parallel file systems; later, data lakes shifted the primary storage to object storage with parallel file systems acting as a cache layer, though challenges remained around data coupling and transparent loading.
3.3 Transparent Flow: Object Storage + Cache Systems
Alluxio introduced a virtual distributed file system that abstracts heterogeneous storage, enabling near‑compute caching that reduces latency by an order of magnitude and supports data‑aware scheduling.
Major cloud providers now offer cache‑accelerated products (Amazon File Cache, Baidu RapidFS, Alibaba JindoFS, Tencent GooseFS) that approach parallel file‑system performance.
3.4 Complete File Semantics
Two solution families address metadata and data‑write limitations:
3.4.1 Solution 1: Cloud‑Native File System + Object Storage
Systems like JuiceFS rebuild hierarchical metadata on top of object storage (in Redis or TiKV) and store chunked data in object storage, enabling append‑write and read‑while‑write semantics; the trade‑off is data intrusiveness, since the chunked on‑storage format is no longer directly readable by plain object‑storage clients.
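The pattern can be sketched as follows: hierarchical metadata (directory entries plus chunk lists) lives in a fast KV store, while fixed‑size chunks are written as opaque, immutable objects. This is a toy illustration of the general design, not JuiceFS's actual metadata schema or chunk format; the names and the 4 MiB chunk size are assumptions.

```python
# Toy sketch of the "file system over object storage" pattern:
# metadata in a KV store, data chunks as opaque objects.
import uuid

CHUNK_SIZE = 4 * 1024 * 1024  # assumed 4 MiB chunks

metadata_kv = {}   # stand-in for Redis/TiKV
object_store = {}  # stand-in for the object-storage bucket

def _put_chunks(data: bytes) -> list[str]:
    ids = []
    for off in range(0, len(data), CHUNK_SIZE):
        cid = uuid.uuid4().hex
        object_store[f"chunks/{cid}"] = data[off:off + CHUNK_SIZE]
        ids.append(cid)
    return ids

def write_file(path: str, data: bytes) -> None:
    metadata_kv[path] = {"size": len(data), "chunks": _put_chunks(data)}

def append_file(path: str, data: bytes) -> None:
    # Append touches only metadata plus newly written chunk objects;
    # existing chunks stay immutable. This is how append semantics are
    # layered onto write-once object storage.
    meta = metadata_kv[path]
    meta["chunks"] += _put_chunks(data)
    meta["size"] += len(data)

def read_file(path: str) -> bytes:
    meta = metadata_kv[path]
    return b"".join(object_store[f"chunks/{c}"] for c in meta["chunks"])

write_file("/train/sample.bin", b"x" * (10 * 1024 * 1024))
append_file("/train/sample.bin", b"y" * 1024)
assert len(read_file("/train/sample.bin")) == metadata_kv["/train/sample.bin"]["size"]
```

The data‑intrusiveness trade‑off is visible here: the bucket holds `chunks/<id>` objects, so only clients that understand the chunk mapping can read the file back.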
3.4.2 Solution 2: File‑Object Fusion + Cache System
Some providers implement hierarchical metadata services directly within object storage and support streaming append writes, achieving full file semantics without sacrificing object‑storage benefits.
4 Key Problems Solved by Data Lake Storage Acceleration
4.1 Metadata Acceleration
Metadata operations dominate latency in AI and big‑data workloads with many small files. Near‑compute hierarchical metadata services, in‑memory caching, and hierarchical directory trees dramatically reduce latency from tens of milliseconds to sub‑millisecond levels.
Deploy metadata services within the business VPC overlay to shorten paths.
Use hierarchical directory trees to avoid costly flat‑directory LIST/RENAME operations (see the rename sketch after this list).
Scale metadata horizontally (e.g., Redis for ~100 M entries, distributed KV for larger scales) or vertically via tiered caching.
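A hedged sketch of why the hierarchical tree matters: on a flat namespace, renaming a "directory" means rewriting every object key under the prefix, while a directory tree makes it a single pointer update. The in‑memory structures below are illustrative stand‑ins for a real metadata service.

```python
# Sketch: rename cost on a flat namespace vs. a hierarchical tree.

# Flat namespace: "directories" are just key prefixes, so a rename must
# rewrite every key under the prefix -- O(number of objects).
flat = {f"data/v1/part-{i}": b"" for i in range(100_000)}

def flat_rename(store: dict, old: str, new: str) -> None:
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store.pop(key)

# Hierarchical tree: a directory is one node; rename re-links it -- O(1).
tree = {"data": {"v1": {f"part-{i}": b"" for i in range(100_000)}}}

def tree_rename(root: dict, parent: str, old: str, new: str) -> None:
    root[parent][new] = root[parent].pop(old)

flat_rename(flat, "data/v1/", "data/v2/")  # 100,000 key rewrites
tree_rename(tree, "data", "v1", "v2")      # single pointer move
```

The same asymmetry applies to LIST: the tree answers a directory listing from one node's children instead of scanning the whole key range.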
4.2 Data Read/Write Acceleration
Accelerators place high‑spec NVMe SSDs and RDMA networks near compute, employ data chunking and striping, support multi‑replica reads, and shorten I/O paths with techniques such as virtio‑fs and zero‑copy to reduce latency and bandwidth consumption.
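A minimal sketch of the striping idea: a large object is split into fixed‑size stripes spread round‑robin across cache nodes, and the client reads all stripes concurrently so aggregate bandwidth scales with node count. The node names, stripe size, and `fetch_stripe` RPC are hypothetical; a real accelerator would back this with RDMA and NVMe services.

```python
# Sketch: parallel read of a file striped across cache nodes.
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 8 * 1024 * 1024  # assumed 8 MiB stripes
CACHE_NODES = ["cache-0", "cache-1", "cache-2", "cache-3"]

def fetch_stripe(node: str, path: str, index: int) -> bytes:
    """Placeholder for a network read of stripe `index` from `node`."""
    return b"\0" * STRIPE_SIZE

def read_striped(path: str, size: int) -> bytes:
    n_stripes = (size + STRIPE_SIZE - 1) // STRIPE_SIZE
    # Round-robin placement: stripe i lives on node i % len(CACHE_NODES).
    with ThreadPoolExecutor(max_workers=len(CACHE_NODES)) as pool:
        futures = [
            pool.submit(fetch_stripe, CACHE_NODES[i % len(CACHE_NODES)], path, i)
            for i in range(n_stripes)
        ]
        return b"".join(f.result() for f in futures)[:size]

data = read_striped("/datasets/shard-0000.tar", 64 * 1024 * 1024)
```

Multi‑replica reads extend the same idea: if a stripe exists on several nodes, the client can pick the least‑loaded replica instead of a fixed placement.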
4.3 End‑to‑End Efficiency
Accelerators provide low‑cost, POSIX‑compatible access for AI and big‑data frameworks, integrate with object‑storage inventories for automated data loading, and enable pipeline‑driven data scheduling to keep hot data near compute while transparently spilling cold data to object storage.
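One way to picture the pipeline‑driven scheduling: read an object‑storage inventory listing, select the hot prefixes for the next training job under a cache‑capacity budget, and prefetch them into the near‑compute cache before the job starts. The CSV inventory layout and the `prefetch` call are assumptions for illustration, not a specific product's API.

```python
# Sketch: inventory-driven cache warming ahead of a training job.
# Assumes a CSV inventory of (key, size, last_access) rows -- real
# inventory formats vary by provider.
import csv

def load_inventory(path: str):
    with open(path, newline="") as f:
        for key, size, last_access in csv.reader(f):
            yield key, int(size), last_access

def plan_warmup(inventory_path: str, hot_prefix: str, budget_bytes: int):
    """Select keys under the hot prefix until the cache budget is spent."""
    plan, used = [], 0
    for key, size, _ in load_inventory(inventory_path):
        if key.startswith(hot_prefix) and used + size <= budget_bytes:
            plan.append(key)
            used += size
    return plan

def prefetch(keys):
    for key in keys:
        # Placeholder: real accelerators expose a bulk-load or warmup RPC.
        print(f"warming cache: {key}")

prefetch(plan_warmup("inventory.csv", "imagenet/train/", 2 * 1024**4))
```

Cold data simply never enters the plan, so it stays in cheap object-storage tiers until a future pipeline run promotes it.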