Why Cloud‑Native Data Lakes Are the New Standard for Storage Acceleration
This article analyzes the evolution of data‑lake storage acceleration, compares traditional parallel file systems, object‑storage‑based solutions, and modern cache‑enabled architectures, and explains how cloud‑native data lakes address scalability, cost, and performance challenges for AI and big‑data workloads.
Background and Motivation
The rapid growth of AI and big‑data workloads exposes the limits of traditional on‑premises file systems and HDFS: petabytes of small files, rising storage costs, and training‑speed bottlenecks.
Challenges Faced by Enterprises
Data scale: Small‑file workloads exceed the capacity of self‑built file systems, and HDFS forces massive numbers of small files to be packed into and unpacked from larger archives.
Storage cost: Multimodal data grows from tens of terabytes to several petabytes, making cost a critical factor.
Training speed: As GPU clusters expand, neither self‑built file systems nor HDFS can keep pace, turning storage into the main training bottleneck.
Cloud‑Native Data Lake Architecture
Adopt a cloud‑native data lake that unifies all raw data in object storage and exposes an open, unified interface to downstream compute and applications. This provides:
Near‑infinite scalability: Flat metadata structures in object storage scale horizontally to billions of objects, ideal for AI’s massive small‑file workloads.
Flexible resource elasticity: Compute and storage are decoupled, enabling pay‑as‑you‑go scaling and on‑demand capacity bursts.
Cost‑effective storage: Erasure coding reduces space usage versus multi‑replica schemes, and tiered storage (standard, infrequent, cold, archive) further lowers long‑term costs.
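The space savings from erasure coding can be made concrete with a quick calculation. The (8, 4) scheme below is illustrative; real providers use a range of shard configurations.

```python
# Storage overhead of multi-replica vs. erasure coding.
# The (8 data, 4 parity) scheme is illustrative; real schemes vary by provider.
def replica_overhead(copies: int) -> float:
    """Raw bytes stored per logical byte with n full copies."""
    return float(copies)

def ec_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per logical byte with (k, m) erasure coding."""
    return (data_shards + parity_shards) / data_shards

three_replica = replica_overhead(3)    # 3.0x raw capacity
ec_8_4 = ec_overhead(8, 4)             # 1.5x, still tolerates 4 shard losses
savings = 1 - ec_8_4 / three_replica   # half the raw capacity of 3 replicas
print(three_replica, ec_8_4, f"{savings:.0%}")
```

Under these assumptions, erasure coding stores the same data in half the raw capacity of a three‑replica scheme while surviving as many simultaneous failures.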
Why Storage Acceleration Is Still Needed
Even with object storage, metadata operations (LIST) and high‑frequency small‑file reads (HEAD/READ) can dominate latency. The separation of compute and storage adds network hops, load balancers, and possible bandwidth throttling.
AI Training Workflow and Storage Bottlenecks
A typical training loop reads raw data, shuffles it, feeds batches to GPUs, and periodically writes checkpoints. Two dominant storage interactions are:
Directory traversal (LIST) over large hierarchies.
Repeated small‑file reads (HEAD/READ) for each training sample.
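The two interactions above can be sketched in a few lines. The `ObjectStore` class below is a minimal in‑memory stand‑in (its method names are hypothetical, not a real SDK); what matters is the call pattern: one LIST per epoch setup, then one HEAD plus one READ per sample, per epoch.

```python
import random

# Minimal in-memory stand-in for an object store; list/head/get mirror
# the LIST/HEAD/READ calls a real training job issues (names hypothetical).
class ObjectStore:
    def __init__(self, objects: dict[str, bytes]):
        self._objects = objects

    def list(self, prefix: str) -> list[str]:   # LIST: directory traversal
        return sorted(k for k in self._objects if k.startswith(prefix))

    def head(self, key: str) -> int:            # HEAD: metadata lookup
        return len(self._objects[key])

    def get(self, key: str) -> bytes:           # READ: fetch sample bytes
        return self._objects[key]

store = ObjectStore({f"train/img_{i}.jpg": b"x" * 64 for i in range(1000)})

keys = store.list("train/")        # one LIST over the whole hierarchy
for epoch in range(2):
    random.shuffle(keys)           # global shuffle each epoch
    for key in keys:               # one HEAD + READ per sample, per epoch
        size = store.head(key)
        sample = store.get(key)    # would feed the GPU batch (omitted)
```

With a million samples and multiple epochs, the HEAD/READ pair repeats millions of times, which is why per‑request latency dominates end‑to‑end training throughput.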
Metadata Performance Issues
Because object storage exposes a flat namespace, a directory LIST must scan every key under the prefix (paging through results as it goes), so traversing large sub‑trees takes a long time.
Small‑I/O latency is inflated by HTTP routing through load balancers and the object‑storage backend; erasure coding can amplify read costs.
Bandwidth limits arise from overlay‑to‑underlay network traversal and cross‑region distances, leading to throttling and wasted GPU cycles.
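A toy model illustrates the flat‑namespace LIST problem from the list above. In a flat key space, listing one "directory" means filtering the shared key index by prefix and paging results back (commonly about 1,000 keys per round trip); in a hierarchical tree, the directory node holds its children directly. All numbers here are illustrative.

```python
# Flat namespace vs. hierarchical tree: cost of listing one "directory".
# 100 "directories" x 10,000 "files"; sizes are illustrative only.
flat_keys = [f"data/{d}/{f}" for d in range(100) for f in range(10000)]

# Flat namespace: LIST "data/0/" filters the shared key index by prefix
# and returns results in pages (commonly ~1000 keys per round trip).
matches = [k for k in flat_keys if k.startswith("data/0/")]
pages = -(-len(matches) // 1000)   # ceil division: round-trips needed

# Hierarchical tree: the directory node stores its children directly,
# so listing touches only that one directory's entries, in one step.
tree: dict[str, list[str]] = {}
for k in flat_keys:
    d, f = k.rsplit("/", 1)
    tree.setdefault(d, []).append(f)
children = tree["data/0"]          # direct lookup, no prefix scan
```

Even in this small model, the flat listing costs ten paged round trips where the tree answers from a single directory node; at billions of objects the gap widens accordingly.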
Object‑Storage Acceleration Techniques
Deploy metadata services inside the business VPC overlay network and cache hot metadata in memory to shorten access paths.
Use hierarchical directory trees (native to parallel or cloud‑native file systems) to make LIST, RENAME, DELETE atomic and efficient.
Scale metadata horizontally via distributed key‑value stores (e.g., Redis, TiKV) or sharding across multiple nodes.
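The first technique above, caching hot metadata in memory near compute, can be sketched as a simple TTL cache in front of the remote HEAD path. The `backend_head` function and the 30‑second TTL are illustrative assumptions, not a real product's API.

```python
import time

# Sketch of an in-memory TTL cache for hot metadata (HEAD results),
# placed near compute to avoid round-trips to the object-storage backend.
# backend_head and the TTL value are illustrative assumptions.
class MetadataCache:
    def __init__(self, backend_head, ttl_seconds: float = 30.0):
        self._backend_head = backend_head
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, dict]] = {}

    def head(self, key: str) -> dict:
        now = time.monotonic()
        hit = self._entries.get(key)
        if hit and now - hit[0] < self._ttl:
            return hit[1]                  # fast path: served from memory
        meta = self._backend_head(key)     # slow path: remote HEAD request
        self._entries[key] = (now, meta)
        return meta

calls = []
def backend_head(key):
    calls.append(key)                      # count remote round-trips
    return {"key": key, "size": 64}

cache = MetadataCache(backend_head)
for _ in range(1000):
    cache.head("train/img_0.jpg")          # 1 remote call, 999 memory hits
```

A production metadata service adds invalidation and sharding on top of this idea, but the latency win comes from exactly this hot‑path short‑circuit.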
Evolution of Data Lake Storage Acceleration
Parallel File Systems
Early solutions such as GPFS, Lustre, and BeeGFS provide high‑performance parallel I/O on HDD/SSD but become cost‑prohibitive at petabyte scale.
Parallel FS + Object Storage
Combining parallel file systems with object storage creates a two‑layer architecture: fast hot storage on‑premises and cheap cold storage in the cloud. Data synchronization between the two tiers, however, remains a manual process that is opaque to applications.
Transparent Flow: Object Storage + Cache System
Systems like Alluxio introduce a virtual distributed file system that abstracts heterogeneous storage, enabling near‑compute caching and transparent data movement across clouds.
Cloud‑Native File System + Object Storage
Solutions such as JuiceFS rebuild a hierarchical file system on top of object storage, storing metadata in external engines (Redis, TiKV) and chunking data to enable append writes and read‑while‑write. This improves performance, but the chunked format written to object storage is no longer directly readable by other tools (data intrusiveness), and multi‑tenant environments require running multiple instances.
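The chunking idea can be sketched as a pure address translation: a logical file range maps to a list of fixed‑size chunk pieces, each fetched or written as an independent object‑store request. The 64 MiB chunk size below is illustrative; real systems layer additional block and slice structures on top.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # illustrative; real chunk sizes vary

def to_chunks(offset: int, length: int) -> list[tuple[int, int, int]]:
    """Map a logical file range to (chunk_index, chunk_offset, nbytes)
    pieces, each of which becomes an independent object-store request."""
    pieces = []
    end = offset + length
    while offset < end:
        idx, off = divmod(offset, CHUNK_SIZE)
        nbytes = min(CHUNK_SIZE - off, end - offset)
        pieces.append((idx, off, nbytes))
        offset += nbytes
    return pieces

# A 100 MiB read starting 10 MiB into a file spans two chunks:
pieces = to_chunks(10 * 2**20, 100 * 2**20)
```

Because each chunk is an independent object, appends only touch the tail chunk and concurrent readers can fetch earlier chunks while later ones are still being written, which is what enables the append‑write and read‑while‑write behavior described above.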
File‑Object Fusion + Cache
Some cloud providers embed hierarchical directories and streaming write semantics directly inside the object‑storage layer while still using near‑compute caches. This approach offers full POSIX‑like semantics, high performance, and avoids data‑intrusiveness.
Key Problems Solved by Acceleration
Metadata Acceleration
Placing metadata services close to compute and caching hot metadata reduces LIST latency from tens of milliseconds to sub‑millisecond levels.
Data Read/Write Acceleration
Near‑compute placement, high‑spec NVMe SSDs, RDMA networking, and streaming storage engines dramatically cut read/write latency. Techniques include data chunking, multi‑replica selection, prefetching, and client I/O‑path optimizations such as FUSE tuning, virtiofs, and zero‑copy.
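Of the techniques above, prefetching is the easiest to sketch: while the trainer consumes sample i, a small worker pool is already fetching samples i+1 through i+depth. The `fetch` callable is a stand‑in for the real read path (cache hit or backend READ), and the depth of 4 is an arbitrary illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of read-ahead over a key list: a ring of `depth` in-flight
# fetches stays ahead of the consumer. fetch() stands in for the real
# read path; depth=4 is an arbitrary illustrative value.
def prefetched_reads(keys, fetch, depth: int = 4):
    with ThreadPoolExecutor(max_workers=depth) as pool:
        futures = [pool.submit(fetch, k) for k in keys[:depth]]
        for i, key in enumerate(keys):
            yield futures[i % depth].result()   # consume sample i, in order
            nxt = i + depth
            if nxt < len(keys):                 # refill the ring slot just
                futures[i % depth] = pool.submit(fetch, keys[nxt])  # freed

samples = {f"img_{i}": i for i in range(10)}
results = list(prefetched_reads(list(samples), samples.get))
```

Because fetches overlap with consumption, per‑sample latency is hidden as long as the backend can sustain `depth` concurrent reads; results are still yielded in the original key order.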
End‑to‑End Efficiency
Effective acceleration requires low‑cost, high‑performance access methods (HCFS SDK for big data, POSIX‑compatible mounts for AI), intelligent data‑flow orchestration (inventory import, auto‑load, incremental sync), and bidirectional strong coupling between cache layers and object storage to keep data consistent across pipeline stages.
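The incremental‑sync piece of that orchestration reduces to a diff between two manifests: the object store's current listing (key to ETag) and what the cache layer already holds. The function and manifest shapes below are illustrative, not a specific product's API.

```python
# Sketch of incremental sync planning: diff the object store's listing
# (key -> etag) against the cache layer's manifest, then load only what
# is new or changed and evict what was deleted upstream. Shapes are
# illustrative, not a specific product's API.
def plan_incremental_sync(remote: dict[str, str], manifest: dict[str, str]):
    to_load = [k for k, etag in remote.items() if manifest.get(k) != etag]
    to_evict = [k for k in manifest if k not in remote]
    return to_load, to_evict

remote = {"a.parquet": "v2", "b.parquet": "v1", "c.parquet": "v1"}
manifest = {"a.parquet": "v1", "b.parquet": "v1", "d.parquet": "v1"}
to_load, to_evict = plan_incremental_sync(remote, manifest)
# a.parquet changed, c.parquet is new, d.parquet was removed upstream
```

Running this diff on a schedule (or on change notifications) is what keeps the cache layer and object storage consistent without re‑importing the full inventory each time.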