Why Data Lake Storage Acceleration Is the New Standard in Cloud‑Native AI
This article examines the evolution of data lake storage acceleration, compares representative solutions, and explains how metadata, read/write, and end‑to‑end optimizations enable scalable, cost‑effective AI and big‑data workloads in cloud‑native environments.
1 Data Lake Storage Becomes the De Facto Standard in the Cloud‑Native Era
In the cloud‑native era, the recommended architecture is a data lake that unifies raw data on a single storage foundation and exposes open interfaces to compute engines and applications, preserving a Single Source of Truth while addressing scalability, elasticity, and cost.
Near‑infinite scalability: Object‑storage‑based data lakes use flat metadata structures that scale horizontally to billions of objects, ideal for AI’s massive small‑file workloads.
Flexible resource elasticity: Object storage offers pay‑as‑you‑go scaling and on‑demand capacity, leveraging large resource pools.
Extreme storage cost efficiency: Erasure coding and tiered storage classes (standard, infrequent access, cold, archive) dramatically cut the cost per logical byte compared with full replication; a quick cost comparison follows below.
These advantages apply to both AI and big‑data scenarios, and newer table formats such as Hudi and Iceberg are designed around object‑storage characteristics.
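To make the cost point concrete, here is a minimal sketch comparing the raw‑storage overhead of n‑way replication with a typical erasure‑coding layout. The (k, m) parameters are illustrative, not any specific provider's configuration.

```python
# Storage overhead: replication vs. erasure coding (illustrative parameters).

def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte under n-way replication."""
    return float(replicas)

def erasure_coding_overhead(k: int, m: int) -> float:
    """Raw bytes per logical byte under EC with k data + m parity shards."""
    return (k + m) / k

if __name__ == "__main__":
    print(f"3-replica: {replication_overhead(3):.2f}x raw storage")   # 3.00x
    print(f"EC(8+4):   {erasure_coding_overhead(8, 4):.2f}x raw storage")   # 1.50x
    print(f"EC(12+4):  {erasure_coding_overhead(12, 4):.2f}x raw storage")  # 1.33x
```

Halving or better the raw footprint relative to 3‑replica layouts is the main lever behind the "extreme cost efficiency" claim, before tiering is even applied.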
2 Why Is Additional Data Lake Storage Acceleration Still Needed?
Even with object storage, training speed can suffer due to metadata LIST operations, high‑frequency small‑file HEAD/READ requests, and bandwidth throttling across overlay/underlay networks.
On the object‑storage side, the flat namespace makes LIST scans costly, every HTTP‑based HEAD/READ traverses a long request path, and erasure coding can amplify small‑file read overhead.
Metadata performance: LIST must scan the entire key range under a prefix, increasing traversal time (see the LIST sketch after this list).
Small I/O performance: Each LIST/HEAD/READ passes through load balancers and web services, adding latency and potential read amplification.
Bandwidth limits: Compute often resides in an overlay network, requiring traversal to the underlay storage network, incurring additional latency and throttling.
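As a concrete illustration of the LIST cost, the sketch below walks a "directory" on S3‑compatible object storage with boto3. Because the namespace is flat, listing a prefix means paginating over every key under it, one HTTP round trip per page; the bucket and prefix names are placeholders.

```python
# Sketch: emulating a recursive directory listing on flat object storage.
# Each page is a separate HTTP round trip (up to 1000 keys per page), so a
# prefix holding millions of small files costs thousands of sequential
# LIST calls before training can even enumerate its inputs.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

pages = keys = 0
for page in paginator.paginate(Bucket="training-data", Prefix="imagenet/train/"):
    pages += 1
    keys += len(page.get("Contents", []))

print(f"{keys} objects required {pages} LIST round trips")
```

This is exactly the traversal that a near‑compute hierarchical metadata service (Section 4.1) replaces with in‑memory directory lookups.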
3 The Birth and Development of Data Lake Storage Acceleration
Early high‑performance computing (HPC) workloads moved from NAS to parallel file systems (GPFS, Lustre, BeeGFS) that offered striped, MPI‑I/O‑based parallel reads/writes on HDDs and later SSDs.
3.1 Parallel File Systems
Parallel file systems excelled in performance but incurred high storage costs for data‑intensive workloads.
3.2 Balancing Cost: Parallel File Systems + Object Storage
Two stages emerged: initially, object storage served as cold backup for parallel file systems; later, data lakes shifted the primary storage to object storage with parallel file systems acting as a cache layer, though challenges remained around data coupling and transparent loading.
3.3 Transparent Flow: Object Storage + Cache Systems
Alluxio introduced a virtual distributed file system that abstracts heterogeneous storage, enabling near‑compute caching that reduces latency by an order of magnitude and supports data‑aware scheduling.
Major cloud providers now offer cache‑accelerated products (Amazon File Cache, Baidu RapidFS, Alibaba JindoFS, Tencent GooseFS) that approach parallel file‑system performance.
3.4 Complete File Semantics
Two solution families address metadata and data‑write limitations:
3.4.1 Solution 1: Cloud‑Native File System + Object Storage
Systems like JuiceFS rebuild hierarchical metadata on top of object storage (in Redis or TiKV) and store chunked data in object storage, enabling append‑write and read‑while‑write semantics; the trade‑off is data intrusiveness, since the chunked on‑storage format is no longer directly readable by plain object‑storage clients.
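The pattern can be sketched as follows: hierarchical metadata (directory entries plus chunk lists) lives in a fast KV store, while fixed‑size chunks are written as opaque, immutable objects. This is a toy illustration of the general design, not JuiceFS's actual metadata schema or chunk format; the names and the 4 MiB chunk size are assumptions.

```python
# Toy sketch of the "file system over object storage" pattern:
# metadata in a KV store, data chunks as opaque objects.
import uuid

CHUNK_SIZE = 4 * 1024 * 1024  # assumed 4 MiB chunks

metadata_kv = {}   # stand-in for Redis/TiKV
object_store = {}  # stand-in for the object-storage bucket

def _put_chunks(data: bytes) -> list[str]:
    ids = []
    for off in range(0, len(data), CHUNK_SIZE):
        cid = uuid.uuid4().hex
        object_store[f"chunks/{cid}"] = data[off:off + CHUNK_SIZE]
        ids.append(cid)
    return ids

def write_file(path: str, data: bytes) -> None:
    metadata_kv[path] = {"size": len(data), "chunks": _put_chunks(data)}

def append_file(path: str, data: bytes) -> None:
    # Append touches only metadata plus newly written chunk objects;
    # existing chunks stay immutable. This is how append semantics are
    # layered onto write-once object storage.
    meta = metadata_kv[path]
    meta["chunks"] += _put_chunks(data)
    meta["size"] += len(data)

def read_file(path: str) -> bytes:
    meta = metadata_kv[path]
    return b"".join(object_store[f"chunks/{c}"] for c in meta["chunks"])

write_file("/train/sample.bin", b"x" * (10 * 1024 * 1024))
append_file("/train/sample.bin", b"y" * 1024)
assert len(read_file("/train/sample.bin")) == metadata_kv["/train/sample.bin"]["size"]
```

The data‑intrusiveness trade‑off is visible here: the bucket holds `chunks/<id>` objects, so only clients that understand the chunk mapping can read the file back.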
3.4.2 Solution 2: File‑Object Fusion + Cache System
Some providers implement hierarchical metadata services directly within object storage and support streaming append writes, achieving full file semantics without sacrificing object‑storage benefits.
4 Key Problems Solved by Data Lake Storage Acceleration
4.1 Metadata Acceleration
Metadata operations dominate latency in AI and big‑data workloads with many small files. Near‑compute hierarchical metadata services, in‑memory caching, and hierarchical directory trees dramatically reduce latency from tens of milliseconds to sub‑millisecond levels.
Deploy metadata services within the business VPC overlay to shorten paths.
Use hierarchical directory trees to avoid costly flat‑directory LIST/RENAME operations (see the rename sketch after this list).
Scale metadata horizontally (e.g., Redis for ~100 M entries, distributed KV for larger scales) or vertically via tiered caching.
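A hedged sketch of why the hierarchical tree matters: on a flat namespace, renaming a "directory" means rewriting every object key under the prefix, while a directory tree makes it a single pointer update. The in‑memory structures below are illustrative stand‑ins for a real metadata service.

```python
# Sketch: rename cost on a flat namespace vs. a hierarchical tree.

# Flat namespace: "directories" are just key prefixes, so a rename must
# rewrite every key under the prefix -- O(number of objects).
flat = {f"data/v1/part-{i}": b"" for i in range(100_000)}

def flat_rename(store: dict, old: str, new: str) -> None:
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store.pop(key)

# Hierarchical tree: a directory is one node; rename re-links it -- O(1).
tree = {"data": {"v1": {f"part-{i}": b"" for i in range(100_000)}}}

def tree_rename(root: dict, parent: str, old: str, new: str) -> None:
    root[parent][new] = root[parent].pop(old)

flat_rename(flat, "data/v1/", "data/v2/")  # 100,000 key rewrites
tree_rename(tree, "data", "v1", "v2")      # single pointer move
```

The same asymmetry applies to LIST: the tree answers a directory listing from one node's children instead of scanning the whole key range.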
4.2 Data Read/Write Acceleration
Accelerators place high‑spec NVMe SSDs and RDMA networks near compute, employ data chunking and striping, support multi‑replica reads, and shorten I/O paths with techniques such as virtio‑fs and zero‑copy to reduce latency and bandwidth consumption.
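A minimal sketch of the striping idea: a large object is split into fixed‑size stripes spread round‑robin across cache nodes, and the client reads all stripes concurrently so aggregate bandwidth scales with node count. The node names, stripe size, and `fetch_stripe` RPC are hypothetical; a real accelerator would back this with RDMA and NVMe services.

```python
# Sketch: parallel read of a file striped across cache nodes.
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 8 * 1024 * 1024  # assumed 8 MiB stripes
CACHE_NODES = ["cache-0", "cache-1", "cache-2", "cache-3"]

def fetch_stripe(node: str, path: str, index: int) -> bytes:
    """Placeholder for a network read of stripe `index` from `node`."""
    return b"\0" * STRIPE_SIZE

def read_striped(path: str, size: int) -> bytes:
    n_stripes = (size + STRIPE_SIZE - 1) // STRIPE_SIZE
    # Round-robin placement: stripe i lives on node i % len(CACHE_NODES).
    with ThreadPoolExecutor(max_workers=len(CACHE_NODES)) as pool:
        futures = [
            pool.submit(fetch_stripe, CACHE_NODES[i % len(CACHE_NODES)], path, i)
            for i in range(n_stripes)
        ]
        return b"".join(f.result() for f in futures)[:size]

data = read_striped("/datasets/shard-0000.tar", 64 * 1024 * 1024)
```

Multi‑replica reads extend the same idea: if a stripe exists on several nodes, the client can pick the least‑loaded replica instead of a fixed placement.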
4.3 End‑to‑End Efficiency
Accelerators provide low‑cost, POSIX‑compatible access for AI and big‑data frameworks, integrate with object‑storage inventories for automated data loading, and enable pipeline‑driven data scheduling to keep hot data near compute while transparently spilling cold data to object storage.
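One way to picture the pipeline‑driven scheduling: read an object‑storage inventory listing, select the hot prefixes for the next training job under a cache‑capacity budget, and prefetch them into the near‑compute cache before the job starts. The CSV inventory layout and the `prefetch` call are assumptions for illustration, not a specific product's API.

```python
# Sketch: inventory-driven cache warming ahead of a training job.
# Assumes a CSV inventory of (key, size, last_access) rows -- real
# inventory formats vary by provider.
import csv

def load_inventory(path: str):
    with open(path, newline="") as f:
        for key, size, last_access in csv.reader(f):
            yield key, int(size), last_access

def plan_warmup(inventory_path: str, hot_prefix: str, budget_bytes: int):
    """Select keys under the hot prefix until the cache budget is spent."""
    plan, used = [], 0
    for key, size, _ in load_inventory(inventory_path):
        if key.startswith(hot_prefix) and used + size <= budget_bytes:
            plan.append(key)
            used += size
    return plan

def prefetch(keys):
    for key in keys:
        # Placeholder: real accelerators expose a bulk-load or warmup RPC.
        print(f"warming cache: {key}")

prefetch(plan_warmup("inventory.csv", "imagenet/train/", 2 * 1024**4))
```

Cold data simply never enters the plan, so it stays in cheap object-storage tiers until a future pipeline run promotes it.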