Why Storage Systems Bottleneck AI Training and How to Accelerate Them

This article examines the challenges AI applications face along the path from storage to compute. It traces the evolution of AI training infrastructure, analyzes four key problem areas (compute acceleration, resource scheduling, massive data handling, and data flow), and presents Baidu Cloud's storage acceleration solution, which combines parallel file systems, caching, and the Fluid scheduler to improve AI training performance.

Baidu Intelligent Cloud Tech Hub

Oct 19, 2022

1. Opening

Today's talk has three parts: a brief history of enterprise AI training infrastructure, an analysis of storage-related problems, and Baidu Cloud's end-to-end storage acceleration solution.

2. Storage Issues in AI Training

Enterprise AI training infrastructure has evolved through four stages:

Stage 1: Small models and datasets, single‑machine training using local memory and disks.

Stage 2: Larger models require multi‑machine training; commercial network storage is adopted.

Stage 3: At massive scale, dedicated training platforms emerge and storage needs split in two: high-capacity, low-cost storage for bulk data and high-performance storage for hot data. This prompts teams to build their own storage stacks or adopt open-source ones.

Stage 4: Cloud era. The “big-capacity + high-performance” storage pairing remains, but the way data flows between the two tiers changes dramatically.

In cloud-native AI training, the bottom layer is data-lake storage that offers high capacity, high throughput, low cost, and high reliability. An acceleration layer sits above the data lake to meet AI's high-performance needs, while training platforms and AI frameworks sit close to compute.

3. Key Bottlenecks

3.1 Compute Acceleration

A typical training epoch involves three storage-facing steps: shuffle (enumerating the dataset and randomizing the sample order), batch reads, and checkpoint writes. Shuffle and batch reads are read-heavy; a checkpoint is a sequential write that usually has minor impact. The goal is to maximize the proportion of time spent on actual computation.

GPU training can pipeline reads: while the GPU computes batch N, the CPU prefetches batch N+1. If reading a batch takes less time than computing on one, the storage wait is hidden behind compute and becomes negligible.
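
The overlap described above can be sketched with a background reader thread and a bounded queue. This is a minimal illustration, not how any particular framework implements its data loader; `prefetch`, `step_time`, and the `depth` parameter are hypothetical names chosen for this example.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, TypeVar

K = TypeVar("K")
B = TypeVar("B")

_DONE = object()  # sentinel marking the end of the stream


def prefetch(keys: Iterable[K], read_batch: Callable[[K], B], depth: int = 2) -> Iterator[B]:
    """Yield read_batch(key) in order, reading up to `depth` batches ahead."""
    buf: queue.Queue = queue.Queue(maxsize=depth)

    def producer() -> None:
        try:
            for key in keys:
                buf.put(read_batch(key))  # blocks while the buffer is full
        except Exception as exc:  # forward read errors to the consumer
            buf.put(exc)
        finally:
            buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item  # the caller computes on this batch while the next is read


def step_time(read_s: float, compute_s: float, pipelined: bool) -> float:
    """Idealized per-step time: reads hide behind compute only when pipelined."""
    return max(read_s, compute_s) if pipelined else read_s + compute_s
```

In the idealized `step_time` model, a step costs the sum of read and compute time without pipelining and only the larger of the two with it, which is why storage only needs to be fast enough to keep reads shorter than compute.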

Metadata operations (open, stat, close) dominate when a dataset consists of millions of small files (e.g., ImageNet). Optimizations include maintaining a file list so the dataset does not have to be re-enumerated, or using packed formats such as TFRecord or HDF5, which turn metadata-heavy workloads into data-heavy ones.
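
To make the packing idea concrete, here is a minimal sketch that concatenates small files into one pack file with a JSON index of offsets. It is a toy stand-in for formats like TFRecord or HDF5, not their actual layout; `pack_files`, `PackReader`, and the `.idx` suffix are hypothetical names for this example, and the pack file is assumed to live outside the source directory.

```python
import json
import os
from typing import Dict, List


def pack_files(src_dir: str, pack_path: str) -> None:
    """Concatenate every regular file in src_dir into one pack file plus an index."""
    index: Dict[str, List[int]] = {}
    with open(pack_path, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                data = f.read()
            index[name] = [out.tell(), len(data)]  # offset and length in the pack
            out.write(data)
    with open(pack_path + ".idx", "w", encoding="utf-8") as f:
        json.dump(index, f)


class PackReader:
    """Serve samples from the pack with one open() for the whole dataset."""

    def __init__(self, pack_path: str) -> None:
        with open(pack_path + ".idx", encoding="utf-8") as f:
            self.index = json.load(f)  # doubles as the dataset's file list
        self._fh = open(pack_path, "rb")

    def read(self, name: str) -> bytes:
        offset, length = self.index[name]
        self._fh.seek(offset)
        return self._fh.read(length)

    def close(self) -> None:
        self._fh.close()
```

After packing, each sample costs a seek and a read on an already open file instead of an open/stat/close cycle against the metadata service, and the index replaces directory listing during shuffle.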

3.2 Resource Scheduling

Training platforms aim for high GPU utilization (>60%). When data preparation and compute are scheduled inefficiently, GPUs sit idle waiting for data. Overlapping data preparation with training (pipeline scheduling) can raise utilization from ~46% to ~62%.
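
A small model shows why overlapping helps. The sketch below compares GPU utilization when each job holds the GPU during its own data preparation against a pipelined schedule where the next job's preparation runs while the current job trains. The job durations are made up for illustration and are not the workload behind the 46% and 62% figures; the function names are hypothetical.

```python
from typing import List, Tuple

Job = Tuple[float, float]  # (data preparation time, training time), same unit


def serial_utilization(jobs: List[Job]) -> float:
    """GPU is reserved for each job during both preparation and training."""
    if not jobs:
        raise ValueError("need at least one job")
    busy = sum(train for _, train in jobs)
    total = sum(prep + train for prep, train in jobs)
    return busy / total


def pipelined_utilization(jobs: List[Job]) -> float:
    """Preparation runs back to back off the GPU and overlaps with training."""
    if not jobs:
        raise ValueError("need at least one job")
    prep_done = 0.0  # time at which the current job's data is ready
    gpu_free = 0.0   # time at which the GPU finishes the previous job
    busy = 0.0
    for prep, train in jobs:
        prep_done += prep
        start = max(prep_done, gpu_free)  # need both data and a free GPU
        gpu_free = start + train
        busy += train
    return busy / gpu_free


if __name__ == "__main__":
    jobs = [(30, 60), (40, 40), (20, 50)]  # hypothetical durations in minutes
    print(f"serial:    {serial_utilization(jobs):.1%}")
    print(f"pipelined: {pipelined_utilization(jobs):.1%}")
```

In this toy schedule only the first job's preparation remains exposed; every later preparation finishes while the GPU is still busy with the previous job.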

The open-source Fluid framework extracts data preparation as a separate step, allocates resources for it via Kubernetes, warms up (preloads) metadata and data, and can optionally enforce node affinity between cache and compute for better performance and fault tolerance.
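
As a rough sketch of what declaring a dataset and a warm-up job might look like, the snippet below builds two Kubernetes manifests as Python dictionaries and prints them as JSON. The field names follow Fluid's `data.fluid.io/v1alpha1` `Dataset` and `DataLoad` resources as best understood here and should be checked against the Fluid documentation for the version in use. The dataset name and the `s3://example-bucket/imagenet` mount point are placeholders, and the cache runtime that a `Dataset` also needs is omitted.

```python
import json
from typing import Any, Dict, List


def fluid_manifests(name: str, mount_point: str, namespace: str = "default") -> List[Dict[str, Any]]:
    """Build a Dataset and a DataLoad (warm-up) manifest; field names are assumptions."""
    dataset = {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "Dataset",
        "metadata": {"name": name, "namespace": namespace},
        # Where the source data lives; the URI below is a placeholder.
        "spec": {"mounts": [{"name": name, "mountPoint": mount_point}]},
    }
    dataload = {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "DataLoad",
        "metadata": {"name": f"{name}-warmup", "namespace": namespace},
        "spec": {
            "dataset": {"name": name, "namespace": namespace},
            "loadMetadata": True,  # warm up metadata before data
            "target": [{"path": "/", "replicas": 1}],  # preload the whole dataset
        },
    }
    return [dataset, dataload]


if __name__ == "__main__":
    for manifest in fluid_manifests("imagenet", "s3://example-bucket/imagenet"):
        print(json.dumps(manifest, indent=2))
```

Expressing the warm-up as its own resource is what lets the scheduler run it ahead of the training job instead of inside it.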

3.3 Massive Data

Beyond training, most AI‑related workflows need high‑throughput, shared, reliable storage—essentially a data‑lake. Object storage (e.g., Baidu BOS) and HDFS are common candidates; object storage offers lower cost, better scalability, and higher availability.

3.4 Data Flow

In cloud‑native architectures, data originates from the data lake and flows to a fast acceleration layer before training, reversing the traditional “high‑performance‑to‑cold‑storage” direction. Simple cp/rsync approaches are slow, inflexible, and hard to integrate with schedulers.

Embedding data‑sync capabilities in the acceleration layer (parallel file system or RapidFS) solves speed, policy, and integration issues.
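
To show what an incremental sync does differently from a plain cp, here is a minimal sketch that copies only new or changed files, in parallel. It uses two local directory trees as stand-ins for the data lake and the acceleration layer; it is not how Bucket Link, PFS, or RapidFS work internally, and `sync_tree` is a hypothetical helper.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from typing import List


def _needs_copy(src: str, dst: str) -> bool:
    """A file is stale if it is missing, differs in size, or is older than the source."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) > int(d.st_mtime)


def sync_tree(src_root: str, dst_root: str, workers: int = 8) -> List[str]:
    """Copy new or changed files from src_root to dst_root; return the paths copied."""
    todo: List[str] = []
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            if _needs_copy(src, os.path.join(dst_root, rel)):
                todo.append(rel)

    def copy_one(rel: str) -> str:
        dst = os.path.join(dst_root, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(os.path.join(src_root, rel), dst)  # copy2 preserves mtime
        return rel

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(copy_one, todo))
```

Because the function returns what it copied and skips unchanged files, a scheduler could call it repeatedly as a cheap warm-up step, which is the kind of integration that ad hoc cp or rsync scripts make awkward.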

4. Baidu Canghai Storage Acceleration Solution

The solution combines:

Data lake: Baidu BOS with tiered storage and intelligent lifecycle.

Acceleration layer: Parallel File System (PFS) and RapidFS deployed on bare‑metal or VMs, supporting Bucket Link for automatic data lake sync and Fluid for scheduling.

Key benefits:

Massive data handled by BOS.

Data flow accelerated via Bucket Link.

Resource scheduling unified through Fluid.

Compute acceleration achieved by high‑performance hardware close to compute.

Benchmarks show that training on PFS or RapidFS with data warmed up in advance yields near-100% GPU utilization, while training directly against object storage suffers from low utilization.

Q&A

Why does object storage enable compute‑storage separation while HDFS does not?

Compared with HDFS, object storage offers better scalability, lower cost, and multi-region reliability, which makes it the preferred choice for compute-storage separation.

What factors dominate storage system selection?

Start from the workload's access pattern (throughput-heavy large files versus metadata-heavy small files) and choose hardware and software accordingly.
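
One way to get a first read on the access pattern is to profile the file-size distribution of a dataset. The sketch below is a rough starting point only; `profile_dataset` is a hypothetical helper and the 128 KiB cutoff for a "small" file is an arbitrary assumption, not a recommendation from the talk.

```python
import os
from statistics import median
from typing import Dict, Union


def profile_dataset(root: str, small_file_bytes: int = 128 * 1024) -> Dict[str, Union[int, float]]:
    """Summarize file count and sizes under root to hint at the dominant cost."""
    sizes = [
        os.path.getsize(os.path.join(dirpath, name))
        for dirpath, _dirs, files in os.walk(root)
        for name in files
    ]
    if not sizes:
        raise ValueError(f"no files found under {root!r}")
    small = sum(1 for size in sizes if size < small_file_bytes)
    return {
        "files": len(sizes),
        "total_bytes": sum(sizes),
        "median_bytes": median(sizes),
        # A ratio near 1.0 suggests a metadata-heavy workload;
        # a ratio near 0.0 suggests a throughput-heavy one.
        "small_file_ratio": small / len(sizes),
    }
```

A high small-file ratio points toward metadata performance and packing, while a few very large files point toward raw throughput.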

How do you bridge on-premises storage and public-cloud storage?

Use object-storage SDKs, FUSE, or tools like rsync; for massive migrations, Baidu offers “Moonlight Box” hardware for offline transfer.

What are the differences between Ceph and HDFS?

Both are software‑defined storage, but HDFS targets big‑data workloads with a subset of POSIX, while Ceph provides full POSIX, block, and object interfaces.

Can storage know which files to pre‑fetch during AI training?

Only if the framework informs the storage; emerging solutions expose non‑standard interfaces for proactive data placement.

Is RapidFS POSIX‑compatible?

RapidFS offers HDFS‑compatible interfaces in two modes: a cache‑only mode (similar to Alluxio) and a cloud‑native file system mode (similar to JuiceFS).

Does RDMA outperform high‑performance SSDs?

They are complementary rather than substitutes for each other; what matters is end-to-end latency across both the network and the storage media.

Should write‑intensive workloads use PFS?

Yes, PFS provides full POSIX support suitable for random writes, whereas RapidFS is optimized for read‑heavy caching.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
