Storage Acceleration Solutions for Large AI Model Workflows

To meet the massive data volumes and the high‑throughput, low‑latency demands of large‑model training and inference, the talk proposes a unified data lake built on scalable object storage, combined with an acceleration layer that is either a parallel file system or the cloud‑native RapidFS cache. Reported results include a severalfold training speedup, faster checkpoint uploads, and linear inference scaling.

Baidu Geek Talk

Aug 7, 2023

This talk introduces the storage challenges that arise across the full lifecycle of large AI models and presents the design and practical use of storage acceleration solutions.

Challenges: Large models generate massive data volumes, require long training times, and need continuous data updates. The workflow can be divided into four stages: massive data storage and processing, model development, model training, and model inference. Each stage imposes distinct storage requirements such as high throughput, low latency, POSIX compatibility, and efficient checkpoint handling.

Solution Overview: A unified "Data Lake + Acceleration Layer" architecture is proposed. The data lake is built on object storage (BOS) to provide scalable, cost‑effective storage, while the acceleration layer (RapidFS or Parallel File System) offers high‑performance access for training and inference.

Object Storage vs. HDFS: Object storage uses a flat namespace with distributed metadata and scales out horizontally more easily than HDFS, whose metadata management is limited and whose cost is higher for small‑file workloads. Object storage is therefore recommended as the primary data‑lake backend.

Acceleration Layer Options:

Parallel File System (PFS): deployed on dedicated high‑performance hardware for extreme I/O performance.

RapidFS: a cloud‑native cache that provides a cost‑effective performance boost by placing storage close to compute.
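The caching idea behind an acceleration layer can be illustrated with a minimal read‑through cache. This is only a sketch under stated assumptions, not RapidFS's actual design: the class name ReadThroughCache, the LRU eviction policy, and the fake_object_store_get loader are all hypothetical stand‑ins, and a real cache would hold data on memory or NVMe across many nodes.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Hypothetical LRU read-through cache in front of a slower backing store."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int) -> None:
        self._fetch = fetch        # assumed loader, e.g. an object-storage GET
        self._capacity = capacity  # maximum number of cached objects
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        data = self._fetch(key)           # slow path: read from the data lake
        self._items[key] = data
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        return data


backend_reads = []


def fake_object_store_get(key: str) -> bytes:
    """Stand-in for an object-storage read; records every backend access."""
    backend_reads.append(key)
    return f"payload:{key}".encode()


cache = ReadThroughCache(fake_object_store_get, capacity=2)
for epoch in range(3):                    # three passes over the same samples
    for key in ("sample-0", "sample-1"):
        cache.read(key)
```

In this toy run only the first pass reaches the backing store; later passes are served from the cache, which is the effect the acceleration layer aims for when training repeatedly reads the same dataset.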

Key Scenarios:

Data Shuffle & Read: Small‑file shuffle is metadata‑intensive; using object‑storage‑based data lakes with caching reduces latency and improves QPS.

Checkpoint Writing: Large checkpoint files require high throughput; writing directly to the acceleration layer (memory or NVMe SSD) and streaming to object storage asynchronously cuts checkpoint time dramatically.

Model Inference Deployment: High‑concurrency model serving benefits from pre‑loading model files into cache via event‑driven data‑lake integration, eliminating repeated object‑storage reads.
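The checkpoint scenario above, writing to a fast tier first and streaming to object storage in the background, can be sketched as follows. This is an illustrative sketch only: AsyncCheckpointer is a hypothetical name, and two temporary directories stand in for the acceleration layer and the object‑storage data lake, so the "upload" is just a file copy.

```python
import queue
import shutil
import tempfile
import threading
from pathlib import Path


class AsyncCheckpointer:
    """Hypothetical sketch: save to a fast tier, upload to the lake in the background."""

    def __init__(self, fast_dir: Path, lake_dir: Path) -> None:
        self._fast_dir = fast_dir   # stand-in for memory / NVMe SSD tier
        self._lake_dir = lake_dir   # stand-in for object storage
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def save(self, name: str, payload: bytes) -> Path:
        """Write to the fast tier and return at once; the upload is queued."""
        local = self._fast_dir / name
        local.write_bytes(payload)
        self._pending.put(local)
        return local

    def _drain(self) -> None:
        while True:
            local = self._pending.get()
            try:
                if local is None:   # sentinel: no more checkpoints
                    return
                shutil.copyfile(local, self._lake_dir / local.name)
            finally:
                self._pending.task_done()

    def close(self) -> None:
        """Wait until every queued checkpoint has been copied."""
        self._pending.put(None)
        self._pending.join()
        self._worker.join()


with tempfile.TemporaryDirectory() as fast, tempfile.TemporaryDirectory() as lake:
    ckpt = AsyncCheckpointer(Path(fast), Path(lake))
    ckpt.save("step-100.ckpt", b"\x00" * 1024)  # training could resume here
    ckpt.close()
    uploaded = sorted(p.name for p in Path(lake).iterdir())
```

The training loop is blocked only for the fast local write in save; the slower copy to the lake happens on the worker thread. A production version would also need retry handling and a policy for freeing space on the fast tier, which this sketch leaves out.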

Performance Results: Experiments show that RapidFS cuts overall training time severalfold, improves GPU utilization, shortens checkpoint upload from 355 s to 120 s, and scales inference throughput linearly with the number of cache nodes.
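To put the checkpoint figures in proportion, the short calculation below derives the speedup and the share of time saved from the two numbers quoted above. It uses only those two values; the variable names are illustrative.

```python
before_s = 355.0  # checkpoint upload time reported without the acceleration layer
after_s = 120.0   # checkpoint upload time reported with it

speedup = before_s / after_s
reduction_pct = 100.0 * (before_s - after_s) / before_s

print(f"speedup: {speedup:.2f}x, time saved: {reduction_pct:.1f}%")
# prints "speedup: 2.96x, time saved: 66.2%"
```

That is close to a threefold improvement, with roughly two thirds of the original upload time removed.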

Conclusion: The combined data‑lake and acceleration‑layer approach addresses both storage capacity and performance challenges across the entire large‑model workflow, delivering a unified, cloud‑native solution.
