How Object Storage Accelerates Large AI Model Training and Inference

This article examines the storage challenges posed by large AI models, analyzes the full workflow from data ingestion to inference, compares HDFS and object‑storage data lakes, and presents Baidu's cloud‑native storage‑acceleration solutions—including RapidFS and PFS—that dramatically improve training speed, checkpoint handling, and model deployment throughput.

Baidu Intelligent Cloud Tech Hub

This is the third session of Baidu's AI large‑model series, focusing on storage acceleration for large models.

We explore three topics: the new storage challenges introduced by end‑to‑end large‑model workflows, specific storage problems in each stage, and Baidu's storage‑acceleration solutions with practical experience.

1. New Storage Challenges for Large Models

Model parameters have exploded to billions, bringing unprecedented performance gains but also massive infrastructure demands: extremely large model size and long training times require ultra‑high performance and stability; models must be tightly coupled with business applications, demanding agile, large‑scale deployment; and continuous massive data updates require seamless data‑ecosystem integration.

We split the full workflow into four stages:

Massive data storage and processing (ingest, cleaning, transformation, annotation, sharing, archiving) – requiring high throughput, large capacity, and ecosystem interoperability.

Model development – emphasizing POSIX compatibility, reliability, and shareability.

Model training – needing fast data reads, high‑throughput checkpoint writes, and minimal I/O wait.

Model inference – demanding high concurrency, high throughput, and streamlined deployment.

These stages reveal two core challenges: diverse storage requirements across stages, and the need for efficient data flow throughout the AI pipeline.

2. Solutions to Full‑Process Storage Problems

The key is a cloud‑native data lake based on object storage, which supersedes traditional HDFS. Object storage offers superior horizontal scalability and lower long‑term cost, while still supporting high throughput for large files.

Compared with HDFS, object storage provides:

Better scalability through distributed flat metadata.

More cost‑effective storage via erasure coding and tiered storage.
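To make the cost point concrete, here is a rough back-of-the-envelope comparison; the 3x replication factor and the EC(8+4) layout below are common illustrative defaults, not figures from this article:

```python
# Rough overhead comparison: HDFS commonly keeps 3 full replicas, while object
# stores typically use erasure coding such as EC(8+4): 8 data + 4 parity shards.
# The figures are illustrative defaults, not Baidu-specific numbers.

def physical_tb(logical_tb: float, scheme: str) -> float:
    """Physical capacity needed to store `logical_tb` of logical data."""
    if scheme == "3-replica":
        return logical_tb * 3.0              # 3 copies -> 3.0x raw usage
    if scheme == "ec-8+4":
        return logical_tb * (8 + 4) / 8      # (data + parity) / data -> 1.5x
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("3-replica", "ec-8+4"):
    print(f"{scheme:>10}: 1000 TB logical -> {physical_tb(1000, scheme):.0f} TB physical")
# 3-replica needs roughly twice the raw capacity of EC(8+4) for the same data.
```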

In the cloud‑native era, object‑storage data lakes become the central data hub.

To address the remaining shortcomings of object storage for small files, we adopt three strategies:

Package small files into container formats (e.g., TFRecord, HDF5) to reduce metadata overhead (a packing sketch follows this list).

Deploy high‑performance hardware or parallel file systems to shorten I/O paths.

Introduce caching layers (Memory or NVMe) close to compute to accelerate reads.
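As an illustration of the first strategy, the sketch below packs a directory of small sample files into a single HDF5 container, so the data lake stores one large object instead of millions of tiny ones. The directory layout and file names are hypothetical, and `h5py`/`numpy` are assumed to be available:

```python
# Minimal sketch: pack many small files into one HDF5 object (paths are hypothetical).
import pathlib

import h5py
import numpy as np

samples_dir = pathlib.Path("dataset/raw")            # directory of small files (assumed)
with h5py.File("dataset/packed.h5", "w") as packed:
    for i, path in enumerate(sorted(samples_dir.glob("*.jpg"))):
        raw = np.frombuffer(path.read_bytes(), dtype=np.uint8)
        # One dataset per sample; metadata now lives inside a single large object,
        # so listing and opening costs against object storage drop sharply.
        packed.create_dataset(f"samples/{i:08d}", data=raw)
```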

Combining a data lake with an acceleration layer (RapidFS or PFS) resolves the identified challenges.

We illustrate three concrete scenarios:

Dataset read acceleration: automatic data-flow links pre-load data into the acceleration layer, overlapping data preparation with training so I/O latency is hidden behind compute.
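A minimal sketch of that overlap, with the shard download and training step simulated by sleeps; the cache path and timings are placeholders, not the actual RapidFS/PFS mechanism:

```python
# While the accelerator consumes the current shard, a background worker pulls
# the next shard from the data lake into local cache. Downloads and training
# are simulated here so the example runs standalone.
import time
from concurrent.futures import ThreadPoolExecutor

def download_shard(shard_id: int) -> str:
    time.sleep(0.2)                                   # stand-in for an object-store read
    return f"/cache/shard_{shard_id}"                 # hypothetical local cache path

def train_on(local_path: str) -> None:
    time.sleep(0.5)                                   # stand-in for a training step

shards = list(range(8))
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(download_shard, shards[0])
    for i in range(len(shards)):
        local_path = pending.result()                 # blocks only if prefetch fell behind
        if i + 1 < len(shards):
            pending = pool.submit(download_shard, shards[i + 1])  # prefetch the next shard
        train_on(local_path)                          # download hidden behind compute
```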

Checkpoint write acceleration: checkpoints are written directly to the acceleration layer (memory/NVMe) and streamed to object storage asynchronously, drastically reducing the pause that checkpointing imposes on training.
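A sketch of that pattern follows; the mount point, checkpoint bytes, and upload call are placeholders for whatever fast mount and object-store client a given stack provides, not a specific Baidu API:

```python
# Asynchronous checkpointing sketch: serialize to a fast local mount, resume
# training immediately, and let a background thread stream the file to the
# data lake. Paths and the upload call are illustrative placeholders.
import tempfile
import threading
from pathlib import Path

FAST_MOUNT = Path(tempfile.gettempdir())               # substitute the NVMe/memory mount

def save_state(step: int) -> Path:
    local = FAST_MOUNT / f"ckpt_{step}.bin"
    local.write_bytes(b"\x00" * 1024)                   # stand-in for real checkpoint bytes
    return local

def upload_to_object_store(local: Path, remote_key: str) -> None:
    print(f"uploading {local} -> {remote_key}")         # stand-in for a BOS/S3 upload

def checkpoint_async(step: int) -> threading.Thread:
    local = save_state(step)                            # brief pause: local write only
    t = threading.Thread(target=upload_to_object_store,
                         args=(local, f"ckpts/ckpt_{step}.bin"), daemon=True)
    t.start()                                           # upload overlaps with training
    return t

checkpoint_async(100).join()
```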

Inference model distribution: models are written to the data lake; event notifications trigger pre-loading into distributed caches near the inference services, enabling high-throughput, low-latency model serving even across thousands of instances.
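The sketch below mimics that flow with a toy event handler; the event shape, cache-node names, and prefetch call are assumptions made for illustration:

```python
# Event-driven model distribution sketch: a "new object" notification from the
# data lake fans out pre-warm requests to cache nodes near the inference fleet,
# so instances read from nearby caches instead of all hitting object storage.
CACHE_NODES = ["cache-a", "cache-b", "cache-c"]

def prefetch_into_cache(node: str, object_key: str) -> None:
    print(f"{node}: warming {object_key}")             # stand-in for a cache pre-load RPC

def on_model_uploaded(event: dict) -> None:
    object_key = event["key"]                          # e.g. a new model version artifact
    for node in CACHE_NODES:                           # fan out the pre-warm request
        prefetch_into_cache(node, object_key)

on_model_uploaded({"key": "models/llm-v7/weights.bin"})
```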

Performance tests show that RapidFS cuts training time several-fold compared with reading directly from BOS, improves GPU utilization, reduces checkpoint time from minutes to seconds, and scales model-distribution throughput linearly with the number of cache nodes.

3. Baidu Canghai Storage‑Acceleration Solution

The architecture consists of:

Object storage BOS as the scalable cloud‑native data lake.

Acceleration layer: Parallel File System (PFS) for extreme performance or RapidFS for cost‑effective distributed caching.

AI compute layer with heterogeneous accelerators and a cloud-native AI platform.

Both PFS and RapidFS automatically load and pre‑heat selected data, synchronize results back to BOS, and release resources after training.
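Conceptually, that lifecycle looks like the sketch below; every name here is a placeholder, since the real pre-heating, synchronization, and teardown are handled by PFS/RapidFS themselves:

```python
# Conceptual acceleration-layer lifecycle: pre-heat selected data before
# training, sync new results back to the data lake afterwards, then release
# the cache resources. All functions and paths are illustrative placeholders.
from contextlib import contextmanager

@contextmanager
def acceleration_session(dataset_prefix: str):
    print(f"pre-heating {dataset_prefix} into the acceleration layer")
    try:
        yield f"/mnt/accel/{dataset_prefix}"           # hypothetical local mount path
    finally:
        print("syncing new results back to the data lake")
        print("releasing acceleration-layer resources")

with acceleration_session("datasets/llm-pretrain") as local_mount:
    print(f"training reads from {local_mount}")
```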

Real‑world benchmarks demonstrate multi‑fold speedups in training, checkpointing, and inference model distribution.

Tags: cloud-native, AI, large models, object storage