How Baidu’s Cloud Storage Powers High‑Performance Computing and AI Workloads
This article surveys the storage challenges of high-performance computing (traditional HPC, AI-driven HPC, and HPDA), then details Baidu's unified internal storage platform and the Canghai solution built on it: the BOS object-storage base plus the PFS and RapidFS runtime stores. It covers their architecture and key features and closes with a real-world autonomous-driving customer case.
1. Storage Challenges in High‑Performance Computing Scenarios
High-Performance Computing (HPC) encompasses three main categories: traditional supercomputing, AI-driven HPC (AI HPC), and High-Performance Data Analytics (HPDA), each presenting distinct storage requirements such as high throughput, low latency, and metadata handling for massive numbers of small files.
1.1 What Is High‑Performance Computing?
HPC refers to computing workloads that require performance far beyond typical personal computers, often using clusters of servers. Recent trends show expanding application domains, increasing adoption of cloud‑based HPC, and growing convergence with AI and big‑data workloads.
Trend 1: More diverse HPC use cases across industries.
Trend 2: Migration of HPC workloads to the cloud for elastic resource provisioning.
Trend 3: Cross‑domain collaboration between HPC, AI, and big‑data technologies.
1.2 Storage Issues in Traditional HPC
Traditional HPC workloads generate large matrix files that are accessed randomly by many processes, leading to inefficient small I/O operations and the need for coordinated process synchronization.
I/O efficiency problem: scattered random reads/writes degrade performance, especially on mechanical disks.
Process coordination problem: all processes must signal completion before the overall computation can finish.
Two-phase I/O aggregation (collective buffering) combines many small I/O requests into larger sequential operations; it is typically implemented in the MPI-I/O layer on top of POSIX file interfaces.
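To make the idea concrete, here is a minimal collective-write sketch using mpi4py's MPI-I/O bindings; the file name and sizes are illustrative, not from the article. Run it with something like `mpirun -n 4 python collective_write.py` on a shared file system.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank holds a small slice of the matrix (1 MiB of float64 here).
local = np.full(131072, rank, dtype=np.float64)

fh = MPI.File.Open(comm, "matrix.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)

# Write_at_all is a collective call: the MPI-I/O layer can gather the many
# per-rank pieces (phase one) and have a few aggregator ranks issue large
# sequential writes (phase two), instead of scattering small random writes.
fh.Write_at_all(rank * local.nbytes, local)
fh.Close()
```

The key design point is that the collective variant (`Write_at_all`, as opposed to the independent `Write_at`) gives the library the global picture it needs to aggregate.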
1.3 Storage Issues in AI HPC
AI training involves repeated data loading and periodic checkpointing. Large numbers of small files (e.g., image datasets) stress metadata performance, while high-throughput GPU training demands storage compatible with POSIX and the Kubernetes CSI, optionally leveraging NVIDIA GPUDirect Storage.
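A small sketch of why small files hurt, assuming a hypothetical dataset directory: every epoch pays one open() per sample, so metadata operations scale with the file count rather than the data volume.

```python
import os, time

DATASET_DIR = "/mnt/dataset/train"   # hypothetical directory of small files

def epoch_read(root):
    """One epoch touches every file: each sample costs an open() plus a
    read(), so metadata operations scale with the number of files."""
    n_files = n_bytes = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:  # metadata op
                n_bytes += len(f.read())                        # small read
            n_files += 1
    return n_files, n_bytes

t0 = time.time()
files, size = epoch_read(DATASET_DIR)
print(f"{files} files, {size / 2**20:.1f} MiB in {time.time() - t0:.1f} s")
```

With millions of samples, the per-file metadata round trips dominate unless the file system is built for high metadata throughput.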
1.4 Storage Issues in HPDA
HPDA workloads, such as large-scale MapReduce jobs, primarily handle big files and require very high throughput but are less sensitive to latency. Object storage exposed through an HCFS (Hadoop Compatible File System) interface is commonly used.
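A minimal HPDA sketch: a PySpark job scanning big files straight from object storage through an HCFS-compatible connector. The bos:// scheme, bucket, and paths here are assumptions for illustration, and the connector jar must be on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hpda-wordcount").getOrCreate()

# Throughput-bound scan of large log files; per-request latency matters
# far less than aggregate bandwidth for this kind of job.
lines = spark.read.text("bos://demo-bucket/road-logs/*.log")
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("bos://demo-bucket/results/wordcount")
```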
1.5 Summary of HPC Storage Requirements
Across all HPC categories, the common needs are:
High throughput.
Low latency (traditional HPC and AI HPC).
Massive small-file metadata operations (AI HPC).
POSIX and MPI-I/O interfaces.
Data durability.
Cost-effective storage tiers.
2. Baidu’s Internal High‑Performance Storage Practices
Baidu operates a unified storage platform that provides high reliability, low cost, and high throughput, supporting POSIX, HCFS, and custom SDKs. It powers diverse workloads such as autonomous driving, speech recognition, and ad recommendation.
Two runtime storage solutions are offered:
Local‑disk or parallel file system (PFS) for workloads with many small files, delivering high metadata performance.
Direct access to the storage base for long‑running, throughput‑focused jobs.
The platform also automates data movement, mounting, and capacity provisioning, simplifying user interaction.
3. Baidu Canghai High‑Performance Storage Solution
The solution combines a large‑capacity, high‑throughput, low‑cost storage base (BOS) with fast runtime stores PFS and RapidFS.
3.1 Parallel File System (PFS)
PFS is a Lustre-like parallel file system that provides a one-hop I/O path via dedicated metadata (MDS) and data (OSS, object storage servers in the Lustre sense, not the object-storage base) nodes, deployed close to compute resources and connected over RDMA or high-speed TCP.
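The read path can be pictured as follows. This is a conceptual sketch of the Lustre-style design, with in-memory dicts standing in for the real MDS and OSS services; it is not the PFS client API.

```python
# MDS: path -> ordered stripe layout [(oss_node, stripe_key), ...]
mds = {"/train/batch0.npy": [("oss-1", "k0"), ("oss-2", "k1")]}

# OSS nodes: each holds its own stripes; reached over RDMA/TCP in reality.
oss = {
    "oss-1": {"k0": b"first-stripe-"},
    "oss-2": {"k1": b"second-stripe"},
}

def read_file(path: str) -> bytes:
    layout = mds[path]  # metadata lookup: one round trip to the MDS
    # one data hop: fetch every stripe straight from the node that owns it,
    # with no proxy or gateway in between
    return b"".join(oss[node][key] for node, key in layout)

print(read_file("/train/batch0.npy"))
```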
3.2 Distributed Cache Accelerator (RapidFS)
RapidFS leverages idle memory and disk on compute nodes to create a peer‑to‑peer cache, offering hierarchical namespace caching and data caching to accelerate access to object storage.
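The caching behavior can be illustrated with a read-through, two-tier sketch in the spirit of RapidFS; this is not its real API, and the paths and fetch callback are hypothetical.

```python
import os

class TieredCache:
    def __init__(self, disk_dir, fetch_from_bos):
        self.mem = {}                 # tier 1: idle RAM on the compute node
        self.disk_dir = disk_dir      # tier 2: idle local disk
        self.fetch = fetch_from_bos   # miss path: the BOS storage base
        os.makedirs(disk_dir, exist_ok=True)

    def get(self, key: str) -> bytes:
        if key in self.mem:                       # memory hit
            return self.mem[key]
        path = os.path.join(self.disk_dir, key)
        if os.path.exists(path):                  # disk hit
            with open(path, "rb") as f:
                data = f.read()
        else:                                     # miss: go to object storage
            data = self.fetch(key)
            with open(path, "wb") as f:           # populate the disk tier
                f.write(data)
        self.mem[key] = data                      # promote to the memory tier
        return data

cache = TieredCache("/tmp/rapidfs-demo", lambda k: b"bytes-of-" + k.encode())
print(cache.get("sample.jpg"))   # first call misses; later calls hit memory
```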
3.3 Efficient Data Transfer
Data movement between storage base, PFS, and RapidFS is handled by lifecycle policies (automatic tiering) and Bucket Link, which binds a PFS/RapidFS namespace to an object‑storage path for seamless pre‑loading.
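Conceptually, a bucket link separates namespace loading from data movement. The sketch below illustrates that split with a plain dict standing in for the bucket; it is not the real Bucket Link API.

```python
class BucketLink:
    """Bind a file namespace to an object-storage prefix: the listing is
    materialized eagerly (cheap), object bytes are pulled lazily (expensive)."""

    def __init__(self, store: dict, prefix: str):
        self.store = store
        # Namespace metadata is loaded up front ...
        self.names = [k for k in store if k.startswith(prefix)]
        self.cache = {}

    def open(self, name: str) -> bytes:
        # ... while data moves on first access and then stays cached.
        if name not in self.cache:
            self.cache[name] = self.store[name]
        return self.cache[name]

bos = {"train/a.jpg": b"...", "train/b.jpg": b"...", "logs/x.log": b"..."}
link = BucketLink(bos, "train/")
print(link.names)               # the pre-loaded namespace under the prefix
print(link.open("train/a.jpg")) # data fetched lazily on first open
```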
3.4 Unified Scheduling
Bucket Link is integrated with Kubernetes via the open‑source Fluid project, enabling pipeline‑style separation of data loading and training phases to maximize GPU utilization.
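The payoff of this pipeline separation can be shown with a generic producer/consumer sketch using plain Python threads (not Fluid's actual mechanism): the next shard is pre-loaded while the current one trains, so the simulated GPU never idles on I/O.

```python
import queue, threading, time

shards = [f"shard-{i}" for i in range(4)]
q = queue.Queue(maxsize=2)       # bounded buffer between the two phases

def loader():
    for shard in shards:
        time.sleep(0.5)          # simulated pre-load from BOS into the cache
        q.put(shard)
    q.put(None)                  # sentinel: no more shards

threading.Thread(target=loader, daemon=True).start()

while (shard := q.get()) is not None:
    time.sleep(0.5)              # simulated GPU training step on this shard
    print("trained on", shard)   # the next shard was loading in parallel
```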
3.5 Test Results
Benchmarks show that training through RapidFS or PFS with Bucket Link keeps the GPUs fully utilized, whereas training directly against object storage leaves them largely idle due to I/O bottlenecks.
4. Customer Case Study
An autonomous‑driving customer collects petabytes of road‑scene data using specialized hardware (“Moonlight Box”), uploads it to BOS via network or physical shipment, and then uses PFS to feed large‑scale GPU training clusters. The solution demonstrates end‑to‑end data ingestion, storage, training, and model deployment.
Additional cloud services (IaaS, PaaS, SaaS) are also leveraged, reflecting Baidu’s extensive internal practice turned into product offerings.
