How Baidu’s Cloud Storage Powers High‑Performance Computing and AI Workloads
This article surveys the storage challenges of high-performance computing (traditional HPC, AI-driven HPC, and HPDA), then details Baidu's unified internal storage platform and the Canghai solution built on it: the BOS object-storage base plus the PFS and RapidFS runtime stores. It covers their architecture and key features and closes with a real-world autonomous-driving customer case.
1. Storage Challenges in High‑Performance Computing Scenarios
High-Performance Computing (HPC) encompasses three main categories: traditional supercomputing, AI-driven HPC (AI HPC), and High-Performance Data Analytics (HPDA), each presenting distinct storage requirements such as high throughput, low latency, and metadata handling for massive numbers of small files.
1.1 What Is High‑Performance Computing?
HPC refers to computing workloads that require performance far beyond typical personal computers, often using clusters of servers. Recent trends show expanding application domains, increasing adoption of cloud‑based HPC, and growing convergence with AI and big‑data workloads.
Trend 1: More diverse HPC use cases across industries.
Trend 2: Migration of HPC workloads to the cloud for elastic resource provisioning.
Trend 3: Cross‑domain collaboration between HPC, AI, and big‑data technologies.
1.2 Storage Issues in Traditional HPC
Traditional HPC workloads generate large matrix files that are accessed randomly by many processes, leading to inefficient small I/O operations and the need for coordinated process synchronization.
I/O efficiency problem: scattered random reads/writes degrade performance, especially on mechanical disks.
Process coordination problem: all processes must signal completion before the overall computation can finish.
Two-phase I/O aggregation (collective buffering) combines many small I/O requests into larger sequential operations; it is typically implemented in the MPI-I/O layer on top of POSIX file interfaces.
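To make the idea concrete, here is a minimal collective-write sketch using mpi4py's MPI-I/O bindings; the file name and sizes are illustrative, not from the article. Run it with something like `mpirun -n 4 python collective_write.py` on a shared file system.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank holds a small slice of the matrix (1 MiB of float64 here).
local = np.full(131072, rank, dtype=np.float64)

fh = MPI.File.Open(comm, "matrix.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)

# Write_at_all is a collective call: the MPI-I/O layer can gather the many
# per-rank pieces (phase one) and have a few aggregator ranks issue large
# sequential writes (phase two), instead of scattering small random writes.
fh.Write_at_all(rank * local.nbytes, local)
fh.Close()
```

The key design point is that the collective variant (`Write_at_all`, as opposed to the independent `Write_at`) gives the library the global picture it needs to aggregate.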
1.3 Storage Issues in AI HPC
AI training involves repeated data loading and periodic checkpointing. Large numbers of small files (e.g., image datasets) stress metadata performance, while high-throughput GPU training demands storage compatible with POSIX and the Kubernetes CSI, optionally leveraging NVIDIA GPUDirect Storage.
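A small sketch of why small files hurt, assuming a hypothetical dataset directory: every epoch pays one open() per sample, so metadata operations scale with the file count rather than the data volume.

```python
import os, time

DATASET_DIR = "/mnt/dataset/train"   # hypothetical directory of small files

def epoch_read(root):
    """One epoch touches every file: each sample costs an open() plus a
    read(), so metadata operations scale with the number of files."""
    n_files = n_bytes = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:  # metadata op
                n_bytes += len(f.read())                        # small read
            n_files += 1
    return n_files, n_bytes

t0 = time.time()
files, size = epoch_read(DATASET_DIR)
print(f"{files} files, {size / 2**20:.1f} MiB in {time.time() - t0:.1f} s")
```

With millions of samples, the per-file metadata round trips dominate unless the file system is built for high metadata throughput.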
1.4 Storage Issues in HPDA
HPDA workloads, such as large-scale MapReduce jobs, primarily handle big files and require very high throughput but are less sensitive to latency. Object storage exposed through an HCFS (Hadoop Compatible File System) interface is commonly used.
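A minimal HPDA sketch: a PySpark job scanning big files straight from object storage through an HCFS-compatible connector. The bos:// scheme, bucket, and paths here are assumptions for illustration, and the connector jar must be on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hpda-wordcount").getOrCreate()

# Throughput-bound scan of large log files; per-request latency matters
# far less than aggregate bandwidth for this kind of job.
lines = spark.read.text("bos://demo-bucket/road-logs/*.log")
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("bos://demo-bucket/results/wordcount")
```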
1.5 Summary of HPC Storage Requirements
Across all HPC categories, the common needs are:
High throughput.
Low latency (traditional HPC and AI HPC).
Massive small-file metadata operations (AI HPC).
POSIX and MPI-I/O interfaces.
Data durability.
Cost-effective storage tiers.
2. Baidu’s Internal High‑Performance Storage Practices
Baidu operates a unified storage platform that provides high reliability, low cost, and high throughput, supporting POSIX, HCFS, and custom SDKs. It powers diverse workloads such as autonomous driving, speech recognition, and ad recommendation.
Two runtime storage solutions are offered:
Local‑disk or parallel file system (PFS) for workloads with many small files, delivering high metadata performance.
Direct access to the storage base for long‑running, throughput‑focused jobs.
The platform also automates data movement, mounting, and capacity provisioning, simplifying user interaction.
3. Baidu Canghai High‑Performance Storage Solution
The solution combines a large‑capacity, high‑throughput, low‑cost storage base (BOS) with fast runtime stores PFS and RapidFS.
3.1 Parallel File System (PFS)
PFS is a Lustre-like parallel file system that provides a one-hop I/O path via dedicated metadata (MDS) and data (OSS, object storage servers in the Lustre sense, not the object-storage base) nodes, deployed close to compute resources and connected over RDMA or high-speed TCP.
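The read path can be pictured as follows. This is a conceptual sketch of the Lustre-style design, with in-memory dicts standing in for the real MDS and OSS services; it is not the PFS client API.

```python
# MDS: path -> ordered stripe layout [(oss_node, stripe_key), ...]
mds = {"/train/batch0.npy": [("oss-1", "k0"), ("oss-2", "k1")]}

# OSS nodes: each holds its own stripes; reached over RDMA/TCP in reality.
oss = {
    "oss-1": {"k0": b"first-stripe-"},
    "oss-2": {"k1": b"second-stripe"},
}

def read_file(path: str) -> bytes:
    layout = mds[path]  # metadata lookup: one round trip to the MDS
    # one data hop: fetch every stripe straight from the node that owns it,
    # with no proxy or gateway in between
    return b"".join(oss[node][key] for node, key in layout)

print(read_file("/train/batch0.npy"))
```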
3.2 Distributed Cache Accelerator (RapidFS)
RapidFS leverages idle memory and disk on compute nodes to create a peer‑to‑peer cache, offering hierarchical namespace caching and data caching to accelerate access to object storage.
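The caching behavior can be illustrated with a read-through, two-tier sketch in the spirit of RapidFS; this is not its real API, and the paths and fetch callback are hypothetical.

```python
import os

class TieredCache:
    def __init__(self, disk_dir, fetch_from_bos):
        self.mem = {}                 # tier 1: idle RAM on the compute node
        self.disk_dir = disk_dir      # tier 2: idle local disk
        self.fetch = fetch_from_bos   # miss path: the BOS storage base
        os.makedirs(disk_dir, exist_ok=True)

    def get(self, key: str) -> bytes:
        if key in self.mem:                       # memory hit
            return self.mem[key]
        path = os.path.join(self.disk_dir, key)
        if os.path.exists(path):                  # disk hit
            with open(path, "rb") as f:
                data = f.read()
        else:                                     # miss: go to object storage
            data = self.fetch(key)
            with open(path, "wb") as f:           # populate the disk tier
                f.write(data)
        self.mem[key] = data                      # promote to the memory tier
        return data

cache = TieredCache("/tmp/rapidfs-demo", lambda k: b"bytes-of-" + k.encode())
print(cache.get("sample.jpg"))   # first call misses; later calls hit memory
```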
3.3 Efficient Data Transfer
Data movement between storage base, PFS, and RapidFS is handled by lifecycle policies (automatic tiering) and Bucket Link, which binds a PFS/RapidFS namespace to an object‑storage path for seamless pre‑loading.
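Conceptually, a bucket link separates namespace loading from data movement. The sketch below illustrates that split with a plain dict standing in for the bucket; it is not the real Bucket Link API.

```python
class BucketLink:
    """Bind a file namespace to an object-storage prefix: the listing is
    materialized eagerly (cheap), object bytes are pulled lazily (expensive)."""

    def __init__(self, store: dict, prefix: str):
        self.store = store
        # Namespace metadata is loaded up front ...
        self.names = [k for k in store if k.startswith(prefix)]
        self.cache = {}

    def open(self, name: str) -> bytes:
        # ... while data moves on first access and then stays cached.
        if name not in self.cache:
            self.cache[name] = self.store[name]
        return self.cache[name]

bos = {"train/a.jpg": b"...", "train/b.jpg": b"...", "logs/x.log": b"..."}
link = BucketLink(bos, "train/")
print(link.names)               # the pre-loaded namespace under the prefix
print(link.open("train/a.jpg")) # data fetched lazily on first open
```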
3.4 Unified Scheduling
Bucket Link is integrated with Kubernetes via the open‑source Fluid project, enabling pipeline‑style separation of data loading and training phases to maximize GPU utilization.
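The payoff of this pipeline separation can be shown with a generic producer/consumer sketch using plain Python threads (not Fluid's actual mechanism): the next shard is pre-loaded while the current one trains, so the simulated GPU never idles on I/O.

```python
import queue, threading, time

shards = [f"shard-{i}" for i in range(4)]
q = queue.Queue(maxsize=2)       # bounded buffer between the two phases

def loader():
    for shard in shards:
        time.sleep(0.5)          # simulated pre-load from BOS into the cache
        q.put(shard)
    q.put(None)                  # sentinel: no more shards

threading.Thread(target=loader, daemon=True).start()

while (shard := q.get()) is not None:
    time.sleep(0.5)              # simulated GPU training step on this shard
    print("trained on", shard)   # the next shard was loading in parallel
```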
3.5 Test Results
Benchmarks show that training through RapidFS or PFS with Bucket Link keeps the GPUs fully utilized, whereas training directly against object storage leaves them largely idle due to I/O bottlenecks.
4. Customer Case Study
An autonomous‑driving customer collects petabytes of road‑scene data using specialized hardware (“Moonlight Box”), uploads it to BOS via network or physical shipment, and then uses PFS to feed large‑scale GPU training clusters. The solution demonstrates end‑to‑end data ingestion, storage, training, and model deployment.
Additional cloud services (IaaS, PaaS, SaaS) are also leveraged, reflecting Baidu’s extensive internal practice turned into product offerings.
