High‑Performance Computing Storage Challenges and Baidu Canghai Storage Solutions
This article explains the storage problems faced by traditional HPC, AI-driven HPC, and high-performance data analysis, describes Baidu's internal high-performance storage practices, and introduces the Baidu Canghai solution, covering the BOS object storage service, the PFS parallel file system, RapidFS, data-flow mechanisms, and a customer case. Together these show how the stack meets the demanding throughput, latency, and cost requirements of modern high-performance workloads.
1. Storage Issues in High‑Performance Computing
High‑Performance Computing (HPC) encompasses traditional supercomputing, AI‑driven HPC and high‑performance data analysis (HPDA), each presenting distinct storage challenges such as random I/O inefficiency, process coordination, massive small‑file metadata overhead, and stringent throughput and latency demands.
1.1 What Is HPC?
HPC refers to supercomputers that deliver performance one to two orders of magnitude higher than contemporary personal computers, used in scientific simulation, weather forecasting, and increasingly in industry‑wide scenarios.
1.2 Traditional HPC Storage Problems
Each process holds scattered fragments of the simulation matrix, so naive persistence produces random small I/O with poor efficiency.
Process coordination is required to ensure all nodes finish before data can be persisted.
Two-phase (two-stage) I/O aggregates these small requests into large sequential I/O, relying on POSIX file interfaces and MPI-I/O.
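The aggregation step described above can be sketched in plain Python. This is a toy stand-in for MPI-I/O collective buffering, not the real MPI protocol: each "rank" is just a list of (offset, data) fragments, and a single aggregator plays the role of the collective-buffering node.

```python
import io

def two_phase_write(fragments_per_rank):
    """Toy two-phase (collective) I/O: each 'rank' holds scattered
    (offset, data) fragments; an aggregator reorders them into one
    contiguous buffer and issues a single large sequential write."""
    # Phase 1: communication -- gather all ranks' fragments at the aggregator.
    gathered = [frag for rank in fragments_per_rank for frag in rank]
    gathered.sort(key=lambda f: f[0])  # order by file offset

    # Phase 2: one large sequential write instead of many small random ones.
    total = max(off + len(data) for off, data in gathered)
    buf = bytearray(total)
    for off, data in gathered:
        buf[off:off + len(data)] = data
    out = io.BytesIO()  # stands in for the shared output file
    out.write(bytes(buf))
    return out.getvalue()

# Three "ranks" each own interleaved 4-byte matrix fragments.
ranks = [
    [(0, b"aaaa"), (12, b"dddd")],
    [(4, b"bbbb"), (16, b"eeee")],
    [(8, b"cccc"), (20, b"ffff")],
]
print(two_phase_write(ranks))  # b'aaaabbbbccccddddeeeeffff'
```

The six scattered 4-byte writes become one 24-byte sequential write, which is exactly the trade (extra communication for fewer, larger I/O requests) that makes two-phase I/O pay off on parallel file systems.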
1.3 AI HPC Storage Problems
AI training workloads generate massive numbers of small-file reads plus periodic checkpoint writes; they benefit from POSIX, Kubernetes CSI, and optionally GPU Direct Storage interfaces, and metadata performance is critical because tiny files dominate.
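Why tiny files stress metadata can be shown with a small, self-contained comparison (file names and sizes here are illustrative, not from any real training set): one-file-per-sample costs one open, and therefore one metadata lookup, per sample, while packing samples into one large file with an offset index costs a single open.

```python
import os
import tempfile

# Illustrative workload: 1000 tiny samples, two on-disk layouts.
samples = [bytes([i % 256]) * 64 for i in range(1000)]

with tempfile.TemporaryDirectory() as d:
    # Layout A: one file per sample -> 1000 opens, 1000 metadata operations.
    for i, s in enumerate(samples):
        with open(os.path.join(d, f"{i}.bin"), "wb") as f:
            f.write(s)
    opens_per_file = len(samples)

    # Layout B: one packed file plus an in-memory offset index -> 1 open.
    packed = os.path.join(d, "packed.bin")
    index = []
    with open(packed, "wb") as f:
        for s in samples:
            index.append((f.tell(), len(s)))
            f.write(s)
    with open(packed, "rb") as f:      # random access via seek, no extra opens
        f.seek(index[500][0])
        sample = f.read(index[500][1])
    opens_packed = 1

print(opens_per_file, opens_packed, sample == samples[500])  # 1000 1 True
```

The data volume is identical in both layouts; only the metadata traffic differs, which is why small-file-heavy training runs live or die on the storage system's metadata path.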
1.4 HPDA Storage Problems
HPDA workloads are dominated by large files, demanding very high throughput but tolerating higher latency; they typically use Hadoop‑compatible HCFS interfaces.
1.5 Summary of HPC Storage Requirements
Common needs include high throughput, low latency (for HPC/AI HPC), support for massive small‑file metadata, POSIX compatibility, MPI‑I/O for traditional HPC, and cost‑effective, reliable storage.
2. Baidu’s Internal High‑Performance Storage Practice
Baidu operates a unified storage base that provides high reliability, low cost, and high throughput, supporting POSIX and HCFS interfaces and offering SDKs for custom development.
Two runtime storage solutions address different scenarios:
Local‑disk or parallel file system (PFS) for small‑file intensive AI training, delivering fast metadata and I/O performance.
Direct access to the storage base for long‑running, throughput‑critical jobs.
A distributed training platform abstracts mounting, capacity allocation, and data movement, simplifying user experience.
3. Baidu Canghai High‑Performance Storage Solution
The solution combines the object storage service BOS (with tiered storage and lifecycle management) as the storage base with two runtime systems: PFS (a Lustre‑like parallel file system) and RapidFS (a cache‑accelerated system).
3.1 Parallel File System PFS
PFS shortens the I/O path with dedicated metadata servers (MDS) and data servers (OSS, object storage servers in the Lustre sense), deployed close to compute nodes over RDMA or high-speed TCP.
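The core trick of a Lustre-like layout is striping: consecutive chunks of a file round-robin across the data servers, so one large read fans out in parallel. A minimal sketch of the offset arithmetic, with illustrative stripe_size/stripe_count defaults (not PFS's actual values):

```python
def locate(offset, stripe_size=1 << 20, stripe_count=4):
    """Map a logical file offset to (data-node index, offset within that
    node's backing object) under round-robin striping, Lustre-style."""
    stripe_index = offset // stripe_size           # which stripe overall
    node = stripe_index % stripe_count             # data node serving it
    local = (stripe_index // stripe_count) * stripe_size + offset % stripe_size
    return node, local

# Consecutive 1 MiB stripes land on different nodes, so a large
# sequential read is served by all four data servers at once.
for off in (0, 1 << 20, 2 << 20, 3 << 20, 4 << 20):
    print(off, locate(off))
```

Because the mapping is pure arithmetic, a client that has fetched the layout from the MDS once can address the data servers directly, which is what keeps the I/O path short.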
3.2 Distributed Cache Accelerated RapidFS
RapidFS leverages idle memory and disk on compute nodes to form a P2P cache, offering hierarchical namespace caching and data caching to accelerate access to BOS.
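One common way to decide which peer caches which object in such a P2P cache is a consistent-hash ring. The sketch below shows only that placement idea; it is not RapidFS's actual protocol, and all node and object names are made up.

```python
import bisect
import hashlib

class P2PCacheRing:
    """Toy consistent-hash ring: picks which compute node caches a given
    BOS object, so peers fetch from each other before falling back to BOS."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets `vnodes` virtual points on the ring for balance.
        self.ring = sorted(
            (self._hash(f"{n}#{v}"), n) for n in nodes for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def owner(self, object_key):
        # First virtual point clockwise from the key's hash owns the object.
        i = bisect.bisect(self.keys, self._hash(object_key)) % len(self.ring)
        return self.ring[i][1]

ring = P2PCacheRing(["node-0", "node-1", "node-2"])
print(ring.owner("bos://bucket/train/000001.jpg"))
```

Every client computes the same owner independently, with no central directory, and adding or removing a node only remaps the keys adjacent to its virtual points, which suits caches built from whatever idle compute-node resources happen to be available.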
3.3 Efficient Data Transfer
Lifecycle policies automatically migrate cold data from PFS to BOS, reducing cost.
Bucket Link binds PFS/RapidFS namespaces to BOS paths, enabling seamless data pre‑loading and hot‑data caching.
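The cold-data lifecycle policy above boils down to an access-time threshold. A minimal sketch, in which local directories stand in for the real PFS and BOS endpoints and the file names are hypothetical:

```python
import os
import shutil
import tempfile
import time

def migrate_cold(pfs_dir, bos_dir, ttl_seconds):
    """Lifecycle-policy sketch: files in the hot (PFS-like) directory not
    accessed for ttl_seconds are demoted to the cheap (BOS-like) tier."""
    moved = []
    now = time.time()
    for name in os.listdir(pfs_dir):
        path = os.path.join(pfs_dir, name)
        if os.path.isfile(path) and now - os.path.getatime(path) > ttl_seconds:
            shutil.move(path, os.path.join(bos_dir, name))  # demote cold file
            moved.append(name)
    return sorted(moved)

hot, cold = tempfile.mkdtemp(), tempfile.mkdtemp()
open(os.path.join(hot, "old.ckpt"), "w").close()
os.utime(os.path.join(hot, "old.ckpt"), (time.time() - 3600,) * 2)  # age it
open(os.path.join(hot, "new.ckpt"), "w").close()
result = migrate_cold(hot, cold, ttl_seconds=600)
print(result)  # ['old.ckpt']
```

In the real system the demotion target is object storage rather than a directory, and Bucket Link handles the reverse direction, pulling hot data back near the compute nodes on demand.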
3.4 Unified Scheduling
Bucket Link is integrated into Kubernetes via the open‑source Fluid project, allowing data‑preload pipelines to run in parallel with GPU training, improving GPU utilization.
3.5 Test Results
Experiments show that using RapidFS or PFS with Bucket Link achieves near‑100% GPU utilization, whereas direct BOS access leaves GPUs under‑utilized due to I/O bottlenecks.
3.6 Customer Case
A leading autonomous‑driving client collects petabyte‑scale road‑test data, uploads it via Baidu’s “Moonlight Box” hardware to BOS, and uses PFS to feed massive GPU clusters for model training, completing the data‑collect‑train‑iterate loop efficiently.
—END—
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.