How Baidu’s Canghai Storage Powers High‑Performance Computing: Challenges and Solutions

This article analyzes the storage challenges of high‑performance computing—including traditional HPC, AI‑driven HPC, and high‑performance data analysis—examines Baidu’s internal practices, and presents the Canghai storage platform with its object storage, parallel file system (PFS) and RapidFS solutions that address throughput, latency, and scalability requirements.

Baidu Geek Talk

Jul 26, 2022

Storage Issues in High‑Performance Computing (HPC)

HPC refers to clusters that deliver performance one to two orders of magnitude higher than a typical PC. Modern workloads span traditional scientific simulation, AI‑driven deep‑learning training (AI‑HPC), and high‑performance data analysis (HPDA) such as large‑scale genomics.

Traditional HPC

Scientific codes often split a massive matrix into sub‑matrices processed by many MPI ranks. The matrix is stored as a single large file, which creates two problems:

I/O efficiency: each rank reads or writes small, random fragments, leading to many tiny I/Os that perform poorly on mechanical disks and even on SSDs.

Process coordination: all ranks must finish before the global result is valid, requiring a synchronization mechanism.

Two‑stage I/O aggregation mitigates both issues. A subset of ranks acts as I/O aggregators: they gather fragments from the other ranks, buffer them in memory, and issue large sequential writes via MPI‑IO. The storage system must therefore support POSIX file access and MPI‑IO, sustain tens to hundreds of GB/s of throughput, and keep latency low enough that CPUs and GPUs stay busy.
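
The aggregation idea can be illustrated without an MPI installation. The following is a minimal single-process sketch, not an MPI-IO implementation: the function name `two_phase_write`, the even split of the file into per-aggregator domains, and the toy 64-byte layout are all assumptions made for illustration. In a real job the first stage is a network exchange between ranks, and a collective MPI-IO write would typically perform the aggregation internally.

```python
import os
import tempfile


def two_phase_write(path, rank_fragments, num_aggregators, total_size):
    """Write scattered per-rank fragments with one large write per aggregator.

    rank_fragments[r] is a list of (file_offset, data) pieces owned by rank r.
    Returns the number of write calls issued against the file.
    """
    # Each aggregator owns one contiguous "file domain" of this many bytes.
    domain = -(-total_size // num_aggregators)  # ceiling division
    buffers = [
        bytearray(max(0, min(domain, total_size - a * domain)))
        for a in range(num_aggregators)
    ]

    # Stage 1: ranks ship their fragments to the aggregator that owns the
    # target offset (a network exchange in real MPI, a memory copy here).
    for fragments in rank_fragments:
        for offset, data in fragments:
            pos = 0
            while pos < len(data):
                agg = (offset + pos) // domain
                start = (offset + pos) % domain
                n = min(len(data) - pos, domain - start)
                buffers[agg][start:start + n] = data[pos:pos + n]
                pos += n

    # Stage 2: every aggregator issues a single large sequential write.
    writes = 0
    with open(path, "wb") as out:
        for agg, buf in enumerate(buffers):
            if buf:
                out.seek(agg * domain)
                out.write(buf)
                writes += 1
    return writes


# Toy layout: 4 ranks share a 64-byte file in interleaved 4-byte pieces.
fragments = [
    [(rank * 4 + k * 16, bytes([rank]) * 4) for k in range(4)]
    for rank in range(4)
]
naive_writes = sum(len(pieces) for pieces in fragments)  # one write per piece

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "matrix.bin")
    aggregated_writes = two_phase_write(out_path, fragments, 2, 64)
    with open(out_path, "rb") as result:
        written = result.read()
```

In this toy layout the ranks would issue 16 small writes on their own, while the two aggregators issue 2 large ones and produce the same file contents.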

AI‑HPC

Deep‑learning training proceeds in many epochs. Each epoch reads training samples (often millions of tiny files) via a data‑loader, performs GPU‑accelerated computation, and occasionally writes checkpoints for fault tolerance.

Read‑heavy workload: massive small‑file reads dominate I/O.

Checkpoint writes: relatively infrequent but require high sequential bandwidth.

Interface requirements: POSIX is still the dominant API; Kubernetes CSI integration is needed for containerised training. GPUDirect Storage (GDS) is an emerging option that lets GPUs read from storage directly, avoiding an extra copy through CPU memory.

Metadata pressure: the sheer number of files makes metadata performance a critical factor.

HPDA (High‑Performance Data Analysis)

HPDA workloads (e.g., population‑scale genome sequencing) process very large files and are less latency‑sensitive, but they demand extremely high sustained throughput. Object storage with Hadoop Compatible File System (HCFS) interfaces is commonly used, providing POSIX‑like semantics on top of a cloud data lake.

Common Storage Requirements

High aggregate throughput (tens of GB/s or more) and, for HPC/AI‑HPC, low latency to keep compute resources saturated.

Excellent metadata scalability for massive small‑file workloads.

Primary POSIX file interface; HCFS as a POSIX‑compatible subset for big‑data pipelines.

Support for temporary, low‑cost storage for intermediate results.

Cost‑effective scaling to petabyte‑scale data volumes.
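
To see how the throughput and metadata requirements arise together, a back-of-the-envelope estimate helps. The sketch below is only arithmetic: the function names and every input figure (1,000 GPUs, 2,000 samples per second per GPU, 100 KB per sample, one file per sample) are hypothetical values chosen for illustration, not numbers from Baidu.

```python
def required_read_bandwidth_gb_s(num_gpus, samples_per_gpu_per_s, avg_sample_bytes):
    """Aggregate read bandwidth (GB/s, 1 GB = 1e9 bytes) needed to feed all GPUs."""
    return num_gpus * samples_per_gpu_per_s * avg_sample_bytes / 1e9


def required_file_opens_per_s(num_gpus, samples_per_gpu_per_s, files_per_sample=1):
    """File opens per second the metadata service must absorb."""
    return num_gpus * samples_per_gpu_per_s * files_per_sample


# Hypothetical cluster: 1,000 GPUs, 2,000 samples/s each, 100 KB per sample.
bandwidth = required_read_bandwidth_gb_s(1000, 2000, 100_000)
opens = required_file_opens_per_s(1000, 2000)
```

Under these assumed inputs the cluster needs about 200 GB/s of read bandwidth and two million file opens per second, which is why small-file workloads stress the metadata path as much as the data path.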

Baidu’s Internal High‑Performance Storage Practices

Baidu’s production workloads are dominated by AI‑HPC and HPDA (e.g., autonomous driving, speech recognition, recommendation). To satisfy diverse throughput, latency, and file‑size characteristics, Baidu built a unified storage “base” that provides:

POSIX file access and HCFS compatibility.

SDKs for custom high‑performance interfaces.

Two runtime storage options: the parallel file system (PFS) and the RapidFS cache‑acceleration layer.

Both integrate with Kubernetes CSI and the open‑source Fluid project to separate data‑loading and training phases, improving GPU utilisation.

Baidu Canghai High‑Performance Storage Solution

Parallel File System (PFS)

PFS is deployed inside the user’s VPC. Clients communicate with the metadata servers (MDS) and object storage servers (OSS) over RDMA or high‑speed TCP, so each request takes a single network hop. This short I/O path yields low latency and high aggregate bandwidth. PFS is exposed to Kubernetes via a CSI driver.

RapidFS – Distributed Cache Acceleration

RapidFS turns idle memory and local SSDs on compute nodes into a peer‑to‑peer cache. It provides two acceleration mechanisms:

Hierarchical namespace caching: caches the POSIX‑style namespace of Baidu Object Storage (BOS) to reduce metadata latency.

Data caching: hot objects are stored locally, shortening the read path for frequently accessed training samples.
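
The two mechanisms can be pictured as a read-through cache in front of the object store. The sketch below is a simplified single-node illustration, not RapidFS code: `FakeObjectStore`, `CachingReader`, the LRU eviction policy, and the byte-based capacity limit are all assumptions, and it leaves out the peer-to-peer distribution across compute nodes as well as cache invalidation.

```python
from collections import OrderedDict


class FakeObjectStore:
    """Stand-in for a remote object store that counts remote calls."""

    def __init__(self, objects):
        self.objects = dict(objects)
        self.list_calls = 0
        self.get_calls = 0

    def list_keys(self, prefix):
        self.list_calls += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def get(self, key):
        self.get_calls += 1
        return self.objects[key]


class CachingReader:
    """Read-through cache: namespace listings plus LRU-evicted object data."""

    def __init__(self, store, capacity_bytes):
        self.store = store
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.listings = {}         # namespace cache: prefix -> key list
        self.data = OrderedDict()  # data cache: key -> bytes, oldest first

    def listdir(self, prefix):
        # Only the first listing of a prefix goes to the remote store.
        if prefix not in self.listings:
            self.listings[prefix] = self.store.list_keys(prefix)
        return self.listings[prefix]

    def read(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]
        blob = self.store.get(key)
        self.data[key] = blob
        self.used_bytes += len(blob)
        # Evict least recently used objects until the cache fits again.
        while self.used_bytes > self.capacity_bytes and self.data:
            _, evicted = self.data.popitem(last=False)
            self.used_bytes -= len(evicted)
        return blob


store = FakeObjectStore({
    "train/a.jpg": b"AAAA",
    "train/b.jpg": b"BBBB",
    "train/c.jpg": b"CCCC",
})
reader = CachingReader(store, capacity_bytes=8)

# Two "epochs" over the same two samples: only the first epoch goes remote.
for _epoch in range(2):
    for key in reader.listdir("train/")[:2]:
        reader.read(key)
```

After both epochs the fake store has served one listing and two object reads; the second epoch is answered entirely from the cache.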

Efficient Data Flow

Data movement between the storage base, PFS, and RapidFS is handled by:

Lifecycle policies: automatically migrate cold data from PFS to lower‑cost object storage and retrieve it on demand.

Bucket Link: binds a PFS or RapidFS directory/namespace to an object‑storage path, enabling one‑click pre‑loading and automatic warm‑up of data.
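
A rough way to think about these two mechanisms: a link maps a PFS directory onto an object-storage prefix, and a lifecycle rule picks the files that have been idle too long. The sketch below is purely illustrative and does not reflect the actual Bucket Link or lifecycle API; the `BucketLink` dataclass, both helper functions, the example paths, and the idle-time rule are hypothetical.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketLink:
    """Hypothetical binding of a PFS directory to an object-storage prefix."""
    pfs_dir: str
    bucket_prefix: str


def to_object_key(link, pfs_path):
    """Translate a path under the linked directory into an object key."""
    rel = posixpath.relpath(pfs_path, link.pfs_dir)
    if rel == ".." or rel.startswith("../"):
        raise ValueError(f"{pfs_path} is outside {link.pfs_dir}")
    return link.bucket_prefix.rstrip("/") + "/" + rel


def select_cold_files(last_access, now, max_idle_seconds):
    """Return paths whose last access is older than the idle threshold."""
    return sorted(
        path for path, accessed in last_access.items()
        if now - accessed > max_idle_seconds
    )


link = BucketLink(pfs_dir="/pfs/data", bucket_prefix="example-bucket/road-test")
last_access = {"/pfs/data/a.bin": 0, "/pfs/data/b.bin": 900}

# At time 1000 with a 500-second idle limit, only a.bin counts as cold.
cold = select_cold_files(last_access, now=1000, max_idle_seconds=500)
targets = [to_object_key(link, path) for path in cold]
```

Pre-loading would apply the same mapping in the opposite direction, copying objects under the prefix into the linked directory before a job starts.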

Unified Scheduling

Bucket Link is integrated into Kubernetes. Job schedulers can trigger data migration automatically. Using Fluid, a training job is split into a data‑loading stage and a compute stage, allowing pipeline parallelism that hides data‑loading latency and maximises GPU utilisation.
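
The benefit of separating data loading from compute can be shown with a small in-process analogy. Fluid does this at the Kubernetes level; the sketch below only mimics the idea with a background thread and a bounded queue, and the names `run_pipelined`, `load_batch`, `train_step`, and `prefetch_depth` are assumptions for illustration.

```python
import queue
import threading


def run_pipelined(batch_ids, load_batch, train_step, prefetch_depth=2):
    """Overlap loading of upcoming batches with training on the current one."""
    done = object()  # sentinel marking the end of the stream
    pending = queue.Queue(maxsize=prefetch_depth)
    errors = []

    def loader():
        # Data-loading stage: runs ahead of training by up to prefetch_depth.
        try:
            for batch_id in batch_ids:
                pending.put(load_batch(batch_id))
        except Exception as exc:  # surface loader failures to the caller
            errors.append(exc)
        finally:
            pending.put(done)

    worker = threading.Thread(target=loader, daemon=True)
    worker.start()

    # Compute stage: consumes batches as soon as they are ready.
    results = []
    while True:
        batch = pending.get()
        if batch is done:
            break
        results.append(train_step(batch))

    worker.join()
    if errors:
        raise errors[0]
    return results


# Toy stand-ins: "loading" builds a small list, "training" sums it.
losses = run_pipelined(range(5), lambda i: [i, i + 1], sum)
```

The bounded queue limits how far the loader can run ahead, so memory use stays predictable while the compute stage rarely has to wait for I/O.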

Benchmark Results

Experiments comparing three configurations—RapidFS, PFS with Bucket Link, and direct object‑storage training—show that both RapidFS and PFS achieve near‑full GPU utilisation, whereas direct object‑storage training suffers from low GPU usage due to I/O bottlenecks.

Customer Case Study: Autonomous‑Driving Data Pipeline

A leading autonomous‑driving partner collects petabytes of road‑test video using Baidu’s “Moonlight Box” hardware. Data are uploaded to BOS either over the network or by shipping the hardware. The partner stages the data on PFS for large‑scale GPU training, achieving petabyte‑scale capacity and high‑throughput data loading that fully utilises thousands of GPUs.

The workflow forms a closed loop: data acquisition → PFS staging → GPU training → model deployment.

For further technical details, refer to the Baidu Cloud live replay at https://cloud.baidu.com/live/54.
