How Bilibili Scaled AI Model Training with Alluxio Cache Acceleration

This article details Bilibili's multi-layer storage architecture and Alluxio‑based cache acceleration for large‑scale AI model training, covering challenges of high‑throughput, low‑latency file access, metadata scalability, fault tolerance, and the engineering solutions that boosted I/O performance up to ten‑fold.


1. Background

As model training scales up, improving training efficiency and reducing compute cost have become key priorities across the industry. In cluster‑based model training, the underlying storage must simultaneously provide high‑throughput, low‑latency access to billions of files, petabyte‑level reliable storage, and system‑wide high availability.

1.1 Bilibili Model Training Storage Architecture

Figure 1‑1 shows Bilibili’s three‑layer design:

Business Application Layer: model training, content safety review, search recommendation, e‑commerce advertising and other core scenarios.

Machine Learning Platform Layer: provides end‑to‑end lifecycle management from data preprocessing to model training and inference deployment.

Infrastructure Storage Layer: integrates multiple storage services to form a complementary ecosystem, including HDFS (EB‑scale reliable storage), NAS (low‑latency shared access) and BOSS object storage (self‑developed distributed object store).

1.2 Model Training Data Processing

The storage requirements focus on two stages:

Data aggregation & preprocessing: high‑throughput, large‑capacity storage and a unified high‑speed file access interface.

Model training: low‑latency reads of massive numbers of small files, plus high‑frequency checkpoint writes requiring sub‑millisecond latency.

The overall demands are large capacity, high I/O performance, reduced access cost through a unified POSIX layer, and system stability.

The existing EB‑scale HDFS clusters suffer from limited HDD throughput, high DataNode load, a non‑negligible HDD failure rate (~0.05%), and poor compatibility with training frameworks.

2. Solution Selection

2.1 Solution Comparison

Benchmarking of internal NAS, industry PFS offerings, and the open‑source 3fs showed that all‑SSD solutions meet the I/O requirements but are cost‑prohibitive at petabyte scale, while PFS with hot‑cold tiering introduces additional complexity, consistency risks, and operational cost.

We selected a cache‑accelerated approach using Alluxio, which reuses idle SSDs in the existing cluster, avoids extra hardware purchase, and simplifies management.

2.2 Alluxio‑Based Storage Acceleration

Alluxio 2.9.4 is deployed as a cache cluster: workers run on idle SSDs of the HDFS cluster, and the system supports both BOSS object storage and HDFS as back‑ends. Training accesses data in two steps: (1) data is pre‑loaded into the Alluxio cache, and (2) training jobs read it transparently through an Alluxio FUSE mount. This reuses idle SSD resources at near‑zero additional hardware cost, improves small‑file read throughput by 6.3×, and delivers roughly a 10× I/O gain and a 2.5× overall training speedup.
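
Because the FUSE mount exposes cached data as an ordinary local file system, training code needs no Alluxio‑specific client. Below is a minimal sketch of that access pattern, assuming a hypothetical mount point such as /mnt/alluxio/training; the actual path depends on how alluxio-fuse is deployed.

```python
import os

# Hypothetical mount point; the real path depends on how alluxio-fuse is deployed.
ALLUXIO_MOUNT = "/mnt/alluxio/training"

def iter_samples(dataset_dir: str):
    """Yield (path, bytes) pairs read through the FUSE mount with plain POSIX I/O.

    Hot files are served from the SSD cache; misses fall through to the
    HDFS/BOSS under-store, all transparently to the training code.
    """
    root = os.path.join(ALLUXIO_MOUNT, dataset_dir)
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                yield path, f.read()

if __name__ == "__main__":
    for path, payload in iter_samples("images"):
        print(path, len(payload))
```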

Figure 2‑3 shows the Alluxio vs HDFS throughput comparison.

Introducing Alluxio also brings new challenges:

Metadata service bottleneck: a single Master struggles with billions of metadata entries.

A larger fault domain, since more components sit in the data path.

Higher fault‑tolerance requirements for remote worker access.

3. Challenges and Countermeasures

3.1 System Stability Assurance & Optimization

3.1.1 Master Node Stability

Alluxio uses a Master/Slave architecture. We optimized the checkpoint mechanism with a dual‑threshold trigger (a checkpoint is taken after the 6‑hour interval once more than 10 k new journal entries have accumulated, or immediately once entries exceed 2 M) and separated journal, audit, and metastore logs onto dedicated NVMe disks, reducing cold‑start time from ~25 min to ~5 min.
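
As a minimal sketch of the dual‑threshold idea (the thresholds below come from the article, but the function and variable names are illustrative, not Alluxio's actual configuration keys):

```python
import time

# Thresholds from the article; names and structure are illustrative only.
SOFT_ENTRY_THRESHOLD = 10_000       # checkpoint if this many entries accumulated...
SOFT_INTERVAL_SECONDS = 6 * 3600    # ...and at least 6 hours have passed
HARD_ENTRY_THRESHOLD = 2_000_000    # checkpoint immediately at this backlog

def should_checkpoint(entries_since_last: int, last_checkpoint_ts: float,
                      now: float | None = None) -> bool:
    """Dual-threshold trigger: a periodic soft condition plus an unconditional hard cap."""
    now = time.time() if now is None else now
    if entries_since_last >= HARD_ENTRY_THRESHOLD:
        return True
    return (now - last_checkpoint_ts >= SOFT_INTERVAL_SECONDS
            and entries_since_last >= SOFT_ENTRY_THRESHOLD)
```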

We also enabled workers to push block metadata to all Masters, cutting master‑switch downtime from minutes to seconds.

Metadata consistency issues after master switch were solved by adding journal synchronization for block updates.

3.1.2 Worker Node Stability

Worker OOM caused by repeated Load tasks was fixed by synchronizing job status via journal entries and checking task state during master failover.

Memory fragmentation in NioDirectBufferPool was addressed with a zero‑occupancy buffer cleaner and an 80 % memory‑usage threshold that triggers buffer merging, eliminating OOM in large Load scenarios.
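
The policy can be illustrated with a toy model of a size‑bucketed buffer pool; this is not Alluxio's NioDirectBufferPool implementation, only a sketch of the cleanup behavior under an assumed 80 % threshold:

```python
from dataclasses import dataclass

@dataclass
class SizeBucket:
    """One size class of pooled direct buffers (a toy model, not the real pool)."""
    size: int
    free: int = 0     # buffers returned to the pool but still held
    in_use: int = 0   # buffers currently lent out

class PooledBufferManager:
    RECLAIM_THRESHOLD = 0.80  # start reclaiming once 80% of the budget is used

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.buckets: dict[int, SizeBucket] = {}

    def used_bytes(self) -> int:
        return sum(b.size * (b.in_use + b.free) for b in self.buckets.values())

    def release_idle_buckets(self) -> None:
        """Drop size classes with zero occupancy (the 'zero-occupancy buffer cleaner')."""
        for size in [s for s, b in self.buckets.items() if b.in_use == 0]:
            del self.buckets[size]

    def maybe_reclaim(self) -> None:
        """Past the threshold, free idle buckets so fragmented memory can be merged
        into larger buffers instead of accumulating and causing OOM."""
        if self.used_bytes() >= self.RECLAIM_THRESHOLD * self.capacity:
            self.release_idle_buckets()
```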

3.1.3 Data Read Fault‑Tolerance

We built a multi‑level fallback in the Fuse client: when Alluxio anomalies are detected, the client seamlessly switches to the underlying storage using the last known offset, isolating degraded traffic and gradually restoring the cache path.
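
A simplified sketch of the fallback logic, assuming both the cache and the under‑store are reachable as file paths (for example via FUSE and a direct mount); the class and paths are illustrative, not the actual Fuse client code:

```python
class FallbackReader:
    """Try the Alluxio cache path first; on an I/O error, resume from the same
    offset against the under-store path and keep reads there until recovery."""

    def __init__(self, cache_path: str, ufs_path: str):
        self.cache_path = cache_path
        self.ufs_path = ufs_path
        self.offset = 0
        self.degraded = False   # once tripped, degraded traffic stays on the under-store

    def read_chunk(self, size: int) -> bytes:
        path = self.ufs_path if self.degraded else self.cache_path
        try:
            with open(path, "rb") as f:
                f.seek(self.offset)        # resume from the last known offset
                data = f.read(size)
        except OSError:
            if self.degraded:
                raise                      # both paths failed; surface the error
            self.degraded = True           # switch seamlessly to the underlying storage
            return self.read_chunk(size)
        self.offset += len(data)
        return data
```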

3.1.4 Metadata Consistency Optimization

We reduced the sync frequency to once every 10 minutes, routed heavy operations (rename, delete, ls) directly to the underlying file system, cached listStatus results, and used a thread pool of HDFS clients to avoid lock contention, achieving an average sync latency of ~4 ms and a 2× improvement in small‑file reads.
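
The client‑pool idea can be sketched as follows; `make_client` and `list_status` are assumed placeholders for a real HDFS client factory and its listing call, not Alluxio or HDFS APIs:

```python
import queue
from concurrent.futures import ThreadPoolExecutor

class UfsClientPool:
    """Keep N independent under-store (HDFS) client handles in a queue so concurrent
    metadata syncs never serialize on a single client's internal lock."""

    def __init__(self, make_client, pool_size: int = 8):
        self._clients: queue.Queue = queue.Queue()
        for _ in range(pool_size):
            self._clients.put(make_client())
        self._executor = ThreadPoolExecutor(max_workers=pool_size)
        self._list_cache: dict[str, list[str]] = {}   # cached listStatus results

    def _list_status(self, path: str) -> list[str]:
        if path in self._list_cache:                   # serve repeated listings from cache
            return self._list_cache[path]
        client = self._clients.get()                   # borrow a dedicated client handle
        try:
            result = client.list_status(path)
        finally:
            self._clients.put(client)
        self._list_cache[path] = result
        return result

    def sync_paths(self, paths: list[str]) -> list[list[str]]:
        """Fan metadata syncs out across the pool instead of one shared client."""
        return list(self._executor.map(self._list_status, paths))
```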

3.2 Metadata Capacity & Scalability

When a single Alluxio cluster exceeds 300 M files, metadata becomes a bottleneck. We introduced a federated routing architecture: a Router maintains dynamic path‑to‑cluster rules, Fuse clients embed routing decisions, and new clusters register path rules for seamless integration.
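
A minimal sketch of the routing rule lookup with longest‑prefix matching; the cluster names and rules are examples only, not Bilibili's real configuration:

```python
class PathRouter:
    """Map path prefixes to Alluxio clusters; new clusters register rules here
    and are picked up without client-side changes."""

    def __init__(self):
        self._rules: dict[str, str] = {}

    def register(self, prefix: str, cluster: str) -> None:
        self._rules[prefix.rstrip("/")] = cluster

    def route(self, path: str) -> str:
        """Return the cluster owning the longest matching prefix of `path`."""
        best = None
        for prefix, cluster in self._rules.items():
            if path == prefix or path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best[0]):
                    best = (prefix, cluster)
        if best is None:
            raise KeyError(f"no routing rule for {path}")
        return best[1]

router = PathRouter()
router.register("/training/nlp", "alluxio-cluster-a")
router.register("/training/cv", "alluxio-cluster-b")
print(router.route("/training/cv/imagenet/part-0001"))   # -> alluxio-cluster-b
```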

To further reduce metadata load, we implemented small‑file folding: files <4 MB are aggregated into GB‑scale containers with an index block, cutting metadata volume by 98.3 % (from 58 M to 0.56 M files).
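
The folding idea can be sketched as packing small files into a single container file plus a JSON index; the on‑disk layout here is an assumption used only to illustrate the metadata reduction:

```python
import json
import os

SMALL_FILE_LIMIT = 4 * 1024 * 1024   # files below 4 MB are folded, per the article

def fold_small_files(src_dir: str, container_path: str, index_path: str) -> None:
    """Concatenate small files into one container and write an index block
    mapping each original name to (offset, length)."""
    index, offset = {}, 0
    with open(container_path, "wb") as container:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path) or os.path.getsize(path) >= SMALL_FILE_LIMIT:
                continue
            with open(path, "rb") as f:
                data = f.read()
            container.write(data)
            index[name] = {"offset": offset, "length": len(data)}
            offset += len(data)
    with open(index_path, "w") as f:
        json.dump(index, f)

def read_folded(container_path: str, index_path: str, name: str) -> bytes:
    """Random-access read of one original file out of the container via the index."""
    with open(index_path) as f:
        entry = json.load(f)[name]
    with open(container_path, "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])
```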

3.3 Write Stability & Consistency Balance

Single‑replica writes in Alluxio cause checkpoint failures when a worker crashes. We introduced a direct‑write path to HDFS: high‑priority checkpoints go to SSD‑backed HDFS, while preprocessing data writes to HDD‑backed HDFS with native replication, providing both low latency and high durability.
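
A minimal sketch of the write‑path selection; the tier roots below are hypothetical and deployment‑specific:

```python
from enum import Enum

class WriteClass(Enum):
    CHECKPOINT = "checkpoint"       # latency-sensitive, must survive worker loss
    PREPROCESSED = "preprocessed"   # throughput-oriented bulk data

# Hypothetical mount roots for the two HDFS tiers.
SSD_HDFS_ROOT = "hdfs://ssd-cluster/checkpoints"
HDD_HDFS_ROOT = "hdfs://hdd-cluster/preprocessed"

def resolve_write_target(write_class: WriteClass, relative_path: str) -> str:
    """Route high-priority checkpoint writes straight to SSD-backed HDFS (bypassing the
    single-replica Alluxio cache) and bulk preprocessing output to HDD-backed HDFS,
    where native replication provides durability."""
    root = SSD_HDFS_ROOT if write_class is WriteClass.CHECKPOINT else HDD_HDFS_ROOT
    return f"{root}/{relative_path.lstrip('/')}"

print(resolve_write_target(WriteClass.CHECKPOINT, "job-42/step-1000/model.ckpt"))
```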

3.4 Platform Features & Ecosystem Integration

We built an Alluxio cache management platform offering unified multi‑source storage access, cache lifecycle workflows, quota enforcement, and visual monitoring. Automated data pre‑loading/eviction, smart quota control, and fine‑grained operation analytics improve resource efficiency.

Dynamic directory‑sharding sync further mitigates metadata service slowdown when syncing millions of files.
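
A sketch of the sharding idea: hash each directory entry to a shard and sync the shards in parallel; `sync_one` is an assumed callback standing in for the real per‑batch sync routine:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def shard_of(path: str, num_shards: int) -> int:
    """Stable hash-based shard assignment so one huge directory is split into slices."""
    digest = hashlib.md5(path.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

def sharded_sync(entries: list[str], sync_one, num_shards: int = 16) -> None:
    """Partition a directory's entries into shards and sync each shard on its own
    worker thread, so a directory with millions of children no longer serializes
    on a single sync task."""
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    for path in entries:
        shards[shard_of(path, num_shards)].append(path)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        list(pool.map(sync_one, [s for s in shards if s]))
```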

4. Summary and Future Plans

Alluxio clusters now reliably support AI model training workloads, delivering up to ten‑fold I/O gains and stable checkpointing. Future work includes refining cache eviction based on data replaceability, enhancing Fuse with multi‑path I/O and shared‑memory support, and delivering an end‑to‑end storage acceleration solution covering data preprocessing, distributed training, and inference.
