
How Weibo Boosted Deep Learning Training Speed 18× with Fluid and JindoRuntime

Weibo’s deep learning platform suffered from high latency and instability when reading massive small‑file datasets through a compute‑storage‑separated architecture. The team adopted the CNCF Fluid project with JindoRuntime to build a distributed cache that is accessed through POSIX interfaces. The result was better data locality, lower HDFS load, training speedups of up to 18×, and a rise in training success rate from 37 % to 98 %.

Alibaba Cloud Native

Jun 3, 2021


Background

Weibo’s deep learning platform processes billions of daily posts and relies on a compute‑storage‑separated architecture. While this decouples compute from storage for flexible scaling, it introduces high data‑access latency and stability issues for massive small‑file workloads.

Challenges of Large‑Scale Model Training

High latency of compute‑storage separation: Accessing millions of small files via HDFS is 10‑100× slower than local reads.

Kubernetes scheduler unaware of data cache: Repeated accesses (e.g., hyper‑parameter sweeps, fine‑tuning, AutoML) cannot reuse cached data because the native scheduler does not consider cache locality.

Frameworks lack HDFS support: Popular frameworks such as PyTorch and MXNet only support POSIX interfaces, requiring extra development to handle HDFS.

HDFS becomes a performance bottleneck: Hundreds of GPU nodes concurrently access HDFS, causing severe I/O pressure and reducing training stability.

Goals for the New Architecture

Enable local‑data access to eliminate repeated network reads and improve GPU utilization.

Reduce HDFS load by serving a portion of data from local caches.

Schedule tasks transparently onto nodes that already hold the required data, so that jobs benefit from cache locality without user involvement.

Provide a unified POSIX interface for both development and training phases, simplifying code.
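To make the last goal concrete, here is a minimal sketch of what "unified POSIX access" means for training code: the data loader only walks a directory and opens files, so the same code can run against a local disk during development and against a FUSE mount during training. The directory layout, the `.jpg` suffix, and the helper names are illustrative assumptions, and a temporary directory stands in for the real mount point.

```python
import os
import tempfile


def list_samples(root, suffix=".jpg"):
    """Return the sorted paths of all files under root that end with suffix."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def read_sample(path):
    """Read one sample with plain file I/O; no HDFS client is involved."""
    with open(path, "rb") as handle:
        return handle.read()


# A temporary directory stands in for the mounted dataset (hypothetical layout).
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "video_0001"))
for index in range(3):
    frame = os.path.join(demo_root, "video_0001", f"frame_{index:05d}.jpg")
    with open(frame, "wb") as handle:
        handle.write(b"fake-jpeg-bytes")

samples = list_samples(demo_root)
first_bytes = read_sample(samples[0])
```

Because nothing here is storage-specific, switching from local files to the cached dataset would only mean changing the root path.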

Why Choose Fluid + JindoRuntime

Fluid orchestrates data and compute on Kubernetes, exposing a PersistentVolumeClaim interface that integrates seamlessly with existing workloads.

JindoRuntime implements a distributed cache based on JindoFS, offering high‑performance small‑file access and compatibility with HDFS, OSS, S3, etc.

Hierarchical metadata and slab‑allocation enable efficient small‑file lookup and optimal cache utilization.

Data‑aware scheduling automatically places jobs on cache‑enabled nodes without user intervention.

Different caching strategies for large and small files adapt automatically to AI training scenarios.
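As a rough illustration of the PersistentVolumeClaim interface mentioned above, the sketch below builds a Pod manifest that mounts a claim named after the dataset and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The dataset name `weibo-frames`, the container image, and the `/data` mount path are hypothetical; the assumption that the claim carries the same name as the Fluid Dataset follows Fluid's documented behavior but should be checked against the version in use.

```python
import json


def training_pod(dataset_name, image, mount_path="/data"):
    """Build a Pod manifest that mounts the dataset's PVC at mount_path."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"{dataset_name}-trainer"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,  # hypothetical training image
                    "volumeMounts": [
                        {"name": "training-data", "mountPath": mount_path}
                    ],
                }
            ],
            "volumes": [
                {
                    "name": "training-data",
                    # Assumption: the PVC has the same name as the Dataset.
                    "persistentVolumeClaim": {"claimName": dataset_name},
                }
            ],
        },
    }


pod = training_pod("weibo-frames", "example.com/trainer:latest")
pod_json = json.dumps(pod, indent=2)
print(pod_json)
```

The Pod carries no node selector of its own; placement on nodes that hold cached data is left to the data‑aware scheduling described above.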

Architecture Components

Fluid

Fluid is a cloud‑native, scalable data orchestration and acceleration system that runs on Kubernetes. It addresses high data‑access latency, the difficulty of working with data spread across multiple sources, and complex data‑usage workflows.

JindoRuntime

JindoRuntime is Fluid’s distributed cache runtime built on JindoFS, a high‑performance storage engine compatible with Hadoop File System interfaces. It caches remote files, supports multiple storage backends, and offers POSIX‑compatible access via FUSE.
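The two components are typically declared as a pair of custom resources: a Dataset that points at the remote storage and a JindoRuntime of the same name that provides the cache. The sketch below generates such a pair as a JSON list. The field names follow Fluid's `data.fluid.io/v1alpha1` API as we understand it; the HDFS address, cache path, quota, and replica count are placeholder assumptions, not Weibo's actual settings.

```python
import json

API_VERSION = "data.fluid.io/v1alpha1"


def build_dataset(name, mount_point):
    """Describe the remote data source that Fluid should expose."""
    return {
        "apiVersion": API_VERSION,
        "kind": "Dataset",
        "metadata": {"name": name},
        "spec": {"mounts": [{"mountPoint": mount_point, "name": name}]},
    }


def build_jindo_runtime(name, replicas, cache_path, quota):
    """Describe the cache workers that back the Dataset of the same name."""
    return {
        "apiVersion": API_VERSION,
        "kind": "JindoRuntime",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "tieredstore": {
                "levels": [
                    {
                        "mediumtype": "HDD",  # cache medium on each worker
                        "path": cache_path,   # local directory used for the cache
                        "quota": quota,       # cache capacity per worker
                        "high": "0.99",       # upper watermark ratio
                        "low": "0.8",         # lower watermark ratio
                    }
                ]
            },
        },
    }


# All concrete values below are hypothetical.
dataset = build_dataset("weibo-frames", "hdfs://namenode.example:8020/datasets/frames")
runtime = build_jindo_runtime("weibo-frames", replicas=3, cache_path="/mnt/disk1", quota="500G")

manifest = {"apiVersion": "v1", "kind": "List", "items": [dataset, runtime]}
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Giving both objects the same name is what lets Fluid bind the runtime to the dataset.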

Practical Deployment

Select appropriate cache nodes: Use nodeAffinity to place dataset caches on nodes with large disks and high‑speed network interfaces.

Master scheduling strategy: Pin the master component to a reliable host with nodeSelector to ensure stability; a single‑master setup proved robust as long as the host machine stays healthy.

Periodic data pre‑warming: Use Fluid’s CRD and a Kubernetes CronJob to preload metadata and data before training, reducing start‑up latency. Incremental sync further speeds up subsequent runs.
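The three practices above map onto a few manifest fragments, sketched here under stated assumptions. The `nodeAffinity` block is meant for a Dataset's `spec`, the `master.nodeSelector` block for a JindoRuntime's `spec`, and the DataLoad object is what a scheduled job could submit to pre‑warm the cache. The label key `cache-node`, the host name, and the object names are hypothetical, and the exact field layout should be verified against the Fluid release in use.

```python
import json

API_VERSION = "data.fluid.io/v1alpha1"


def cache_node_affinity(label_key, label_value="true"):
    """Dataset spec fragment: only cache on nodes carrying the given label."""
    return {
        "nodeAffinity": {
            "required": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": label_key, "operator": "In", "values": [label_value]}
                        ]
                    }
                ]
            }
        }
    }


def master_placement(hostname):
    """JindoRuntime spec fragment: pin the master to one reliable host."""
    return {"master": {"nodeSelector": {"kubernetes.io/hostname": hostname}}}


def build_dataload(name, dataset_name, path="/", namespace="default"):
    """DataLoad object: preload metadata and data for the given dataset path."""
    return {
        "apiVersion": API_VERSION,
        "kind": "DataLoad",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "dataset": {"name": dataset_name, "namespace": namespace},
            "loadMetadata": True,
            "target": [{"path": path, "replicas": 1}],
        },
    }


# Hypothetical label, host, and names, for illustration only.
dataset_spec_extra = cache_node_affinity("cache-node")
runtime_spec_extra = master_placement("gpu-host-01")
dataload = build_dataload("weibo-frames-warmup", "weibo-frames")
print(json.dumps(dataload, indent=2))
```

A CronJob would then only need to submit a manifest like the DataLoad above on its schedule.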

Performance Test Plan

The solution was evaluated with a video‑understanding model (mmaction) trained on 4 M raw frames (~780 GB) across multiple GPU nodes. Experiments scaled from 1‑machine‑8‑GPU to 3‑machine‑12‑GPU configurations, comparing the traditional HDFS interface with Fluid + JindoRuntime.

Performance Test Results

With data pre‑warming, Fluid + JindoRuntime achieved dramatic speedups: 1‑machine‑4‑GPU saw a 5× improvement, 2‑machine‑8‑GPU a 9× improvement, and 3‑machine‑12‑GPU an 18× improvement. Training time dropped from 389 hours (16 days) to 16 hours, and success rates rose from 37.1 % to 98.3 %.

Conclusion

Integrating Fluid + JindoRuntime significantly improves performance and stability for small‑file training scenarios, delivering up to 18× speedup and reducing HDFS pressure. The solution scales with data volume (currently 4 TB and growing) and raises overall training success rates.

Future Work

Support dynamic scaling of scheduled tasks.

Enhance data pre‑warming and metadata backup for rapid dataset reconstruction.

Provide a performance monitoring dashboard.

Ensure high availability and seamless upgrades of the runtime.

Manage the full lifecycle of multiple datasets in large‑scale Kubernetes clusters.