How Weibo Boosted Deep Learning Training Speed 18× with Fluid and JindoRuntime
Weibo’s deep learning platform suffered high latency and instability when reading massive small‑file datasets through a compute‑storage‑separated architecture. The team adopted the CNCF Fluid project with JindoRuntime to add a distributed cache exposed through POSIX interfaces. This improved data locality, reduced HDFS load, and delivered training speedups of up to 18× while raising job success rates from 37 % to 98 %.
Background
Weibo’s deep learning platform processes billions of daily posts and relies on a compute‑storage‑separated architecture. While this decouples compute from storage for flexible scaling, it introduces high data‑access latency and stability issues for massive small‑file workloads.
Challenges of Large‑Scale Model Training
High latency of compute‑storage separation: Accessing millions of small files via HDFS is 10‑100× slower than local reads.
Kubernetes scheduler unaware of data cache: Repeated accesses (e.g., hyper‑parameter sweeps, fine‑tuning, AutoML) cannot reuse cached data because the native scheduler does not consider cache locality.
Frameworks lack HDFS support: Popular frameworks such as PyTorch and MXNet only support POSIX interfaces, requiring extra development to handle HDFS.
HDFS becomes a performance bottleneck: Hundreds of GPU nodes concurrently access HDFS, causing severe I/O pressure and reducing training stability.
Goals for the New Architecture
Enable local‑data access to eliminate repeated network reads and improve GPU utilization.
Reduce HDFS load by serving a portion of data from local caches.
Exploit cache nodes transparently by scheduling tasks onto nodes that already hold the required data.
Provide a unified POSIX interface for both development and training phases, simplifying code.
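The cache-locality goal above can be illustrated with a minimal sketch. This is not Fluid's scheduler; it is a toy selection rule, and the Node class, its fields, and the pick_node function are hypothetical names introduced only for illustration. Among nodes with enough free GPUs, it prefers one that already caches the dataset.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical view of a cluster node, for illustration only.
    name: str
    free_gpus: int
    cached_datasets: set[str] = field(default_factory=set)


def pick_node(nodes: list[Node], dataset: str, gpus_needed: int) -> Node | None:
    """Return a node with enough free GPUs, preferring one that caches `dataset`."""
    candidates = [n for n in nodes if n.free_gpus >= gpus_needed]
    if not candidates:
        return None
    # Tuples compare element by element: a cache hit (True) outranks a miss
    # (False); ties are broken by the number of free GPUs.
    return max(candidates, key=lambda n: (dataset in n.cached_datasets, n.free_gpus))


if __name__ == "__main__":
    cluster = [Node("a", 8), Node("b", 4, {"frames"}), Node("c", 2, {"frames"})]
    print(pick_node(cluster, "frames", 4).name)  # b: smaller, but holds the cache
```

A scheduler that ignores the first tuple element would pick node "a" here and force every read back over the network, which is the behavior the goals aim to avoid.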
Why Choose Fluid + JindoRuntime
Fluid orchestrates data and compute on Kubernetes, exposing a PersistentVolumeClaim interface that integrates seamlessly with existing workloads.
JindoRuntime implements a distributed cache based on JindoFS, offering high‑performance small‑file access and compatibility with HDFS, OSS, S3, etc.
Hierarchical metadata and slab‑allocation enable efficient small‑file lookup and optimal cache utilization.
Data‑aware scheduling automatically places jobs on cache‑enabled nodes without user intervention.
Different caching strategies for large and small files adapt automatically to AI training scenarios.
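To make the slab-allocation idea concrete, here is a generic toy allocator. The source does not describe JindoFS internals, so the slab size, the size classes, and the SlabCache class are assumptions for illustration, not JindoFS's actual design. Each small file takes one fixed-size slot from the smallest size class that fits it, so evicting a file frees a slot that any later file of that class can reuse.

```python
class SlabCache:
    """Toy slab allocator: each size class owns fixed-size slots carved from slabs."""

    def __init__(self, slab_bytes=1 << 20, size_classes=(4 << 10, 64 << 10, 256 << 10)):
        self.slab_bytes = slab_bytes
        self.size_classes = sorted(size_classes)
        self.free_slots = {c: [] for c in self.size_classes}  # class -> [(slab_id, offset)]
        self.slabs_allocated = 0
        self.index = {}  # file name -> (size class, slab_id, offset, length)

    def _class_for(self, length):
        # Smallest size class whose slot can hold the object.
        for c in self.size_classes:
            if length <= c:
                return c
        raise ValueError("object larger than the biggest size class")

    def put(self, name, length):
        c = self._class_for(length)
        if not self.free_slots[c]:
            # No free slot in this class: carve a fresh slab into equal slots.
            slab_id = self.slabs_allocated
            self.slabs_allocated += 1
            self.free_slots[c] = [(slab_id, off) for off in range(0, self.slab_bytes, c)]
        slab_id, off = self.free_slots[c].pop()
        self.index[name] = (c, slab_id, off, length)
        return slab_id, off

    def evict(self, name):
        c, slab_id, off, _ = self.index.pop(name)
        # The slot returns to its class's free list and can be reused as-is.
        self.free_slots[c].append((slab_id, off))


if __name__ == "__main__":
    cache = SlabCache()
    for i in range(256):
        cache.put(f"frame{i}.jpg", 3000)  # 256 files of ~3 KB share one 1 MiB slab
    print(cache.slabs_allocated)
```

The trade-off in any such scheme is internal waste (a 3 KB file occupies a 4 KB slot) in exchange for constant-time placement and no external fragmentation.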
Architecture Components
Fluid
Fluid is a cloud‑native, scalable data orchestration and acceleration system that runs on Kubernetes. It addresses data‑access latency, multi‑source data federation, and complex data‑usage workflows.
JindoRuntime
JindoRuntime is Fluid’s distributed cache runtime built on JindoFS, a high‑performance storage engine compatible with Hadoop File System interfaces. It caches remote files, supports multiple storage backends, and offers POSIX‑compatible access via FUSE.
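Because the cache is exposed as a POSIX file system, training code can read it with ordinary file APIs and needs no HDFS client. The sketch below assumes the dataset volume is mounted at a path such as /data/frames; that path, the DATA_ROOT variable, and the iter_samples helper are hypothetical and depend on how the volume is mounted in the pod.

```python
import os
from pathlib import Path

# Hypothetical mount point; override with the DATA_ROOT environment variable.
DATA_ROOT = Path(os.environ.get("DATA_ROOT", "/data/frames"))


def iter_samples(root, suffixes=(".jpg", ".png")):
    """Yield (path, raw bytes) for every matching file under root, in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort makes the traversal order deterministic
        for fname in sorted(filenames):
            if fname.lower().endswith(suffixes):
                path = Path(dirpath) / fname
                yield path, path.read_bytes()


if __name__ == "__main__":
    total = sum(len(data) for _, data in iter_samples(DATA_ROOT))
    print(f"read {total} bytes from {DATA_ROOT}")
```

The same function works unchanged against a local directory during development, which is the point of a unified POSIX interface.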
Practical Deployment
Select appropriate cache nodes: Use nodeAffinity to schedule datasets onto nodes with large disks and high‑speed network interfaces.
Master scheduling strategy: Pin the master component to reliable hosts with nodeSelector to ensure stability; a single‑master setup proved robust as long as the host machines were healthy.
Periodic data pre‑warming: Use Fluid’s CRD and a Kubernetes CronJob to preload metadata and data before training, reducing start‑up latency. Incremental sync further speeds up subsequent runs.
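The node selection and pre-warming steps above can be sketched as Kubernetes manifests built in Python. The field names follow Fluid's v1alpha1 Dataset and DataLoad resources as commonly documented, but they should be checked against the installed Fluid version. The dataset name, HDFS URI, and node label below are hypothetical placeholders, not values from Weibo's deployment.

```python
import json


def dataset_manifest(name, hdfs_uri, label_key, label_value):
    """Dataset whose cache is restricted to nodes carrying a given label."""
    return {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "Dataset",
        "metadata": {"name": name},
        "spec": {
            "mounts": [{"mountPoint": hdfs_uri, "name": name}],
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": label_key,
                            "operator": "In",
                            "values": [label_value],
                        }]
                    }]
                }
            },
        },
    }


def dataload_manifest(name, dataset_name, namespace="default", path="/"):
    """DataLoad that pre-warms metadata and data for one dataset path."""
    return {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "DataLoad",
        "metadata": {"name": name},
        "spec": {
            "dataset": {"name": dataset_name, "namespace": namespace},
            "loadMetadata": True,
            "target": [{"path": path, "replicas": 1}],
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    ds = dataset_manifest("frames", "hdfs://namenode:8020/frames", "cache-node", "true")
    load = dataload_manifest("frames-warmup", "frames")
    for manifest in (ds, load):
        print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so each printed object can be saved to a file and applied with kubectl apply -f. A CronJob could re-create the DataLoad on a schedule to implement the periodic pre-warming.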
Performance Test Plan
The solution was evaluated with a video‑understanding model (mmaction) trained on 4 M raw frames (~780 GB) across multiple GPU nodes. Experiments scaled from 1‑machine‑4‑GPU to 3‑machine‑12‑GPU configurations, comparing the traditional HDFS interface with Fluid + JindoRuntime.
Performance Test Results
With data pre‑warming, Fluid + JindoRuntime achieved dramatic speedups: 1‑machine‑4‑GPU saw a 5× improvement, 2‑machine‑8‑GPU a 9× improvement, and 3‑machine‑12‑GPU an 18× improvement. Training time dropped from 389 hours (16 days) to 16 hours, and success rates rose from 37.1 % to 98.3 %.
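The quoted figures can be sanity-checked with a few lines of arithmetic. The expected-attempts estimate assumes each training attempt succeeds independently with the reported success rate; that is a simplifying assumption of this sketch, not something the source states. The source also does not say which configuration the 389-hour and 16-hour totals refer to, so their ratio need not equal any of the per-configuration speedups.

```python
before_hours, after_hours = 389, 16

print(f"{before_hours / 24:.1f} days before, {after_hours} hours after")  # 16.2 days
print(f"wall-clock ratio: {before_hours / after_hours:.1f}x")             # 24.3x


def expected_attempts(success_rate):
    """Mean attempts until the first success, assuming independent attempts."""
    return 1 / success_rate


for rate in (0.371, 0.983):
    print(f"success rate {rate:.1%}: about {expected_attempts(rate):.2f} attempts per completed job")
```

Under that independence assumption, a 37.1 % success rate means roughly 2.7 launches per completed job, against about 1.02 at 98.3 %, so the stability gain compounds the raw speedup.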
Conclusion
Integrating Fluid + JindoRuntime significantly improves performance and stability for small‑file training scenarios, delivering up to 18× speedup and reducing HDFS pressure. The solution scales with data volume (currently 4 TB and growing) and raises overall training success rates.
Future Work
Support dynamic scaling of scheduled tasks.
Enhance data pre‑warming and metadata backup for rapid dataset reconstruction.
Provide a performance monitoring dashboard.
Ensure high availability and seamless upgrades of the runtime.
Manage the full lifecycle of multiple datasets in large‑scale Kubernetes clusters.
Related Links
Fluid: https://github.com/fluid-cloudnative/fluid
JindoFS: https://github.com/aliyun/alibabacloud-jindofs
Alibaba Cloud Native
