How Fluid + JindoRuntime Supercharged Autonomous Driving Model Training

This article details how the Fluid CNCF project combined with JindoRuntime was used to overcome storage‑compute separation bottlenecks in an autonomous‑driving machine‑learning platform, achieving up to 300% faster training, reduced OSS bandwidth pressure, and higher GPU utilization through distributed caching on Kubernetes.

Alibaba Cloud Native

Background

Fluid is a CNCF sandbox project that provides cloud‑native data orchestration and acceleration on Kubernetes. It enables co‑placement of data and compute through PersistentVolumeClaims and supports multiple storage back‑ends (OSS, HDFS, S3) via the JindoFS engine.

Challenges in a Storage‑Compute Separated ML Platform

Training jobs repeatedly read data from OSS with high latency, leaving GPUs underutilized while they wait for input.

The Kubernetes scheduler is unaware of cached data, so pods can be scheduled onto nodes that hold no local copy.

Concurrent OSS accesses become a bandwidth bottleneck under heavy parallel training.

Training files are scattered across many directories, leading to massive metadata list operations and timeouts.

Solution Architecture

Fluid

Fluid runs as a scalable, distributed data orchestration layer on Kubernetes. It schedules workloads based on data locality, which reduces data‑access latency.

JindoRuntime

JindoRuntime is Fluid’s distributed cache runtime built on the JindoFS engine. JindoFS implements the Hadoop File System interface and supports OSS, HDFS, and S3 protocols. A FUSE‑based POSIX interface lets deep‑learning frameworks (e.g., PyTorch) read cached OSS files without code changes.
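Because the cache is exposed through a POSIX mount, a training job can list and read samples with ordinary file APIs. The sketch below uses only the Python standard library to show what "no code changes" means in practice. The function name `iter_samples` and the suffix filter are assumptions made for illustration; they are not part of Fluid or JindoFS.

```python
import os
from typing import Iterator, Tuple


def iter_samples(
    root: str, suffixes: Tuple[str, ...] = (".jpg", ".png")
) -> Iterator[Tuple[str, bytes]]:
    """Yield (relative_path, raw_bytes) for every matching file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            if name.lower().endswith(suffixes):
                full = os.path.join(dirpath, name)
                # A plain open() is all that is needed on a FUSE mount.
                with open(full, "rb") as f:
                    yield os.path.relpath(full, root), f.read()
```

Inside a training pod, `root` would be whatever path the Dataset's PersistentVolumeClaim is mounted at; the same loop works unchanged against a local directory.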

Key design choices:

Co‑placement of data and compute via PersistentVolumeClaim.

Separate warm‑up for metadata and data caches.

Fine‑grained file‑list caching to increase cache‑hit rates.

Data‑aware scheduling that automatically places pods on nodes holding the required cache.
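The idea behind data‑aware scheduling can be illustrated with a toy placement function. This is not Fluid's scheduler logic: the node names, the `cached` map, and the tie‑breaking rule are all hypothetical and only show why knowing cache locations changes the placement decision.

```python
from typing import Dict, Optional, Set


def pick_node(
    cached: Dict[str, Set[str]], dataset: str, free_gpus: Dict[str, int]
) -> Optional[str]:
    """Prefer a node that has a free GPU and already caches `dataset`."""
    candidates = [node for node, gpus in free_gpus.items() if gpus > 0]
    if not candidates:
        return None
    # Sort key: cache holders first, then most free GPUs, then name for stability.
    return min(
        candidates,
        key=lambda n: (dataset not in cached.get(n, set()), -free_gpus[n], n),
    )


cached = {"node-a": {"camera-frames"}, "node-b": set()}
free_gpus = {"node-a": 1, "node-b": 4}
print(pick_node(cached, "camera-frames", free_gpus))  # cache-aware choice
print(pick_node(cached, "lidar-sweeps", free_gpus))   # no cache anywhere
```

A locality‑blind scheduler would see only the GPU counts and favour `node-b` in both cases, which is the mismatch described in the challenges above.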

Implementation Details

Cache‑node selection: Use nodeAffinity on the Fluid Dataset to bind cache pods to nodes with high‑capacity disks and fast network interfaces.

Cache capacity and tiering: Declare the data sources in the Dataset's mounts, and set the cache directories and size limits in JindoRuntime's tieredstore. The tieredstore supports multiple storage tiers (MEM, SSD, HDD) and uses high and low watermarks to decide when cached data is evicted.

Secure OSS credentials: Store accessKeyId and accessKeySecret in a Kubernetes Secret and reference them through encryptOptions in the Dataset, so the keys never appear in the manifest itself.

Data pre‑loading: The dataload feature pre‑loads specified paths and caches their metadata, avoiding unnecessary network transfer for unused files.
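The four settings above could be expressed roughly as follows. The manifests are built as plain Python dictionaries and printed as JSON, which `kubectl apply -f` accepts alongside YAML. Field names follow the Fluid `data.fluid.io/v1alpha1` resources as commonly documented and should be checked against the installed Fluid version. The resource name, bucket, endpoint, node label, Secret name, cache path, quota, and watermark values are placeholders, not values from the original deployment.

```python
import json

NAME = "training-data"      # hypothetical Dataset/Runtime name
SECRET = "oss-credentials"  # hypothetical Kubernetes Secret holding the OSS keys


def secret_ref(option: str) -> dict:
    """Reference one key of the Secret instead of inlining the credential."""
    return {"name": option, "valueFrom": {"secretKeyRef": {"name": SECRET, "key": option}}}


dataset = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "Dataset",
    "metadata": {"name": NAME},
    "spec": {
        "mounts": [{
            "name": NAME,
            "mountPoint": "oss://<bucket>/<dir>",              # placeholder
            "options": {"fs.oss.endpoint": "<oss-endpoint>"},  # placeholder
            "encryptOptions": [
                secret_ref("fs.oss.accessKeyId"),
                secret_ref("fs.oss.accessKeySecret"),
            ],
        }],
        # Keep cache pods on nodes carrying a (hypothetical) label.
        "nodeAffinity": {"required": {"nodeSelectorTerms": [{
            "matchExpressions": [
                {"key": "cache-node", "operator": "In", "values": ["true"]},
            ],
        }]}},
    },
}

runtime = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "JindoRuntime",
    "metadata": {"name": NAME},  # same name as the Dataset it serves
    "spec": {
        "replicas": 2,
        "tieredstore": {"levels": [{
            "mediumtype": "SSD",
            "path": "/mnt/disk1",  # cache directory on the node
            "quota": "100Gi",      # size limit for this tier
            "high": "0.9",         # high watermark
            "low": "0.8",          # low watermark
        }]},
    },
}

dataload = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "DataLoad",
    "metadata": {"name": NAME + "-warmup"},
    "spec": {
        "dataset": {"name": NAME, "namespace": "default"},
        "loadMetadata": True,  # warm the metadata cache as well
        "target": [{"path": "/<subdir-to-preload>", "replicas": 1}],
    },
}


def render(*manifests: dict) -> str:
    """Serialize the manifests as a single Kubernetes List in JSON."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": list(manifests)}, indent=2)


if __name__ == "__main__":
    print(render(dataset, runtime, dataload))
```

Keeping the pre‑load targets narrow is the point of the last manifest: only the listed sub‑paths are pulled into the cache, so files a job never reads cost no OSS bandwidth.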

Performance Evaluation

Benchmarks comparing runs with and without JindoRuntime show:

Inference on 10,000 images: latency reduced by up to 70% for small models and 50% for larger models.

Training on 4 GPUs with 10,000 images: up to 300% speed‑up, dramatically increasing GPU utilization.
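Percentages like these are easy to misread, so a small conversion can help when comparing them. The helper below is plain arithmetic and is not taken from the benchmark. It assumes that an "X% latency reduction" means the new latency is (1 − X/100) of the old one, and it shows both common readings of a "300% speed‑up", since the source does not say which one is meant.

```python
def throughput_gain(latency_reduction_pct: float) -> float:
    """Throughput multiplier implied by a given percentage cut in latency."""
    if not 0 <= latency_reduction_pct < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - latency_reduction_pct / 100.0)


def remaining_time_fraction(speedup_factor: float) -> float:
    """Fraction of the original wall-clock time left after an N-times speed-up."""
    if speedup_factor <= 0:
        raise ValueError("speed-up factor must be positive")
    return 1.0 / speedup_factor


print(round(throughput_gain(70), 2))  # 70% lower latency
print(round(throughput_gain(50), 2))  # 50% lower latency
print(remaining_time_fraction(3.0))   # "300%" read as 3x as fast
print(remaining_time_fraction(4.0))   # "300%" read as 300% faster, i.e. 4x
```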

[Figure: performance chart]

Future Work

Scheduled auto‑scaling tasks for cache nodes.

Performance‑monitoring dashboards.

Lifecycle management for multiple Datasets in large clusters.

Dynamic pruning of cached data and metadata.

References

Fluid project: https://github.com/fluid-cloudnative/fluid

JindoFS repository: https://github.com/aliyun/alibabacloud-jindodata