What’s New in Fluid 0.4? DataLoad, Small‑File Boost, HDFS Support & Multi‑Dataset Deployment

Fluid 0.4 introduces a DataLoad custom resource for declarative data pre‑warming, improves support for datasets made up of massive numbers of small files, adds HDFS‑compatible access for Spark and other big‑data frameworks, and allows multiple datasets to be deployed together on a single node, with measurable performance gains on small‑file workloads.

Alibaba Cloud Native

New Features in Fluid 0.4

Fluid, the cloud‑native data orchestration platform for AI and big‑data workloads, released version 0.4 with four major enhancements:

DataLoad custom resource for simple, declarative data pre‑warming.

Improved handling of massive small‑file datasets, expanding AI use cases.

HDFS‑compatible interface to support Spark, Hadoop MapReduce, and other frameworks.

Multi‑dataset single‑node mixed deployment for shared‑cluster environments.

DataLoad Custom Resource

DataLoad provides a Kubernetes‑native API to control data pre‑warming. A minimal example:

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-dataload
spec:
  dataset:
    name: imagenet
    namespace: default

Additional configuration enables sub‑directory loading, cache replica control, and metadata synchronization. Detailed usage is documented in the project’s GitHub repository.
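As a rough sketch of those additional options, the manifest below extends the minimal example with metadata synchronization, a target sub‑directory, and a cache replica count. The field names loadMetadata, target, path, and replicas reflect one reading of the v1alpha1 API and should be checked against the DataLoad CRD installed in your cluster; the /train path is purely hypothetical.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-dataload
spec:
  dataset:
    name: imagenet
    namespace: default
  # Assumed field: sync file metadata from the underlying storage before loading.
  loadMetadata: true
  # Assumed field: restrict pre-warming to selected paths instead of the whole dataset.
  target:
    - path: /train      # hypothetical sub-directory inside the dataset
      replicas: 2       # number of cache replicas to keep for this path
```

If the fields match your Fluid version, applying the manifest with kubectl should start pre‑warming only the listed path rather than the full dataset.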

Enhanced Small‑File Support

Fluid 0.4 integrates asynchronous metadata loading and streaming data processing to accelerate workloads with millions of tiny files. Benchmark results show dramatic reductions:

Dataset initialization: 60 min → 22 min (about 2.7× faster).

8‑thread parallel read: 407 min → 29 min (about 14× faster).

Deep‑learning model training: 6.5 h → 45 min (about 8.7× faster).

Future releases will continue to address the challenges of massive small‑file storage.

HDFS Compatibility for Spark and Other Frameworks

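By exposing Alluxio’s HCFS (Hadoop Compatible File System) interface, Fluid lets Spark, Hadoop MapReduce, and similar engines read datasets through the standard Hadoop file system API without changes to application code, while still benefiting from Fluid’s distributed caching and acceleration.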

Multi‑Dataset Single‑Node Mixed Deployment

Previously, a single node could host only one dataset. Fluid 0.4 now permits multiple datasets to coexist on the same GPU node, provided resources are sufficient, eliminating deployment conflicts and improving cluster resource utilization.
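To make the scenario concrete, here is a minimal sketch of two datasets, each with its own Alluxio runtime, that could share a node when memory allows. The dataset names, mount points, and quotas are hypothetical, and the field layout follows one reading of the v1alpha1 API. Whether both caches actually land on the same node depends on scheduling and on the placement settings of your Fluid version, which this sketch does not cover.

```yaml
# Dataset A and its cache runtime (names and mount point are hypothetical).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: dataset-a
spec:
  mounts:
    - mountPoint: https://example.com/datasets/a/
      name: dataset-a
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: dataset-a        # a runtime is bound to the Dataset with the same name
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi       # cache capacity reserved for dataset-a on the node
---
# Dataset B, sized so that both caches fit on one node (2Gi + 2Gi of memory).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: dataset-b
spec:
  mounts:
    - mountPoint: https://example.com/datasets/b/
      name: dataset-b
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: dataset-b
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
```

Keeping the combined quota below the node's available memory is what the "provided resources are sufficient" condition amounts to in this sketch.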

Conclusion

Fluid 0.4 responds to user feedback from real deployments: it speeds up small‑file workloads, simplifies data pre‑warming with DataLoad, extends support to big‑data frameworks through an HDFS‑compatible interface, and allows multiple datasets to share a node. Together, these changes broaden its applicability and improve the user experience in production environments.
