
Using Fluid Cloud‑Native Data Caching to Boost Performance and Elasticity of a Quantitative Research Platform on Alibaba Cloud

This article describes how JoinQuant built a cloud‑native quantitative research platform on Alibaba Cloud, the performance, cost, data‑management, and security challenges it ran into, and how it addressed them with Fluid's JindoRuntime data caching, elastic cache scaling, and Python‑driven workflows. In tests, average data‑access time fell from 15 minutes to 38.5 seconds and compute cost dropped to one‑tenth.

Alibaba Cloud Infrastructure

Background

Quantitative investment relies on data‑driven decisions, and JoinQuant uses large‑scale market data, AI, and automated trading. Their research workflow includes factor mining, return prediction, portfolio optimization, and back‑testing, all of which are data‑intensive tasks running on Alibaba Cloud services such as ECS, ECI, ACK, NAS, and OSS.

Challenges

The platform faced performance bottlenecks, high and variable bandwidth costs, complex data management across NAS and OSS, data‑security isolation, a steep learning curve for Kubernetes/YAML, and the need for dynamic data‑source mounting without restarting Jupyter notebooks.

Solution Overview

JoinQuant found that Kubernetes' native CSI support could not meet its need to accelerate several data sources at once, so the team adopted Fluid, a CNCF project. Fluid provides a unified way to manage and accelerate multiple Persistent Volume Claims (PVCs) backed by OSS and NAS, with JindoRuntime offering the best performance and stability for JoinQuant's workloads.

Key capabilities include:

Per‑data‑type storage policies (read‑only for training data, read‑write for feature data and checkpoints) via Fluid Datasets.

Elastic scaling of cache workers on high‑IO, large‑memory ECS/ECI instances, including Spot instances for cost savings.

Scheduled cache scaling (CronHorizontalPodAutoscaler) to match workload tides.

Data pre‑warming and metadata synchronization to keep caches up‑to‑date.

Namespace‑based isolation for secure multi‑team data access while allowing shared public datasets.

Python SDK for end‑to‑end dataset creation, runtime binding, cache scaling, and pre‑loading, eliminating the need for YAML.
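The scheduled ("tidal") scaling idea can be illustrated with a small, self-contained sketch. It does not call CronHorizontalPodAutoscaler or any Fluid API; it only shows the decision a time-based schedule encodes. The peak window, the replica counts, and the function name are assumptions made up for illustration, not JoinQuant's real settings.

```python
from datetime import time

# Hypothetical tidal schedule: (window start, window end, cache worker replicas).
# The hours and replica counts are illustrative, not JoinQuant's real settings.
PEAK_WINDOWS = [
    (time(9, 0), time(18, 0), 8),
]
OFF_PEAK_REPLICAS = 1


def desired_cache_replicas(now: time) -> int:
    """Return the replica count for the window containing `now`."""
    for start, end, replicas in PEAK_WINDOWS:
        if start <= now < end:  # the window end is exclusive
            return replicas
    return OFF_PEAK_REPLICAS


if __name__ == "__main__":
    for t in (time(8, 59), time(9, 0), time(17, 59), time(18, 0)):
        print(t.isoformat(timespec="minutes"), desired_cache_replicas(t))
```

In a real cluster the same table would be expressed as scheduled scale-out and scale-in jobs rather than evaluated in application code.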

Implementation Details

The YAML below defines two Fluid Datasets: a read‑only one for training data and a read‑write one for checkpoints. The equivalent Python SDK usage follows.

# Read-only Dataset for training data
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: training-data
spec:
  mounts:
    - mountPoint: "pvc://nas/training-data"
      path: "/training-data"
  accessModes:
    - ReadOnlyMany
---
# Read-write Dataset for checkpoints
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: checkpoint
spec:
  mounts:
    - mountPoint: "pvc://nas/checkpoint"
      path: "/checkpoints"
  accessModes:
    - ReadWriteMany
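The per‑data‑type policy behind these two definitions (read‑only for training data, read‑write for feature data and checkpoints) can also be captured as a lookup table. The following is a minimal sketch that builds the manifests as plain Python dictionaries using only the standard library; it does not use the Fluid SDK. The helper name `dataset_manifest`, the data‑type keys, and the `team-a` namespace are hypothetical.

```python
import json

API_VERSION = "data.fluid.io/v1alpha1"

# Access mode per data type, following the policy described in the article.
# The dictionary keys are hypothetical labels chosen for this sketch.
ACCESS_MODE_BY_DATA_TYPE = {
    "training": "ReadOnlyMany",
    "feature": "ReadWriteMany",
    "checkpoint": "ReadWriteMany",
}


def dataset_manifest(name, data_type, mount_point, path, namespace="default"):
    """Build a Dataset manifest dict whose access mode follows the data type."""
    return {
        "apiVersion": API_VERSION,
        "kind": "Dataset",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "mounts": [{"mountPoint": mount_point, "path": path}],
            "accessModes": [ACCESS_MODE_BY_DATA_TYPE[data_type]],
        },
    }


if __name__ == "__main__":
    manifest = dataset_manifest(
        "training-data",
        "training",
        "pvc://nas/training-data",
        "/training-data",
        namespace="team-a",  # hypothetical per-team namespace
    )
    print(json.dumps(manifest, indent=2))
```

Keeping the policy in one table means a new data type needs one new entry rather than a new hand‑written manifest, and placing each team's Datasets in its own namespace mirrors the isolation model described earlier.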

Python example:

import fluid
from fluid import constants, models

# Connect to the cluster
client_config = fluid.ClientConfig()
fluid_client = fluid.FluidClient(client_config)

# Create a Dataset backed by a NAS PVC
fluid_client.create_dataset(
    dataset_name="mydata",
    mount_name="/",
    mount_point="pvc://static-pvc-nas/mydata",
)

# Fetch a handle to the Dataset that was just created
dataset = fluid_client.get_dataset(dataset_name="mydata")

# Bind JindoRuntime, scale the cache to two workers, then preload /train
dataflow = dataset.bind_runtime(
    runtime_type=constants.JINDO_RUNTIME_KIND,
    replicas=1,
    cache_capacity_GiB=30,
    cache_medium="MEM",
    wait=True,
).scale_cache(replicas=2).preload(target_path="/train")

# Run the dataflow and block until it finishes
run = dataflow.run()
run.wait()
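The example binds one cache worker with 30 GiB of memory and then scales to two. As a rough sizing aid, the arithmetic can be sketched as below. It assumes that total cache capacity is simply replicas multiplied by per‑worker capacity, which ignores any overhead or replication; the function names and the 100 GiB working set are hypothetical.

```python
import math


def total_cache_gib(replicas: int, cache_capacity_gib: float) -> float:
    """Aggregate cache size, assuming capacity adds up linearly across workers."""
    return replicas * cache_capacity_gib


def replicas_needed(working_set_gib: float, cache_capacity_gib: float) -> int:
    """Smallest worker count whose aggregate cache covers the working set."""
    if cache_capacity_gib <= 0:
        raise ValueError("cache_capacity_gib must be positive")
    return max(1, math.ceil(working_set_gib / cache_capacity_gib))


if __name__ == "__main__":
    # Values from the example: 30 GiB per worker, scaled from 1 to 2 replicas.
    print(total_cache_gib(1, 30), total_cache_gib(2, 30))
    # Hypothetical 100 GiB working set.
    print(replicas_needed(100, 30))
```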

Performance Evaluation

Tests with up to 100 concurrent Pods showed Fluid reducing average data‑access time from 15 minutes to 38.5 seconds and cutting compute cost to one‑tenth, thanks to bandwidth scaling with JindoRuntime replicas.
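To put the reported figures in proportion, the short calculation below derives the speedup and the relative reduction from the two numbers quoted above. It uses only those two values and makes no further assumptions about the test setup.

```python
# Figures quoted in the article.
baseline_seconds = 15 * 60   # 15 minutes without caching
cached_seconds = 38.5        # with Fluid caching

speedup = baseline_seconds / cached_seconds
reduction_pct = (1 - cached_seconds / baseline_seconds) * 100

print(f"speedup: {speedup:.1f}x")          # speedup: 23.4x
print(f"reduction: {reduction_pct:.1f}%")  # reduction: 95.7%
```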

Summary and Outlook

Fluid provides elastic, high‑performance data caching that integrates with Kubernetes scaling, enabling flexible, cost‑effective quantitative research. Future work includes tighter coupling of task and cache elasticity, and improving Dataflow data‑affinity for better node locality.
