
Best Practices for Data Acceleration, Stability, and Consistency with Alibaba Cloud ACK Fluid

This guide details how to use Alibaba Cloud ACK Fluid to accelerate data access, improve system stability, and ensure cache consistency across AI, big‑data, and analytics workloads by selecting appropriate ECS instances, cache media, scheduling affinity, and runtime configurations.

Alibaba Cloud Infrastructure

Aug 1, 2024

In the era of large models, rapid advances in AIGC and LLM technologies create significant data‑processing challenges, especially for training, inference, and big‑data analysis. Simply adding a cache layer does not guarantee performance gains; careful configuration is required.

Performance optimization best practices include selecting suitable ECS instance types and cache media (memory, local HDD/SSD) for Fluid’s distributed cache, calculating cache capacity and bandwidth with formulas, and configuring tiered storage levels. Example YAML for a memory‑based tiered store:

```yaml
spec:
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 30Gi # per Worker cache capacity
        high: "0.95"
        low: "0.7"
```
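The per-Worker quota is one input to the capacity and bandwidth sizing mentioned earlier. This summary does not reproduce the source's formulas, so the following is only a first-order sketch under stated assumptions. The symbols are introduced here for illustration: $N$ is the number of cache Workers, $Q$ the per-Worker quota, $B_{\text{ECS}}$ the network bandwidth of a Worker's ECS instance, and $B_{\text{medium}}$ the I/O throughput of its cache medium.

```latex
\[
C_{\text{total}} = N \times Q,
\qquad
B_{\text{total}} \approx N \times \min\left(B_{\text{ECS}},\, B_{\text{medium}}\right)
\]
```

As a hypothetical example, 4 Workers with a 30Gi quota each give roughly $4 \times 30 = 120$ Gi of cache capacity. If each Worker's instance offered 10 Gbit/s of network bandwidth and the cache medium were faster than the network, the aggregate cache bandwidth would be about $4 \times 10 = 40$ Gbit/s. Real figures depend on the instance type and should be checked against its specifications.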

For SSD‑based storage, adjust mediumtype to SSD and set volumeType to hostPath with appropriate paths and quotas.

```yaml
spec:
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: hostPath
        path: /mnt/disk1
        quota: 100Gi
        high: "0.95"
        low: "0.7"
```

When multiple local disks are used, list their mount points in path as a comma-separated value (e.g., /mnt/disk1,/mnt/disk2); the quota is then split across the disks.
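A multi-disk configuration might look like the sketch below. It assumes two local SSDs already mounted on the host at /mnt/disk1 and /mnt/disk2; the quota value is illustrative only.

```yaml
spec:
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: hostPath
        # Comma-separated host paths, one per local disk (assumed mount points)
        path: /mnt/disk1,/mnt/disk2
        # Per-Worker cache capacity, split across the listed disks
        quota: 100Gi
        high: "0.95"
        low: "0.7"
```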

Scheduling affinity ensures cache Workers and application Pods are placed in the same availability zone to reduce cross‑zone latency. Example Dataset affinity configuration:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - <ZONE_ID> # e.g. cn-beijing-i
```
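The Dataset affinity pins the cache Workers to a zone. To keep the application in the same zone, the Pod can carry a matching standard Kubernetes node affinity. The following is a minimal sketch, not the source's manifest: the Pod name, container name, image placeholder, and mount path are hypothetical. It relies on Fluid exposing the Dataset as a PersistentVolumeClaim with the same name as the Dataset.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - <ZONE_ID> # same zone as the Dataset's cache Workers
  containers:
    - name: app # hypothetical container
      image: <APP_IMAGE>
      volumeMounts:
        - name: data
          mountPath: /data # hypothetical mount path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-dataset # PVC named after the Dataset
```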

Stability best practices recommend persisting cache master metadata on ESSD volumes, configuring sufficient memory limits for FUSE Pods, and enabling the FUSE self‑healing feature so that applications do not need to restart when the FUSE process crashes.

Example JindoRuntime master persistence:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: sd-dataset
spec:
  volumes:
    - name: meta-vol
      persistentVolumeClaim:
        claimName: demo-jindo-master-meta
  master:
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 8Gi
    volumeMounts:
      - name: meta-vol
        mountPath: /root/jindofs-meta
    properties:
      namespace.meta-dir: "/root/jindofs-meta"
```
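The runtime above references a claim named demo-jindo-master-meta, which must exist before the master starts. A claim backed by an ESSD disk could be sketched as follows. The storage class name and the size are assumptions: check which ESSD-capable StorageClass the cluster actually provides and size the volume for the expected metadata.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-jindo-master-meta
spec:
  accessModes:
    - ReadWriteOnce
  # Assumed StorageClass name; substitute the ESSD class available in your cluster
  storageClassName: alicloud-disk-essd
  resources:
    requests:
      storage: 20Gi # illustrative size
```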

FUSE resource configuration (recommended high memory limit):

```yaml
spec:
  fuse:
    resources:
      requests:
        memory: 8Gi
    # limits:
    #   memory: <ECS_ALLOCATABLE_MEMORY>
```

Cache read/write consistency strategies depend on workload patterns. For read‑only datasets, the default Fluid configuration suffices. For read‑write scenarios, separate Datasets can be created for read and write paths, or access modes can be set to ReadWriteMany. Example read‑write Dataset:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: model-ckpt
spec:
  accessModes: ["ReadWriteMany"]
```
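The option of separating read and write paths could be sketched as two Datasets over different storage prefixes: one read-only for training data and one read-write for checkpoints. This is an illustration under assumptions, not the source's configuration. The train-data name, the bucket and path placeholders, and the endpoint placeholder are hypothetical, and credentials are omitted.

```yaml
# Read-only Dataset for input data (hypothetical name and paths)
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: train-data
spec:
  accessModes: ["ReadOnlyMany"]
  mounts:
    - name: train
      mountPoint: oss://<BUCKET>/<TRAIN_DATA_PATH>
      options:
        fs.oss.endpoint: <OSS_ENDPOINT>
---
# Read-write Dataset for checkpoints, kept on a separate prefix
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: model-ckpt
spec:
  accessModes: ["ReadWriteMany"]
  mounts:
    - name: ckpt
      mountPoint: oss://<BUCKET>/<CHECKPOINT_PATH>
      options:
        fs.oss.endpoint: <OSS_ENDPOINT>
```

Keeping the two prefixes disjoint means the read-only cache never has to account for writes made through the other Dataset.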

When using JindoRuntime, fine‑tune FUSE attribute timeouts to balance consistency and performance:

```yaml
spec:
  fuse:
    args:
      - -oauto_cache
      - -oattr_timeout=30
      - -oentry_timeout=30
      - -onegative_timeout=30
    properties:
      fs.jindofsx.meta.cache.enable: "false"
```

Overall, choosing suitable ECS specs, cache media, affinity rules, persistence settings, and runtime parameters lets users speed up data access, improve availability, and obtain consistency guarantees suited to AI training, inference, and big-data analytics workloads on Alibaba Cloud ACK Fluid.
