How to Pick the Best Storage for Kubernetes Workflows: Artifacts vs Volumes
This article examines the storage challenges of Kubernetes‑based Argo Workflows, comparing artifact mechanisms and native volumes, evaluating integrated versus separated compute‑storage architectures, and presenting performance‑oriented optimization techniques for object and file storage in AI and big‑data pipelines.
Introduction: Data in Workflows
In Kubernetes environments, workflow scenarios are far more sensitive to data access performance than traditional applications; storage system performance directly determines execution efficiency and stability.
Taking large‑scale Argo Workflows as an example, elastic scaling and dynamic scheduling place high‑concurrency, high‑throughput demands on the storage layer, while costs must still stay under control.
Artifacts vs Volumes
Argo Workflow storage read/write is implemented mainly through two mechanisms:
Artifact mechanism: Argo wraps MinIO or cloud‑provider object‑storage SDKs; users declare input/output artifact paths in the workflow YAML. This suits pipelines whose concurrency and data volume are controlled and that do not need shared reuse of data.
Kubernetes native Volume mechanism: storage media are mounted directly into Pods, providing shared storage for short‑lived, high‑concurrency Pods (see the sketch after this list).
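To make the contrast concrete, here is a minimal sketch of a single Workflow step that writes a file to a mounted PVC volume and also exports it as an output artifact to an S3‑compatible store; the claim name shared-workdir and the object key demo/result.txt are illustrative, and the bucket/endpoint are assumed to come from the configured artifact repository:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: storage-demo-
spec:
  entrypoint: produce
  # Volume mechanism: a pre-created PVC shared by any step that mounts it
  volumes:
    - name: workdir
      persistentVolumeClaim:
        claimName: shared-workdir          # hypothetical PVC
  templates:
    - name: produce
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo hello > /work/result.txt"]
        volumeMounts:
          - name: workdir
            mountPath: /work
      outputs:
        artifacts:
          # Artifact mechanism: the file is uploaded to object storage after the step
          - name: result
            path: /work/result.txt
            s3:
              key: demo/result.txt         # bucket/endpoint come from the artifact repository config
```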
The underlying storage architecture differences directly affect volume read/write performance, which is the focus of the following discussion.
Integrated Compute‑Storage vs Separate Compute‑Storage
Two architectural models are considered:
Integrated compute‑storage: in container environments, Pods access shared local storage (e.g., node‑pooled system disks) over the container network.
Separate compute‑storage: cloud object storage (e.g., OSS) decouples data from compute, enabling each to scale elastically and independently.
Key factors when selecting a storage architecture include:
Scalability and flexibility: local storage capacity is bounded by node count and type; cloud storage offers elastic capacity and pay‑as‑you‑go pricing.
Performance and latency: local storage provides low latency and high throughput but can be limited by single‑node I/O; cloud storage can be accelerated with caching layers or specialized services such as CPFS.
Cost and operational overhead: local storage reuses idle node resources but requires ongoing system maintenance; cloud storage incurs data‑transfer and tiering costs.
For Argo Workflows, the maximum concurrent read/write capability of the storage system must be considered to avoid I/O bottlenecks.
Temporary intermediate data should be isolated on demand (e.g., namespace isolation) and cleaned up via TTL policies or automated scripts to reduce redundancy.
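As a concrete cleanup option, Argo's built‑in ttlStrategy and podGC settings can expire finished workflows and their Pods automatically; a minimal sketch, assuming a hypothetical pipelines-scratch namespace used to isolate temporary data:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ttl-demo-
  namespace: pipelines-scratch        # hypothetical namespace for isolating intermediate data
spec:
  entrypoint: main
  ttlStrategy:
    secondsAfterSuccess: 3600         # delete the Workflow object 1 hour after success
    secondsAfterFailure: 86400        # keep failed runs longer for debugging
  podGC:
    strategy: OnPodSuccess            # remove completed Pods as soon as they succeed
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, done]
```

Note that ttlStrategy deletes the Workflow object itself; data written to external volumes or buckets still needs its own lifecycle policy (e.g., OSS lifecycle rules or a scheduled cleanup job).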
File Storage vs Object Storage
Cloud storage can be classified into block, file (NAS), and object (OSS) types. In job‑intensive workloads such as Argo Workflows, NAS and OSS are the most commonly used.
NAS: provides full POSIX semantics and excels in massive small‑file and metadata‑intensive scenarios such as HPC (see the NFS‑style sketch after this list).
OSS: offers virtually unlimited capacity, low cost, and HTTP access, making it indispensable for big‑data and AI training pipelines (e.g., autonomous driving data ingestion).
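For reference, a NAS file system is typically consumed in Kubernetes like any NFS share, statically provisioned as a PersistentVolume; a minimal sketch in which the mount target and sizes are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nas-shared
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteMany"]       # many Pods can read and write the share concurrently
  nfs:
    server: example-nas.cn-hangzhou.nas.example.com   # hypothetical NAS mount target
    path: /workflows
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nas-shared-claim
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""                 # bind to the static PV above instead of dynamic provisioning
  volumeName: nas-shared
  resources:
    requests:
      storage: 500Gi
```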
OSS: From Performance Bottlenecks to Optimization
Root Causes of Performance Bottlenecks
Object storage’s flat namespace does not map cleanly onto POSIX file‑system semantics; accessing OSS through a POSIX interface therefore requires a FUSE client that translates file‑system calls into HTTP requests, adding protocol‑conversion overhead.
In Kubernetes, the CSI driver runs the FUSE client, mounts the object store into the pod’s namespace, and only after successful mounting can the workload interact with OSS.
The FUSE layer adds extra context switches and latency, and object storage’s own characteristics, such as whole‑object overwrites instead of in‑place random writes and limited metadata capabilities, further exacerbate the bottleneck.
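As an illustration of how this mount path looks in practice, the sketch below statically provisions an OSS bucket through a FUSE‑based CSI driver; the driver name follows ACK's OSS CSI plugin, but the bucket, endpoint, secret, and exact volumeAttributes are assumptions and should be checked against the driver version in your cluster:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: oss-dataset
spec:
  capacity:
    storage: 100Gi                            # declarative only; the bucket itself is unbounded
  accessModes: ["ReadOnlyMany"]
  csi:
    driver: ossplugin.csi.alibabacloud.com    # ACK OSS CSI driver (runs the FUSE client on the node)
    volumeHandle: oss-dataset
    volumeAttributes:
      bucket: "training-data"                 # hypothetical bucket
      url: "oss-cn-hangzhou-internal.aliyuncs.com"
      otherOpts: "-o ro -o max_stat_cache_size=0"   # FUSE (ossfs) mount options; tune per workload
    nodePublishSecretRef:
      name: oss-secret                        # holds the bucket's access credentials
      namespace: default
```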
Common Performance Optimization Strategies
Modify the server‑side storage layout: store data in formats optimized for OSS while preserving POSIX compatibility for readers.
Use lightweight clients: focus on sequential reads/writes and omit metadata handling, which fits read‑heavy AI workloads.
Cluster‑side distributed caching: pool node memory/disk to cache hot data, reducing direct OSS API calls (see the sketch after this list).
Server‑side acceleration: leverage cloud‑provider acceleration services (e.g., an OSS accelerator) at additional cost.
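A common open‑source way to implement cluster‑side distributed caching is Fluid (not named above, so treat this as one possible realization); a minimal sketch, assuming Fluid is installed and with an illustrative bucket, prefix, and cache size:

```yaml
# Declares which OSS prefix should be cached across the cluster
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: training-data
spec:
  mounts:
    - mountPoint: oss://training-data/imagenet/       # hypothetical bucket and prefix
      name: imagenet
      options:
        fs.oss.endpoint: oss-cn-hangzhou-internal.aliyuncs.com
      # access credentials omitted; normally injected via a Secret
---
# Pools node memory as a cache tier so hot data is served without hitting OSS
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: training-data
spec:
  replicas: 3
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 20Gi
```

Workloads then mount the PVC that Fluid creates for the Dataset, and repeated reads are served from the cluster‑side cache instead of hitting OSS directly.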
Breaking the FUSE Performance Barrier
A newer approach explores virtual block devices (via ublk) combined with kernel‑mode file systems such as EROFS to bypass FUSE, dramatically lowering context‑switch overhead.
This solution is especially effective for massive small‑file read‑only scenarios such as AI training datasets.
In ACK clusters, this approach can be tried out via the strmvol volume type.
Summary
Artifacts vs Volumes?
Use Artifacts when concurrency and data volume are controllable and shared reuse is unnecessary; otherwise, consider Volumes and tune them accordingly.
Integrated vs Separate Compute‑Storage?
Integrated compute‑storage suits non‑data‑centric enterprises using open‑source solutions (MinIO, Ceph) or large AI firms with RDMA‑enabled hardware; most enterprises should adopt separate compute‑storage.
NAS vs OSS?
NAS/CPFS offers performance advantages for HPC and scenarios requiring strong consistency, high concurrency, random writes, or rich file metadata. OSS is chosen for cost‑effective, unlimited capacity and HTTP accessibility, but should be optimized based on data characteristics.
OSS Optimization Techniques
Common methods include server‑side layout changes, lightweight clients, distributed caching, and acceleration services.