How to Pick the Best Storage for Kubernetes Workflows: Artifacts vs Volumes
This article examines the storage challenges of Kubernetes‑based Argo Workflows, comparing artifact mechanisms and native volumes, evaluating integrated versus separated compute‑storage architectures, and presenting performance‑oriented optimization techniques for object and file storage in AI and big‑data pipelines.
Introduction: Data in Workflows
In Kubernetes environments, workflow scenarios are far more sensitive to data access performance than traditional applications; storage system performance directly determines execution efficiency and stability.
Taking large‑scale Argo Workflows as an example, elastic scaling and dynamic scheduling place high‑concurrency, high‑throughput demands on the storage layer, while costs must still stay under control.
Artifacts vs Volumes
Argo Workflow storage read/write is implemented mainly through two mechanisms:
Artifact mechanism: Argo wraps MinIO or cloud‑provider object‑storage SDKs; users declare input/output artifact paths in the workflow YAML. This suits pipelines whose concurrency and data volume are controlled and that do not need shared reuse of data.
Kubernetes native Volume mechanism: storage media are mounted directly into Pods, providing shared storage for short‑lived, high‑concurrency Pods (see the sketch after this list).
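To make the contrast concrete, here is a minimal sketch of a single Workflow step that writes a file to a mounted PVC volume and also exports it as an output artifact to an S3‑compatible store; the claim name shared-workdir and the object key demo/result.txt are illustrative, and the bucket/endpoint are assumed to come from the configured artifact repository:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: storage-demo-
spec:
  entrypoint: produce
  # Volume mechanism: a pre-created PVC shared by any step that mounts it
  volumes:
    - name: workdir
      persistentVolumeClaim:
        claimName: shared-workdir          # hypothetical PVC
  templates:
    - name: produce
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo hello > /work/result.txt"]
        volumeMounts:
          - name: workdir
            mountPath: /work
      outputs:
        artifacts:
          # Artifact mechanism: the file is uploaded to object storage after the step
          - name: result
            path: /work/result.txt
            s3:
              key: demo/result.txt         # bucket/endpoint come from the artifact repository config
```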
The underlying storage architecture differences directly affect volume read/write performance, which is the focus of the following discussion.
Integrated Compute‑Storage vs Separate Compute‑Storage
Two architectural models are considered:
Integrated compute‑storage: in container environments, Pods access shared local storage (e.g., node‑pooled system disks) over the container network.
Separate compute‑storage: cloud object storage (e.g., OSS) decouples data from compute, enabling each to scale elastically and independently.
Key factors when selecting a storage architecture include:
Scalability and flexibility: local storage capacity is bounded by node count and type; cloud storage offers elastic capacity and pay‑as‑you‑go pricing.
Performance and latency: local storage provides low latency and high throughput but can be limited by single‑node I/O; cloud storage can be accelerated with caching layers or specialized services such as CPFS.
Cost and operational overhead: local storage reuses idle node resources but requires ongoing system maintenance; cloud storage incurs data‑transfer and tiering costs.
For Argo Workflows, the maximum concurrent read/write capability of the storage system must be considered to avoid I/O bottlenecks.
Temporary intermediate data should be isolated on demand (e.g., namespace isolation) and cleaned up via TTL policies or automated scripts to reduce redundancy.
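As a concrete cleanup option, Argo's built‑in ttlStrategy and podGC settings can expire finished workflows and their Pods automatically; a minimal sketch, assuming a hypothetical pipelines-scratch namespace used to isolate temporary data:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ttl-demo-
  namespace: pipelines-scratch        # hypothetical namespace for isolating intermediate data
spec:
  entrypoint: main
  ttlStrategy:
    secondsAfterSuccess: 3600         # delete the Workflow object 1 hour after success
    secondsAfterFailure: 86400        # keep failed runs longer for debugging
  podGC:
    strategy: OnPodSuccess            # remove completed Pods as soon as they succeed
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, done]
```

Note that ttlStrategy deletes the Workflow object itself; data written to external volumes or buckets still needs its own lifecycle policy (e.g., OSS lifecycle rules or a scheduled cleanup job).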
File Storage vs Object Storage
Cloud storage can be classified into block, file (NAS), and object (OSS) types. In job‑intensive workloads such as Argo Workflows, NAS and OSS are the most commonly used.
NAS: provides full POSIX semantics and excels in massive small‑file and metadata‑intensive scenarios such as HPC (see the NFS‑style sketch after this list).
OSS: offers virtually unlimited capacity, low cost, and HTTP access, making it indispensable for big‑data and AI training pipelines (e.g., autonomous driving data ingestion).
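For reference, a NAS file system is typically consumed in Kubernetes like any NFS share, statically provisioned as a PersistentVolume; a minimal sketch in which the mount target and sizes are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nas-shared
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteMany"]       # many Pods can read and write the share concurrently
  nfs:
    server: example-nas.cn-hangzhou.nas.example.com   # hypothetical NAS mount target
    path: /workflows
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nas-shared-claim
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""                 # bind to the static PV above instead of dynamic provisioning
  volumeName: nas-shared
  resources:
    requests:
      storage: 500Gi
```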
OSS: From Performance Bottlenecks to Optimization
Root Causes of Performance Bottlenecks
Object storage’s flat namespace does not map cleanly onto POSIX file‑system semantics; accessing OSS through a POSIX interface therefore requires a FUSE client that translates file‑system calls into HTTP requests, adding protocol‑conversion overhead.
In Kubernetes, the CSI driver runs the FUSE client, mounts the object store into the pod’s namespace, and only after successful mounting can the workload interact with OSS.
The FUSE layer adds extra context switches and latency, and object storage’s own characteristics, such as whole‑object overwrites instead of in‑place random writes and limited metadata capabilities, further exacerbate the bottleneck.
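As an illustration of how this mount path looks in practice, the sketch below statically provisions an OSS bucket through a FUSE‑based CSI driver; the driver name follows ACK's OSS CSI plugin, but the bucket, endpoint, secret, and exact volumeAttributes are assumptions and should be checked against the driver version in your cluster:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: oss-dataset
spec:
  capacity:
    storage: 100Gi                            # declarative only; the bucket itself is unbounded
  accessModes: ["ReadOnlyMany"]
  csi:
    driver: ossplugin.csi.alibabacloud.com    # ACK OSS CSI driver (runs the FUSE client on the node)
    volumeHandle: oss-dataset
    volumeAttributes:
      bucket: "training-data"                 # hypothetical bucket
      url: "oss-cn-hangzhou-internal.aliyuncs.com"
      otherOpts: "-o ro -o max_stat_cache_size=0"   # FUSE (ossfs) mount options; tune per workload
    nodePublishSecretRef:
      name: oss-secret                        # holds the bucket's access credentials
      namespace: default
```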
Common Performance Optimization Strategies
Modify the server‑side storage layout: store data in formats optimized for OSS while preserving POSIX compatibility for readers.
Use lightweight clients: focus on sequential reads/writes and omit metadata handling, which fits read‑heavy AI workloads.
Cluster‑side distributed caching: pool node memory/disk to cache hot data, reducing direct OSS API calls (see the sketch after this list).
Server‑side acceleration: leverage cloud‑provider acceleration services (e.g., an OSS accelerator) at additional cost.
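A common open‑source way to implement cluster‑side distributed caching is Fluid (not named above, so treat this as one possible realization); a minimal sketch, assuming Fluid is installed and with an illustrative bucket, prefix, and cache size:

```yaml
# Declares which OSS prefix should be cached across the cluster
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: training-data
spec:
  mounts:
    - mountPoint: oss://training-data/imagenet/       # hypothetical bucket and prefix
      name: imagenet
      options:
        fs.oss.endpoint: oss-cn-hangzhou-internal.aliyuncs.com
      # access credentials omitted; normally injected via a Secret
---
# Pools node memory as a cache tier so hot data is served without hitting OSS
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: training-data
spec:
  replicas: 3
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 20Gi
```

Workloads then mount the PVC that Fluid creates for the Dataset, and repeated reads are served from the cluster‑side cache instead of hitting OSS directly.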
Breaking the FUSE Performance Barrier
A newer approach explores virtual block devices (via ublk) combined with kernel‑mode file systems such as EROFS to bypass FUSE, dramatically lowering context‑switch overhead.
This solution is especially effective for massive small‑file read‑only scenarios such as AI training datasets.
In ACK clusters, this approach can be tried out via the strmvol volume type.
Summary
Artifacts vs Volumes?
Use Artifacts when concurrency and data volume are controllable and shared reuse is unnecessary; otherwise, consider Volumes and tune them accordingly.
Integrated vs Separate Compute‑Storage?
Integrated compute‑storage suits non‑data‑centric enterprises using open‑source solutions (MinIO, Ceph) or large AI firms with RDMA‑enabled hardware; most enterprises should adopt separate compute‑storage.
NAS vs OSS?
NAS/CPFS offers performance advantages for HPC and scenarios requiring strong consistency, high concurrency, random writes, or rich file metadata. OSS is chosen for cost‑effective, unlimited capacity and HTTP accessibility, but should be optimized based on data characteristics.
OSS Optimization Techniques
Common methods include server‑side layout changes, lightweight clients, distributed caching, and acceleration services.