
How Vineyard Accelerates Cloud‑Native Big Data Workflows with Zero‑Copy Memory Sharing

Vineyard, an open‑source distributed memory data‑sharing engine, tackles the inefficiencies of traditional file‑system based big‑data pipelines by enabling zero‑copy, in‑memory object exchange, Kubernetes‑aware scheduling, and plug‑in operators, delivering up to 1.34× faster end‑to‑end execution.

Alibaba Cloud Native

Background and Motivation

Big‑data analysis pipelines often exchange intermediate results through distributed file systems such as HDFS, S3, or OSS. This adds serialization, memory‑copy, and network I/O overhead and blocks pipeline parallelism; the resulting data exchange consumes more than 40% of execution time for over 60% of tasks.

Vineyard Overview

Vineyard is a distributed engine that provides in‑memory data sharing for end‑to‑end workflows in cloud‑native environments. It was accepted as a CNCF sandbox project on 27 April 2021. The source code is hosted at https://github.com/alibaba/v6d.

Vineyard workflow illustration

Key Challenges Addressed

Extra serialization, memory copy, and I/O overhead when tasks write to and read from external storage.

Difficulty integrating new compute engines, requiring repeated data format conversion.

Pipeline parallelism is blocked because downstream tasks must wait for upstream tasks to finish writing all results.

Distributed file systems ignore data locality in cloud‑native settings, causing unnecessary network traffic.

Vineyard Design Principles

Zero‑copy memory sharing: Uses memory‑mapped files so data can be shared across processes without additional I/O.

Rich object abstractions: Provides ready‑to‑use types such as Tensor, DataFrame, and Graph, eliminating serialization and allowing plug‑in components (IO, migration, snapshot) to be registered on demand.

Operators for pipeline parallelism: Includes a Pipeline operator that lets downstream tasks start processing as soon as upstream results become available.

Kubernetes‑aware scheduling: A Scheduler Plugin observes the locality of Vineyard objects and prefers placing dependent Pods on nodes that already host the required data, reducing data migration overhead.
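
The pipeline-parallelism principle above can be illustrated with a small, self-contained sketch. It is not Vineyard's Pipeline operator and does not use any Vineyard API; it only shows the general idea of a downstream task consuming chunks while the upstream task is still producing them. The names upstream, downstream, run_pipeline, and the _DONE sentinel are hypothetical.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel marking the end of the stream


def upstream(out: queue.Queue, n_chunks: int) -> None:
    """Publish each chunk as soon as it is ready, not one final result."""
    for i in range(n_chunks):
        out.put([i * 10 + j for j in range(3)])
    out.put(_DONE)


def downstream(inp: queue.Queue, results: list) -> None:
    """Process each chunk on arrival, without waiting for upstream to finish."""
    while True:
        chunk = inp.get()
        if chunk is _DONE:
            break
        results.append(sum(chunk))


def run_pipeline(n_chunks: int = 4) -> list:
    # A bounded queue keeps at most two chunks in flight (back-pressure).
    channel: queue.Queue = queue.Queue(maxsize=2)
    results: list = []
    producer = threading.Thread(target=upstream, args=(channel, n_chunks))
    consumer = threading.Thread(target=downstream, args=(channel, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


if __name__ == "__main__":
    print(run_pipeline())  # [3, 33, 63, 93]
```

With a file-system handoff, the consumer would start only after the producer had written every chunk; here the two stages overlap.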

Core Functionality

1. Distributed In‑Memory Data Sharing

Data are represented as Objects that can be local or global. A global DataFrame consists of many local Chunk objects distributed across a cluster. These objects can be directly shared with other engines such as GraphScope.
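
As a rough mental model of the local/global split, the sketch below describes a global DataFrame as a list of chunk descriptors, each recording which node holds it. The class and field names (LocalChunk, GlobalDataFrame, object_id, node) and the sample values are assumptions made for illustration, not Vineyard's actual types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LocalChunk:
    object_id: str  # hypothetical identifier
    node: str       # node whose memory holds this chunk
    num_rows: int


@dataclass
class GlobalDataFrame:
    chunks: list = field(default_factory=list)

    def num_rows(self) -> int:
        """Total rows across all chunks in the cluster."""
        return sum(c.num_rows for c in self.chunks)

    def chunks_on(self, node: str) -> list:
        """Chunks a task running on `node` can read from local memory."""
        return [c for c in self.chunks if c.node == node]


def example_frame() -> GlobalDataFrame:
    # Made-up sample data for the sketch.
    return GlobalDataFrame([
        LocalChunk("o1", "node-a", 100),
        LocalChunk("o2", "node-b", 250),
        LocalChunk("o3", "node-a", 50),
    ])


if __name__ == "__main__":
    df = example_frame()
    print(df.num_rows())                                # 400
    print([c.object_id for c in df.chunks_on("node-a")])  # ['o1', 'o3']
```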

2. Flexible Metadata Model

Each Object comprises Metadata and a set of Blobs. A Blob stores raw bytes (e.g., a Tensor’s contiguous values) while Metadata describes shape, type, and layout. This design enables the same Object to be accessed from Python (as a NumPy ndarray) or C++ (as an xtensor tensor) without extra copies.
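
The metadata-plus-blob split can be sketched with the Python standard library alone. The example below assumes a blob stored in a memory-mapped temporary file and a small JSON metadata record; two independent mappings of the same blob then see each other's data without copying it. The functions write_object, open_view, and demo, and the metadata keys, are hypothetical and do not reflect Vineyard's real layout.

```python
import json
import mmap
import os
import struct
import tempfile


def write_object(path: str, values: list) -> dict:
    """Write a blob of native float64 values; return metadata describing it."""
    blob = struct.pack(f"{len(values)}d", *values)
    with open(path, "wb") as f:
        f.write(blob)
    return {"typename": "Tensor", "dtype": "d",
            "shape": [len(values)], "nbytes": len(blob)}


def open_view(path: str, meta: dict):
    """Map the blob and return (mapping, typed view) over the mapped pages."""
    with open(path, "r+b") as f:
        mapping = mmap.mmap(f.fileno(), meta["nbytes"])
    # cast() reinterprets the mapped bytes in place; nothing is copied.
    return mapping, memoryview(mapping).cast(meta["dtype"])


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blob.bin")
        meta_json = json.dumps(write_object(path, [1.0, 2.0, 3.0]))
        # Only the small metadata record travels between the two "clients".
        map_a, writer = open_view(path, json.loads(meta_json))
        map_b, reader = open_view(path, json.loads(meta_json))
        writer[1] = 20.0        # update through one mapping...
        seen = reader.tolist()  # ...is visible through the other (copy for display)
        writer.release()
        reader.release()
        map_a.close()
        map_b.close()
        return seen


if __name__ == "__main__":
    print(demo())  # [1.0, 20.0, 3.0]
```

Only the metadata is serialized here; the payload stays in the mapped pages, which is the property that makes cross-language access cheap.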

3. Pluggable Operators and Scheduler Integration

Operators enable advanced data‑flow patterns, and a Kubernetes Custom Resource Definition (CRD) exposes Vineyard Objects as observable resources. The custom scheduler plugin reads object metadata to co‑locate tasks, avoiding costly data transfers.
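
To make the co-location idea concrete, here is a toy scoring function. It assumes the scheduler can look up which nodes hold each required object and how large each object is, and it prefers the node that already holds the most required bytes. This is not the plugin's actual scoring logic, and score_nodes, pick_node, and the sample object names and sizes are made up for the sketch.

```python
from collections import defaultdict


def score_nodes(required: list, locations: dict, sizes: dict) -> dict:
    """Score each node by the bytes of required objects it already holds."""
    scores = defaultdict(int)
    for obj in required:
        for node in locations.get(obj, ()):
            scores[node] += sizes.get(obj, 0)
    return dict(scores)


def pick_node(required: list, locations: dict, sizes: dict, candidates: list) -> str:
    """Choose the candidate with the highest locality score."""
    scores = score_nodes(required, locations, sizes)
    # max() returns the first candidate on ties, so the result is deterministic.
    return max(candidates, key=lambda n: scores.get(n, 0))


# Made-up object placement and sizes (bytes) for illustration.
LOCATIONS = {"obj-1": ["node-a"], "obj-2": ["node-b"], "obj-3": ["node-b"]}
SIZES = {"obj-1": 500, "obj-2": 300, "obj-3": 400}

if __name__ == "__main__":
    needed = ["obj-1", "obj-2", "obj-3"]
    print(score_nodes(needed, LOCATIONS, SIZES))  # {'node-a': 500, 'node-b': 700}
    print(pick_node(needed, LOCATIONS, SIZES, ["node-a", "node-b", "node-c"]))  # node-b
```

Placing the Pod on node-b means only obj-1 (500 bytes) has to migrate, instead of 700 bytes if node-a were chosen.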

Performance Evaluation

Benchmarking against HDFS‑based sharing shows that Vineyard reduces intermediate‑data exchange overhead and speeds up overall workflow execution by 1.34×.

Quick Start

Vineyard can be installed via Helm:

helm repo add vineyard https://vineyard.oss-ap-southeast-1.aliyuncs.com/charts/
helm install vineyard vineyard/vineyard

After installation, a Vineyard DaemonSet runs and exposes a UNIX domain socket for shared‑memory and IPC communication between application Pods.

Future Outlook

Vineyard already powers the Mars distributed scientific computing engine and the GraphScope graph system. Ongoing work focuses on tighter integration with cloud‑native projects such as Kubeflow and Fluid to further simplify large‑scale data analytics on Kubernetes.
