
A New Data Lake Paradigm: Volcano Engine’s Multi‑Modal Data Lake Built on Lance

The article presents Volcano Engine’s AI‑focused data lake built on the Lance format. It explains why traditional lakes fall short for multimodal data, describes the engineering enhancements (Binary Copy Compaction, Lance Insight, distributed vector indexing, JSON‑based tagging, and Row‑ID shuffle optimization), and closes with real‑world case studies that report performance and cost gains.

DataFunSummit

Introduction: Why a New Data Lake Format for AI?

With the explosive growth of multimodal data (images, video, point clouds, embeddings) in AI, traditional columnar lake formats like Parquet struggle with large single‑value sizes, unified management, high‑frequency column addition, and pre‑training shuffle bottlenecks. Lance is introduced as a modern file format designed for AI, offering native multimodal storage, high‑performance random access, zero‑cost column addition, and cloud‑native compatibility.

Volcano Engine’s Deep Enhancements to Lance

1. Binary Copy Compaction – Eliminating CPU‑Intensive Merges

Traditional compaction reads, decompresses, deserializes, merges, and rewrites data, which consumes large amounts of CPU and memory. Volcano Engine contributed Binary Copy Compaction, which copies raw pages and buffers at the byte level and rebuilds only the metadata, drastically reducing CPU load.

Effectiveness: In a test on 5 million rows of complex nested data, merge time dropped from 418.6 s to 15.3 s, roughly a 27× speedup.
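The byte‑level copy idea can be illustrated with a toy file layout. This is a minimal sketch, not Lance's actual format or API: `ToyFile`, `PageMeta`, and the zlib‑compressed pages are assumptions made purely for illustration. The point it shows is that a merge only needs to append the compressed bytes and shift the page offsets.

```python
import zlib
from dataclasses import dataclass


@dataclass
class PageMeta:
    offset: int    # byte offset of the page inside the file body
    length: int    # compressed length in bytes
    num_rows: int  # rows stored in the page


@dataclass
class ToyFile:
    body: bytes            # concatenated compressed pages
    pages: list[PageMeta]  # metadata describing where each page lives


def write_file(row_groups: list[list[str]]) -> ToyFile:
    """Encode each non-empty group of rows as one compressed page."""
    body = bytearray()
    pages = []
    for rows in row_groups:
        page = zlib.compress("\n".join(rows).encode("utf-8"))
        pages.append(PageMeta(len(body), len(page), len(rows)))
        body += page
    return ToyFile(bytes(body), pages)


def binary_copy_merge(files: list[ToyFile]) -> ToyFile:
    """Merge files by copying page bytes verbatim; only offsets are rebuilt."""
    body = bytearray()
    pages = []
    for f in files:
        base = len(body)
        body += f.body  # raw byte copy: nothing is decompressed or decoded
        pages.extend(PageMeta(base + p.offset, p.length, p.num_rows) for p in f.pages)
    return ToyFile(bytes(body), pages)


def read_rows(f: ToyFile) -> list[str]:
    """Decode every page; used here only to check the merged file."""
    rows = []
    for p in f.pages:
        raw = zlib.decompress(f.body[p.offset:p.offset + p.length])
        rows.extend(raw.decode("utf-8").split("\n"))
    return rows


small_a = write_file([["r0", "r1"], ["r2"]])
small_b = write_file([["r3", "r4", "r5"]])
merged = binary_copy_merge([small_a, small_b])
print(read_rows(merged))
```

In this sketch the merge never calls `zlib.decompress`; decoding happens only in `read_rows`, which mirrors where the CPU savings described above would come from.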

Limitations and Two‑Stage Strategy: Binary Copy Compaction does not support Deletion Vectors, and merging many KB‑sized files still incurs small‑read penalties. Volcano Engine therefore applies a two‑stage approach: (1) high‑concurrency ordinary compaction quickly coalesces KB‑sized files into MB‑sized files, then (2) low‑concurrency Binary Copy Compaction merges the MB‑sized files into GB‑scale files.
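A rough way to picture the two‑stage strategy is as a planning step over file sizes. The sketch below is an assumption‑laden illustration: the thresholds (`small_limit`, `stage1_target`, `stage2_target`), the greedy packing, and the function names are hypothetical and are not taken from the talk or from Lance.

```python
KB, MB, GB = 1024, 1024**2, 1024**3


def pack(sizes, target):
    """Greedily pack sizes (in order) into groups whose total stays <= target."""
    groups, current, total = [], [], 0
    for size in sizes:
        if current and total + size > target:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups


def plan_two_stage(file_sizes, small_limit=MB, stage1_target=64 * MB, stage2_target=GB):
    """Return (stage1_groups, stage2_groups) of file sizes in bytes.

    Stage 1: files below `small_limit` go through ordinary compaction.
    Stage 2: the stage-1 outputs plus the already-large files are merged
    by binary copy. All thresholds are hypothetical defaults.
    """
    small = [s for s in file_sizes if s < small_limit]
    large = [s for s in file_sizes if s >= small_limit]
    stage1_groups = pack(small, stage1_target)
    # Assumes a rewritten file is about as large as the sum of its inputs.
    stage1_outputs = [sum(group) for group in stage1_groups]
    stage2_groups = pack(large + stage1_outputs, stage2_target)
    return stage1_groups, stage2_groups


sizes = [8 * KB] * 16384 + [100 * MB] * 20
stage1, stage2 = plan_two_stage(sizes)
print(len(stage1), "ordinary compaction tasks;", len(stage2), "binary copy tasks")
```

Stage 1 produces many small, independent tasks that suit high concurrency, while stage 2 produces a few large sequential copies, which is consistent with the concurrency levels described above.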

2. Lance Insight – Intelligent Observability, Diagnosis, and Optimization

Lance Insight provides three core commands:

SHOW: Retrieves table metadata, file layout, and statistics (OVERVIEW, MANIFEST, INDEX).

ANALYZE: Performs health checks, detecting small files, page‑size issues, index coverage, and query plan problems.

OPTIMIZE: Recommends and executes compaction, cleanup, and index tuning based on the diagnostics.

3. Distributed Vector Index Construction

The default IVF‑PQ index is efficient, but building it on a single node runs into three problems: memory demand (≥500 GB), build time (hours to days), and wasted resources. Volcano Engine distributes the workload:

Coordinator samples data and runs K‑Means to produce centroids and a PQ codebook.

Workers receive the centroids/codebook and process assigned fragments in parallel, computing each vector’s cluster and PQ code.

Coordinator aggregates the results into a global inverted index.

This design integrates with distributed frameworks such as Ray and Daft, so index construction can take advantage of elastic cloud‑native resources.
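The coordinator/worker split above can be sketched in plain Python. This is a simplified illustration under stated assumptions: it covers only the IVF part (centroid training and cluster assignment) and omits the PQ codebook, a thread pool stands in for Ray or Daft workers, and all function names are hypothetical.

```python
import random
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to `vec`."""
    return min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))


def train_centroids(sample, k, iters=10, seed=0):
    """Coordinator step: plain Lloyd's k-means on a data sample."""
    centroids = random.Random(seed).sample(sample, k)
    for _ in range(iters):
        buckets = defaultdict(list)
        for vec in sample:
            buckets[nearest(vec, centroids)].append(vec)
        # Recompute each centroid as the mean of its bucket; keep it if empty.
        centroids = [
            [sum(col) / len(buckets[c]) for col in zip(*buckets[c])]
            if buckets[c] else centroids[c]
            for c in range(k)
        ]
    return centroids


def index_fragment(fragment, centroids):
    """Worker step: assign every (row_id, vector) of one fragment to a cluster."""
    partial = defaultdict(list)
    for row_id, vec in fragment:
        partial[nearest(vec, centroids)].append(row_id)
    return dict(partial)


def merge_partials(partials):
    """Coordinator step: combine per-fragment lists into one inverted index."""
    inverted = defaultdict(list)
    for partial in partials:
        for cluster, row_ids in partial.items():
            inverted[cluster].extend(row_ids)
    return dict(inverted)


rng = random.Random(42)
vectors = [[rng.gauss(cx, 0.1), rng.gauss(cy, 0.1)]
           for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(100)]
rows = list(enumerate(vectors))
fragments = [rows[i::4] for i in range(4)]  # four fragments, one per worker

centroids = train_centroids(rng.sample(vectors, 60), k=3)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(lambda frag: index_fragment(frag, centroids), fragments))
inverted_index = merge_partials(partials)
print({cluster: len(ids) for cluster, ids in inverted_index.items()})
```

Only the small centroid list is broadcast to the workers and only row ID lists come back, so no single process has to hold all vectors in memory at once.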

4. AI Tag System – Dense and Sparse Labels with JSONB

Dense tags are added via Lance’s zero‑cost column addition. Sparse tags are stored in a JSON column; Lance encodes JSON as compact JSONB, enabling efficient storage and direct scalar indexing on JSON paths (e.g., user.name). A full‑text index flattens JSON objects into {path, type, value} triples, applies secondary tokenization (e.g., Jieba for Chinese strings), and builds a token index, allowing fast MATCH or PHRASE queries.
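The flatten‑then‑tokenize idea can be sketched as follows. This is an illustrative toy, not Lance's implementation: the function names are hypothetical, a lowercase word split stands in for a real tokenizer such as Jieba, and only a MATCH‑style lookup (all query tokens present) is shown, not PHRASE.

```python
import json
import re
from collections import defaultdict


def flatten(obj, prefix=""):
    """Yield (path, type, value) triples for every leaf of a JSON value."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for item in obj:
            yield from flatten(item, prefix)  # list items share the parent path
    else:
        yield (prefix, type(obj).__name__, obj)


def tokenize(text):
    # Stand-in tokenizer; a real pipeline might plug in Jieba for Chinese text.
    return re.findall(r"\w+", text.lower())


def build_token_index(docs):
    """Map (path, token) to the set of row IDs containing that token."""
    index = defaultdict(set)
    for row_id, raw in enumerate(docs):
        for path, kind, value in flatten(json.loads(raw)):
            if kind == "str":
                for token in tokenize(value):
                    index[(path, token)].add(row_id)
    return index


def match(index, path, query):
    """Rows whose value at `path` contains every token of `query`."""
    postings = [index.get((path, token), set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()


docs = [
    '{"user": {"name": "Alice Chen"}, "scene": "rainy night highway"}',
    '{"user": {"name": "Bob Li"}, "tags": ["night", "urban"]}',
    '{"scene": "sunny highway", "score": 0.9}',
]
index = build_token_index(docs)
print(match(index, "scene", "highway"))
print(match(index, "user.name", "alice"))
```

Because every triple carries its path, rows with entirely different sparse tags can share one index without a fixed schema.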

5. Shuffle Optimization – Shuffling Row IDs Instead of Full Rows

Before training, traditional shuffle moves entire rows containing large embeddings, causing high memory usage. Lance’s Row ID mechanism shuffles only lightweight Row ID lists, then workers use the take operation to fetch the required rows on demand, dramatically reducing memory and network overhead.
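A minimal sketch of the idea, assuming a dataset that supports random access by row ID. `ToyDataset` is a hypothetical in‑memory stand‑in whose `take` method mimics the take operation mentioned above; it is not the Lance API.

```python
import random


class ToyDataset:
    """In-memory stand-in for a table with random access by row ID."""

    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        return len(self._rows)

    def take(self, row_ids):
        """Fetch full rows for the given row IDs, in the given order."""
        return [self._rows[i] for i in row_ids]


def shuffled_batches(dataset, batch_size, seed):
    """Shuffle lightweight row IDs, then fetch each batch of rows on demand."""
    row_ids = list(range(len(dataset)))
    random.Random(seed).shuffle(row_ids)  # only integers are moved here
    for start in range(0, len(row_ids), batch_size):
        yield dataset.take(row_ids[start:start + batch_size])


dataset = ToyDataset([{"id": i, "embedding": [float(i)] * 8} for i in range(10)])
batches = list(shuffled_batches(dataset, batch_size=4, seed=7))
print([[row["id"] for row in batch] for batch in batches])
```

The shuffle itself touches one integer per row regardless of embedding size; the large payloads are read only when a worker asks for its batch.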

LAS AI Data Lake Architecture

The product, LAS AI Data Lake, is built on Lance and comprises four layers:

Storage Layer: Lance core with compatibility for Iceberg, Petastorm, WebDataset.

Compute Layer: Supports Spark, Daft, Ray and integrates large models (e.g., Doubao, DeepSeek).

AI Operator Layer: Hundreds of ready‑to‑use operators (PDF parsing, text cleaning, video frame extraction, etc.).

Agent/Skill Layer: Encapsulates common data‑processing capabilities as reusable Agent Skills.

A central catalog built on Trino and Hive Metastore provides automatic merging, smart cleanup, index management, versioning, and data exploration.

Real‑World Cases

Case 1 – Autonomous Driving Model Training

Data preprocessing time reduced from 7 days to 1 day (7× speedup) by replacing Argo‑based scheduling with Ray.

GPU utilization increased from < 50% to > 95% via a custom offline‑online release and multi‑model sharing solution.

Zero‑cost column addition, data compression, and branch‑based versioning lowered storage cost and simplified permission management.

Training data loading efficiency improved by 1.5× using Lance’s random read and Row ID shuffle.

Case 2 – Foundation Model Data Delivery

Introduced Branches to avoid physical intermediate tables; deletions become logical markers.

Zero‑cost column addition enabled efficient scoring on filtered branches.

Built‑in Timeline provides full data‑lineage traceability.

Summary and Future Directions

Volcano Engine’s work on Lance spans performance optimization, intelligent ops, index enhancement, and ecosystem integration, delivering measurable value in autonomous driving and large‑model pipelines.

Future focus areas:

Lance for Agent: Strengthen Branch/Tag support and improve edge‑cloud sync for AI agents.

Lance Partition: Add partitioning to manage ultra‑large tables and deepen Spark integration.

Blob V2 Evolution: Enable cross‑bucket, cross‑account multimodal data management and standardize Blob URLs for direct access.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
