GooseFS + Lance: Accelerating Vector Storage for the AI Era
The article explains how GooseFS integrates with the Lance vector format to overcome the IO bottlenecks of object storage, detailing native acceleration mechanisms such as namespace catalog services, event‑driven warm caching, automatic compaction, native transactions, and page‑level caching that together deliver up to three‑fold performance gains for AI workloads.
AI‑Driven Data Growth and the IO Dilemma
Large‑model and multimodal applications generate massive vector‑rich datasets. The Lance format provides native vector support, high‑performance random access, and zero‑cost schema evolution, making it a core component of the AI data stack.
Running Lance directly on bare object storage (e.g., Tencent Cloud COS) incurs high network latency, lacks multi‑writer transaction control, depends heavily on metadata operations such as List/Head, and struggles with multi‑table transaction requirements. These issues constitute the “IO dilemma” of Lance on object storage.
GooseFS + Lance: From Cache to Native Acceleration
GooseFS was originally built to place a cache close to compute clusters, achieving 2‑to‑10× performance gains in OLAP, lakehouse, and autonomous‑driving training workloads. To unlock deeper value for Lance, the system adopts and extends the Lance Namespace specification, turning GooseFS into an interactive catalog service rather than a transparent cache layer.
Namespace Integration
By implementing the Lance Namespace, GooseFS becomes a catalog service that receives all requests from upper‑level engines (Spark, Ray, DuckDB, etc.). It parses execution plans, extracts dependent datafile paths, and enables proactive optimizations such as data pre‑heating. Even with only basic caching, GooseFS delivers roughly 3× performance over raw object storage.
Engine Collaboration Architecture
The architecture consists of a file‑cache layer below and a catalog service above. The catalog service acts as the “brain”, implementing two core mechanisms:
Event Interception & Warm Cache
Background Compaction & Water‑Level Governance
1. Event Interception & Warm Cache
GooseFS registers a TableEventHandler that intercepts every table query or write operation. The handler uses the Lance SDK to resolve exact datafile paths and submits them to a Job Service for asynchronous pre‑heating, creating a Warm Cache.
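The interception flow can be sketched roughly as follows. This is a minimal illustration, not the GooseFS API: `WarmCacheHandler`, `TableEvent`, and `fake_resolve` are hypothetical names, a plain in-process queue stands in for the Job Service, and a fixed dictionary stands in for the Lance SDK path lookup.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Callable, List, Set


@dataclass(frozen=True)
class TableEvent:
    table: str
    operation: str  # "query" or "write"


class WarmCacheHandler:
    """Hypothetical handler: resolves a table's data files and queues them for pre-heating."""

    def __init__(self, resolve_paths: Callable[[str], List[str]], jobs: "Queue[str]") -> None:
        self._resolve_paths = resolve_paths  # assumed stand-in for the Lance SDK lookup
        self._jobs = jobs                    # assumed stand-in for the Job Service
        self._queued: Set[str] = set()       # paths already submitted, to avoid duplicates

    def on_event(self, event: TableEvent) -> int:
        """Queue every not-yet-submitted data file; return how many were queued."""
        submitted = 0
        for path in self._resolve_paths(event.table):
            if path not in self._queued:
                self._queued.add(path)
                self._jobs.put(path)  # non-blocking for the caller: loading happens elsewhere
                submitted += 1
        return submitted


# Illustrative catalog: table name -> data file paths (made-up values).
FAKE_CATALOG = {"images": ["images/data/0.lance", "images/data/1.lance"]}


def fake_resolve(table: str) -> List[str]:
    return FAKE_CATALOG.get(table, [])


jobs: "Queue[str]" = Queue()
handler = WarmCacheHandler(fake_resolve, jobs)
first = handler.on_event(TableEvent("images", "query"))   # queues both files
second = handler.on_event(TableEvent("images", "query"))  # nothing new to queue
```

The handler only enqueues work and returns, so the query or write that triggered the event is not blocked while the cache is being filled.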
Statistics‑driven predictive pre‑heat refines this process:
Hotness Tracking – continuously records access frequencies of tables, partitions, and columns.
Hotspot Prediction – identifies frequently accessed “hot tables” and “hot columns”.
Precise Pre‑heat – prioritizes and proactively loads data associated with hotspots into the cache.
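The three steps above can be illustrated with a small counter-based sketch. It assumes the simplest possible notion of hotness (raw access counts per table and column); `HotnessTracker` is a hypothetical name, and the article does not say which statistics or prediction model GooseFS actually uses.

```python
from collections import Counter
from typing import List, Tuple


class HotnessTracker:
    """Hypothetical tracker: counts accesses per (table, column) pair."""

    def __init__(self) -> None:
        self._hits: Counter = Counter()

    def record(self, table: str, column: str) -> None:
        # Hotness tracking: one increment per observed access.
        self._hits[(table, column)] += 1

    def hottest(self, k: int) -> List[Tuple[str, str]]:
        # Hotspot prediction (assumed): the k most frequently accessed pairs.
        return [key for key, _ in self._hits.most_common(k)]


tracker = HotnessTracker()
for _ in range(5):
    tracker.record("images", "embedding")
for _ in range(2):
    tracker.record("images", "caption")
tracker.record("logs", "message")

# Precise pre-heat: these pairs would be loaded into the cache first.
preheat_first = tracker.hottest(2)
```

A production tracker would probably decay old counts over time so that yesterday's hot table does not crowd out today's; that is left out here to keep the idea visible.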
2. Small‑File Governance via Background Compaction
Frequent writes generate many fragmented small files, degrading read performance. GooseFS monitors a “water‑level” threshold for small‑file count or size. When the threshold is crossed, the Job Service triggers a compaction task that merges numerous small files into larger ones and rebuilds related metadata and index caches, keeping storage efficiency and query performance healthy.
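A water-level check and a simple merge plan might look like the sketch below. The thresholds, the `CompactionPolicy` fields, and the greedy batching strategy are all assumptions made for illustration; the article does not describe the actual policy or default values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CompactionPolicy:
    small_file_bytes: int   # files below this size count as "small" (assumed definition)
    max_small_files: int    # water level: compaction triggers above this count
    target_file_bytes: int  # upper bound on the size of a merged file


def needs_compaction(file_sizes: List[int], policy: CompactionPolicy) -> bool:
    """True once the number of small files exceeds the water level."""
    small = [s for s in file_sizes if s < policy.small_file_bytes]
    return len(small) > policy.max_small_files


def plan_compaction(file_sizes: List[int], policy: CompactionPolicy) -> List[List[int]]:
    """Greedily group small files into batches no larger than the target size."""
    batches: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in sorted(s for s in file_sizes if s < policy.small_file_bytes):
        if current and current_bytes + size > policy.target_file_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Toy numbers: five 1-byte files, water level of three, merged files capped at 3 bytes.
policy = CompactionPolicy(small_file_bytes=2, max_small_files=3, target_file_bytes=3)
sizes = [1, 1, 1, 1, 1, 10]
should_run = needs_compaction(sizes, policy)
batches = plan_compaction(sizes, policy)
```

Each batch would become one merged file, after which the related metadata and index caches are rebuilt as described above.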
3. Native Transaction Support
Earlier object‑storage‑based solutions relied on Put‑If‑Not‑Exists semantics or external manifest stores, adding architectural complexity. GooseFS’s centralized Table Service maintains transaction‑related metadata (_versions, _manifests, _transactions) in memory, providing out‑of‑the‑box single‑table concurrent transaction support. Ongoing research targets multi‑table transactions and transaction merging for multi‑worker writes.
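To see why a centralized service simplifies single-table commits, consider the optimistic-concurrency sketch below. It is an assumption-laden illustration: `TableService.commit` and `CommitConflict` are hypothetical names, and the rule used here (reject any commit based on a stale version) is stricter than what a real implementation would likely apply.

```python
import threading
from typing import Dict, List


class CommitConflict(Exception):
    """Raised when a writer commits against a version that is no longer the latest."""


class TableService:
    """Hypothetical in-memory version store with optimistic single-table commits."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._history: Dict[str, List[str]] = {}  # table -> ordered manifest names

    def latest_version(self, table: str) -> int:
        with self._lock:
            return len(self._history.get(table, []))

    def commit(self, table: str, read_version: int, manifest: str) -> int:
        """Append a manifest if the writer saw the latest version; return the new version."""
        with self._lock:
            history = self._history.setdefault(table, [])
            if read_version != len(history):
                raise CommitConflict(
                    f"{table}: based on v{read_version}, latest is v{len(history)}"
                )
            history.append(manifest)
            return len(history)


service = TableService()
base = service.latest_version("images")              # both writers read version 0
v_a = service.commit("images", base, "manifest-a")   # writer A wins
try:
    service.commit("images", base, "manifest-b")     # writer B is stale
    conflict = False
except CommitConflict:
    conflict = True
# Writer B re-reads the latest version and retries.
v_b = service.commit("images", service.latest_version("images"), "manifest-b")
```

Because the version check and the append happen under one lock in one process, no conditional-put primitive or external manifest store is needed for this single-table case.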
4. Mini‑Block Point‑Lookup Acceleration with Paged Cache
Lance’s Mini Block structure avoids decompressing entire row groups for point queries. GooseFS introduces a paged cache that stores only the tiny chunks corresponding to a Mini Block, eliminating the read‑amplification of traditional block‑level caches.
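The read-amplification argument can be made concrete with a little arithmetic. The sketch below compares how many bytes a cache must fetch to serve one small read when it caches in large blocks versus small pages. The 4 MiB block size, 4 KiB page size, and 6 KiB read are assumed values chosen for illustration, not figures from the article.

```python
def aligned_fetch_bytes(offset: int, length: int, unit: int) -> int:
    """Bytes fetched to serve [offset, offset + length) when caching in fixed-size units."""
    first = offset // unit
    last = (offset + length - 1) // unit
    return (last - first + 1) * unit


def read_amplification(offset: int, length: int, unit: int) -> float:
    """Ratio of bytes fetched to bytes actually requested."""
    return aligned_fetch_bytes(offset, length, unit) / length


KIB = 1024
MIB = 1024 * KIB

# Assumed scenario: a point lookup needs one 6 KiB chunk located a little past 10 MiB.
offset = 10 * MIB + 1000
length = 6 * KIB

block_bytes = aligned_fetch_bytes(offset, length, 4 * MIB)  # whole 4 MiB block
page_bytes = aligned_fetch_bytes(offset, length, 4 * KIB)   # two 4 KiB pages

block_amp = read_amplification(offset, length, 4 * MIB)
page_amp = read_amplification(offset, length, 4 * KIB)
```

Under these assumptions the block cache pulls 4 MiB to answer a 6 KiB request, while the paged cache pulls 8 KiB, which is the gap a page-level cache is meant to close.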
Benchmarks on CLIP‑based diffusion model queries show:
≈ 3× performance gain over raw object storage.
More than 2× gain compared with conventional block‑cache systems.
Practice and Outlook
The GooseFS + Lance stack is deployed in Tencent’s OpenCloud Memory Store, providing an efficient, scalable data foundation for Agent applications.
Future directions include:
Native Query Capability – integrating lightweight query processing (e.g., via Ray) to bring computation closer to storage.
Deep Integration with Vector‑Bucket Products – combining with Tencent Cloud’s vector bucket services to build a full‑stack AI data solution from hardware to applications.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
