How Adaptive Structural Encoding Boosts Random Access in Columnar Storage
This article examines how adaptive structural encoding in columnar formats like Lance dramatically improves random‑access performance on NVMe storage, compares it with Apache Parquet and Arrow, and discusses the trade‑offs between scan speed, memory usage, and compression.
Abstract
Artificial intelligence workloads increasingly require a mix of sequential and random accesses. NVMe‑based storage can serve as a cache for large columnar datasets stored in the cloud, but existing columnar formats under‑utilize NVMe performance, especially for random reads.
The paper evaluates Apache Arrow, Apache Parquet, and Lance on NVMe for random‑access and full‑scan tasks. Efficient column‑level encoding of repetition and validity information is key to unlocking disk performance. With proper tuning, Parquet’s random‑access performance can improve by more than 60×, but at the cost of slower scans and higher memory usage. Lance’s adaptive structural encoding avoids these trade‑offs and delivers superior random‑access performance without sacrificing scan throughput.
Introduction
Modern data lakes and warehouses rely on columnar representation for compression and vectorized computation, storing data in cloud object stores that offer high bandwidth but lower IOPS than local NVMe. Formats such as Parquet and ORC excel at full‑table scans, yet perform poorly on search‑oriented workloads that need many small random reads.
Figure 1 (illustrated below) compares an NVMe SSD (Samsung 970 EVO Plus) with S3 (c7gn.8xlarge), showing that S3 is limited to tens of thousands of IOPS and offers no advantage for reads under ~100 KB, whereas NVMe can sustain hundreds of thousands of IOPS with 4 KiB sector reads.
Cache‑layer solutions (e.g., WekaFS, JuiceFS, S3 Express) mitigate this gap, but the underlying file format still determines random‑access efficiency.
Overview
Columnar formats store tables column‑first, requiring at least one I/O per column for row‑wise random access. Inefficient encoding can increase IOPS, read amplification, and CPU overhead.
The paper proceeds to describe generic columnar structures, then details compression and structural encoding, introduces Lance’s adaptive scheme, and finally benchmarks existing formats against Lance.
Columnar Storage
Columnar storage encodes tables into binary files with components such as row groups, column chunks, and pages. Apache Parquet (v54) provides a page‑offset lookup structure crucial for random access. Apache Arrow IPC encodes Arrow’s in‑memory format with limited compression. Lance (v2.1, experimental) aims to balance scan and random access.
File Format
All formats share similar components: row groups (collections of rows), column chunks (contiguous buffers per column), and pages (sub‑chunks within column chunks). Lance uses “disk pages” analogous to Parquet pages when miniblock encoding is enabled.
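To make the shared hierarchy concrete, here is a minimal sketch of these components in Python. The class and field names are illustrative, not taken from any format's actual metadata schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Page:
    """Smallest unit of I/O within a column chunk (a Parquet page,
    or a Lance "disk page" under miniblock encoding)."""
    offset: int        # byte offset of the page within the file
    num_values: int    # logical values covered by this page
    data: bytes = b""

@dataclass
class ColumnChunk:
    """Contiguous region holding one column's data for a row group,
    subdivided into pages."""
    column: str
    pages: List[Page] = field(default_factory=list)

@dataclass
class RowGroup:
    """Horizontal slice of the table; each column contributes one chunk."""
    chunks: List[ColumnChunk] = field(default_factory=list)
```

The nesting mirrors the text: a file holds row groups, a row group holds one chunk per column, and chunks are split into pages that bound the cost of a single random read.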
Compression
Compression schemes are tied to structural encoding. Transparent encodings (bit‑packing, FSST, dictionary) allow independent decompression of values, while opaque encodings (Snappy, delta‑length) require batch decompression. Sparse compression removes nulls and stores validity bitmaps separately; dense compression retains placeholders to enable random access.
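The sparse/dense distinction can be sketched with plain Python lists standing in for on‑disk buffers (the layouts are simplified for illustration):

```python
# Dense layout: a placeholder is kept for every null, so value i lives at
# physical slot i and positional access is O(1).
dense_values = [10, 0, 30, 0, 50]   # 0 is a placeholder for null
validity     = [1, 0, 1, 0, 1]      # validity bitmap (1 = valid)

def dense_get(i):
    return dense_values[i] if validity[i] else None

# Sparse layout: nulls are removed from the value buffer, so positional
# access first needs a rank query (popcount of valid bits before i) to
# map the logical position to a physical slot.
sparse_values = [10, 30, 50]

def sparse_get(i):
    if not validity[i]:
        return None
    rank = sum(validity[:i])        # popcount over the bitmap prefix
    return sparse_values[rank]
```

Sparse compression saves the placeholder bytes but makes each random read pay for the rank computation (or an auxiliary index); dense compression spends the bytes to keep lookups direct.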
Evaluation Criteria
Random‑access performance is measured by IOPS per value and read amplification, assuming “hot” search where metadata is cached in memory. Full‑scan performance focuses on compression ratio and total data read, as a single I/O can retrieve an entire column chunk.
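Read amplification is simply the ratio of physical bytes fetched to logical bytes requested; a quick sketch shows why page size dominates it for point lookups:

```python
def read_amplification(bytes_read: int, bytes_needed: int) -> float:
    """Ratio of physical bytes read to logical bytes the query needed."""
    return bytes_read / bytes_needed

# Fetching a single 4-byte integer from an 8 KiB page reads 2048x more
# data than the query actually needs.
amp = read_amplification(8 * 1024, 4)
```

This is why smaller pages help random access but hurt scans: they shrink the numerator per lookup while multiplying the metadata and I/O count for sequential reads.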
Columnar Encoding Strategies
Encoding consists of structural encoding (shredding nested arrays into leaf buffers) followed by compressive encoding. Structural encoding determines how column chunks are split into buffers and directly impacts random‑access efficiency.
Parquet Structural Encoding Scheme
Parquet flattens nested arrays into leaf columns containing a list of values (nulls removed), repetition levels, and definition levels. Values are stored in pages, each accompanied by its repetition and definition buffers. Page‑offset indexes serve as a search cache, enabling a binary search to locate the target page, so the read amplification of a point lookup is bounded by the page size.
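The page‑lookup step can be sketched as a binary search over an in‑memory page‑offset index (the array contents here are made up for illustration):

```python
import bisect

# First-row index of each page: page i covers rows
# [page_first_row[i], page_first_row[i + 1]). This small index is the
# "search cache" kept in memory.
page_first_row = [0, 1000, 2000, 3000]

def locate_page(row: int) -> int:
    """Binary-search the page-offset index for the page containing `row`."""
    return bisect.bisect_right(page_first_row, row) - 1
```

Reading row 2500 touches only page 2; the lookup itself costs no I/O, and the single page read sets the read amplification.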
Arrow Structural Encoding Scheme
Arrow also flattens nested data but stores dense columns where nulls are padded with placeholder bytes. It uses offset arrays for repeated structures and validity bitmaps for definitions, without pages or chunking. Compression applies to entire column chunks, making Arrow suitable for in‑memory random access but less efficient for disk‑based workloads, especially for nested or variable‑length types such as List<String>, which may require multiple I/O operations per value.
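The multi‑buffer cost for List&lt;String&gt; is easy to see in Arrow's layout: reading one row chases a list‑offset array, a string‑offset array, and the data buffer in sequence. A minimal sketch with Python lists standing in for the three buffers:

```python
# Arrow-style List<String> column holding [["a", "bb"], [], ["ccc"]]:
list_offsets = [0, 2, 2, 3]   # row i spans items [list_offsets[i], list_offsets[i+1])
str_offsets  = [0, 1, 3, 6]   # item j spans bytes [str_offsets[j], str_offsets[j+1])
data         = b"abbccc"

def get_row(i):
    """Each of the three buffers is a dependent lookup -- on disk, that
    is potentially one I/O per buffer for a single row."""
    start, end = list_offsets[i], list_offsets[i + 1]
    return [data[str_offsets[j]:str_offsets[j + 1]].decode()
            for j in range(start, end)]
```

In memory the three array indexings are trivially cheap, which is why this layout suits Arrow's in‑memory role; on disk, the dependent reads are exactly the multiple‑I/O‑per‑value problem described above.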