
Improving Interactive Analysis on Massive Datasets with Data Clustering and Data Skipping Using Spark and Iceberg

This article explores how data clustering techniques such as linear order, Z‑order, and Hilbert‑curve ordering can be applied in Apache Spark and Apache Iceberg to achieve efficient data skipping on terabyte‑scale tables, dramatically reducing file scans and enabling sub‑second interactive analytics for multi‑dimensional queries.


Interactive analysis is a key requirement for big‑data workloads, but full‑table scans on TB‑ or PB‑scale datasets cannot meet sub‑second response times; data clustering and data skipping aim to read only the relevant files by aligning storage layout with filter predicates.

Data skipping relies on tight cooperation between the SQL engine (e.g., filter push‑down) and the storage layer (e.g., Min/Max statistics, Bloom filters), spanning engines such as Hive, Spark, and Presto, table formats such as Hudi and Iceberg, and file formats such as Parquet and ORC.
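The core of file-level data skipping is simple: if a file's per-column Min/Max range cannot contain the predicate value, the file is never read. A minimal Python sketch (hypothetical stats, not Iceberg's actual metadata API):

```python
# Illustrative sketch of Min/Max-based file pruning. The stats layout and
# column values below are hypothetical, not Iceberg's real metadata format.

def can_skip(file_stats, column, value):
    """Skip the file if the predicate value falls outside its [min, max]."""
    lo, hi = file_stats[column]
    return value < lo or value > hi

files = [
    {"s_city": ("ABERDEEN", "CHICAGO")},   # per-file min/max for s_city
    {"s_city": ("DALLAS", "HOUSTON")},
    {"s_city": ("MIAMI", "SEATTLE")},
]

# Only files whose range can contain 'DENVER' must be scanned.
to_read = [i for i, f in enumerate(files) if not can_skip(f, "s_city", "DENVER")]
# -> only file 1 is read; files 0 and 2 are skipped
```

The effectiveness of this check depends entirely on how tight the per-file ranges are, which is exactly what the clustering strategies below try to optimize.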

The experiments are built on Apache Spark and Apache Iceberg, extending Spark to support custom data clustering and measuring the impact on data skipping for the Star Schema Benchmark (SSB) at scale factor 100.

Apache Spark provides a flexible SQL/DataFrame API and a high‑performance runtime, while Apache Iceberg offers table‑level metadata, fine‑grained transactions, file‑level indexing, and automatic compaction, making it suitable for large‑scale analytical tables.

SSB is used as the benchmark; three filter columns (s_city, c_city, p_brand) with different cardinalities are selected. A wide table of 533,363,833 rows is created with CREATE TABLE lo_iceberg USING iceberg AS SELECT * FROM lineorder JOIN dates ... DISTRIBUTE BY random(), so the initial layout distributes rows randomly across files.

Data organization strategies such as partition directories, file merging parameters (e.g., hive.merge.mapredfiles), and columnar formats with row‑group indexes are discussed, emphasizing that optimal clustering can dramatically shrink the Min/Max range of filtered columns.

Linear order (global ORDER BY vs. partition‑level SORT BY) is evaluated using Spark's repartitionByRange to enforce ordering on the three filter columns. Results show that only the first ordering column (s_city) benefits from data skipping (99.9% of files skipped), while filters on the other two columns still scan all 1,000 files.
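Why a linear sort only helps the leading column can be seen with a small simulation (hypothetical random data, not the SSB table): sort rows by (col 0, col 1), split them into equal "files", and count how many files an equality filter must scan based on per-file Min/Max ranges.

```python
import random

# Hypothetical two-column dataset; a lexicographic sort clusters col 0 only.
random.seed(0)
rows = sorted((random.randrange(100), random.randrange(100)) for _ in range(10_000))

def files_scanned(rows, n_files, col, value):
    """Count files whose [min, max] range on `col` contains `value` --
    i.e., the files that Min/Max data skipping cannot eliminate."""
    size = len(rows) // n_files
    scanned = 0
    for i in range(n_files):
        vals = [r[col] for r in rows[i * size:(i + 1) * size]]
        if min(vals) <= value <= max(vals):
            scanned += 1
    return scanned

# Col 0 is tightly clustered per file, so an equality filter on it skips
# almost all 100 files; col 1 still spans its full range inside every file,
# so a filter on it skips almost nothing.
scanned_first = files_scanned(rows, 100, 0, 42)    # a handful of files
scanned_second = files_scanned(rows, 100, 1, 42)   # close to all 100 files
```

This mirrors the experiment: the trailing sort columns inherit essentially random per-file ranges, which is what motivates interleaved orderings.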

Interleaved order (Z‑order) is introduced as a way to interleave bits of multiple columns into a single z‑value, enabling simultaneous clustering on several dimensions. Challenges such as handling signed integers, varying bit widths, and non‑numeric types are described.
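The bit interleaving at the heart of Z-ordering fits in a few lines. This is a Python illustration of the technique for non-negative fixed-width integers, not the authors' Spark implementation (which must also handle the signed, variable-width, and non-numeric cases mentioned above):

```python
def interleave_bits(values, bits=8):
    """Build a z-value by interleaving the bits of several non-negative
    integers, most significant bit first, one bit per column per round."""
    z = 0
    for i in range(bits - 1, -1, -1):          # bit position, MSB first
        for v in values:
            z = (z << 1) | ((v >> i) & 1)
    return z

# Two 2-bit coordinates (x=0b01, y=0b01): z interleaves x1 y1 x0 y0 = 0b0011.
assert interleave_bits([0b01, 0b01], bits=2) == 3
```

Sorting rows by this z-value places rows that are close in *all* interleaved dimensions near each other, which is what tightens every column's per-file Min/Max range at once.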

A boundary‑based interleaved index is proposed: sample the data, compute a limited set of sorted boundaries for each filter column, map each value to its boundary index (a contiguous integer starting at 0), and then compute the z‑value. This approach works for all data types.
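The boundary-mapping step can be sketched as follows (a simplified Python illustration; the boundary values and city names are hypothetical, and real boundary sampling in Spark is done with a reservoir sample per partition):

```python
import bisect

def compute_boundaries(sorted_sample, n_partitions):
    """Pick n_partitions - 1 evenly spaced boundary values from a sorted
    sample (a simplification of range-partition boundary sampling)."""
    step = len(sorted_sample) / n_partitions
    return sorted({sorted_sample[int(i * step)] for i in range(1, n_partitions)})

def boundary_index(value, boundaries):
    """Map a value of any ordered type (numbers, strings, dates, ...) to a
    small contiguous integer starting at 0, usable as a z-order input."""
    return bisect.bisect_left(boundaries, value)

bounds = ["CHICAGO", "MIAMI"]                       # hypothetical boundaries
idx = [boundary_index(c, bounds) for c in ["ATLANTA", "DALLAS", "SEATTLE"]]
# ATLANTA -> 0, DALLAS -> 1, SEATTLE -> 2
```

Because every column is first reduced to a small dense integer, the interleaving step no longer cares about sign, bit width, or type, and the z-value stays short even for high-cardinality columns.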

The Z‑order implementation extends Spark with a custom operator: spark.read.table("hive_catalog.ssb.lo_iceberg").repartitionByZOrderRange(1000, $"s_city", $"c_city", $"p_brand").writeTo("hive_catalog.ssb.lo_iceberg_zorder").using("iceberg").create. After rewriting, file scans drop to 186 of 1,000 files (81.4% skipped) for s_city, 164 (83.6% skipped) for c_city, and 135 (86.5% skipped) for p_brand.
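A small simulation suggests why every column now benefits (hypothetical two-column data, not the SSB results): sorting rows by their interleaved z-value, splitting into equal "files", and checking per-file Min/Max ranges yields a useful skip ratio on both columns, where a linear sort only helps the first.

```python
import random

# Hypothetical 64x64 domain with two filter columns.
random.seed(0)
rows = [(random.randrange(64), random.randrange(64)) for _ in range(8192)]

def z_value(x, y, bits=6):
    """Interleave the bits of two 6-bit coordinates into one z-value."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1)
    return z

def files_scanned(rows, n_files, col, value):
    """Count files whose [min, max] range on `col` contains `value`."""
    size = len(rows) // n_files
    scanned = 0
    for i in range(n_files):
        vals = [r[col] for r in rows[i * size:(i + 1) * size]]
        if min(vals) <= value <= max(vals):
            scanned += 1
    return scanned

linear = sorted(rows)                                  # clusters column 0 only
zorder = sorted(rows, key=lambda r: z_value(*r))       # interleaves both

# Linear layout: filters on column 1 scan nearly all 64 files.
# Z-order layout: filters on either column skip the large majority of files.
lin0, lin1 = files_scanned(linear, 64, 0, 7), files_scanned(linear, 64, 1, 7)
z0, z1 = files_scanned(zorder, 64, 0, 7), files_scanned(zorder, 64, 1, 7)
```

The trade-off is also visible here: z-order gives each column a moderate skip ratio instead of giving one column a near-perfect one.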

Hilbert‑curve ordering is presented as an alternative that preserves spatial locality better than Z‑order. A similar boundary‑based Hilbert index is built and applied with spark.read.table("hive_catalog.ssb.lo_iceberg").repartitionByHilbertRange(1000, $"s_city", $"c_city", $"p_brand").writeTo("hive_catalog.ssb.lo_iceberg_hilbert").using("iceberg").create. The resulting scans are 145 files (85.5% skipped) for s_city, 131 (86.9% skipped) for c_city, and 117 (88.3% skipped) for p_brand, a further improvement over Z‑order.
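The classic iterative algorithm maps a 2-D point to its distance along the Hilbert curve; a Python sketch (the well-known textbook algorithm, not the authors' multi-column Spark implementation):

```python
def hilbert_index(n, x, y):
    """Map (x, y) on an n x n grid (n a power of two) to its distance
    along the Hilbert curve, via per-level quadrant rotate/flip."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# On a 2x2 grid the curve visits (0,0), (0,1), (1,1), (1,0) in order.
order = sorted([(0, 0), (0, 1), (1, 0), (1, 1)],
               key=lambda p: hilbert_index(2, *p))
```

Unlike the z-curve, consecutive Hilbert indices are always adjacent grid cells (the z-curve makes long diagonal jumps), which is why per-file value ranges tend to be tighter and the skip ratios above edge out Z-order.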

The paper concludes that Z‑order and Hilbert‑curve clustering can dramatically increase data‑skipping ratios for multiple filter columns, and suggests future work on weight‑based clustering and machine‑learning‑driven data‑clustering strategies.

For collaboration, the authors invite readers to contact [email protected].

Tags: Big Data, Spark, Iceberg, Data Skipping, Hilbert Curve, Data Clustering, Z‑Order
Written by Big Data Technology Architecture (Exploring Open Source Big Data and AI Technologies)
