
Accelerate Lakehouse Queries: A Hands‑On Guide to StarRocks + Apache Iceberg

This tutorial covers the fundamentals of Apache Iceberg, including its architecture and key features, explains why it is advantageous for lakehouse workloads, and walks step by step through a Docker Compose setup that integrates Iceberg with StarRocks for fast, ACID‑compliant analytics on real‑world taxi data.


Apache Iceberg Overview

Apache Iceberg is an open‑source table format for petabyte‑scale datasets. It sits between compute engines (Spark, Flink, etc.) and storage formats (Parquet, ORC, Avro), providing a unified table abstraction that works on HDFS, S3, OSS and other object stores.

Architecture

Data layer : stores actual data files (Parquet, ORC, …).

Metadata layer : multi‑level metadata that records schema, partitioning, snapshots and manifests.

Catalog layer : points to the metadata location; implementations include HadoopCatalog, HiveCatalog and REST catalog.

Metadata management consists of the following structures (an inspection sketch follows the list):

Metadata file : JSON file that records the current table version and list of snapshots.

Snapshot : immutable point‑in‑time view; each commit creates a new snapshot that references one or more manifest files.

Manifest : a list of data files (with file path, format, partition values, metrics) that belong to a snapshot, enabling fast pruning.
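Iceberg also exposes these structures as queryable metadata tables, which makes the hierarchy easy to see in practice. A minimal sketch in Spark SQL (run via spark.sql() in the PySpark shell used later), assuming the demo.nyc.greentaxis table created in this tutorial:

-- Snapshots: one row per commit to the table
SELECT snapshot_id, committed_at, operation FROM demo.nyc.greentaxis.snapshots;

-- Manifest files referenced by the current snapshot
SELECT path, added_data_files_count FROM demo.nyc.greentaxis.manifests;

-- Data files, with the per-file metrics used for pruning
SELECT file_path, record_count, file_size_in_bytes FROM demo.nyc.greentaxis.files;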

Iceberg’s core capability is snapshot‑based versioning, which tracks every change, provides atomic writes, and enables incremental reads and time‑travel queries.
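For example, once the demo.nyc.greentaxis table from this tutorial exists, a past state can be queried with Spark SQL’s time‑travel clauses (Spark 3.3+). A hedged sketch; the timestamp and snapshot ID below are placeholders:

-- Read the table as it existed at a point in time
SELECT count(*) FROM demo.nyc.greentaxis TIMESTAMP AS OF '2024-01-01 00:00:00';

-- Read a specific snapshot by ID (look IDs up in the snapshots metadata table)
SELECT count(*) FROM demo.nyc.greentaxis VERSION AS OF 1234567890;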

Key Features

Hidden partitioning : partition values are derived from column transforms (for example, days(ts)), so queries filter on natural columns and Iceberg prunes partitions automatically without exposing partition columns to the user.

Schema evolution : add, drop, rename columns without rewriting existing files; history is retained.

Partition evolution : change partition spec over time while preserving older data; schema and partition evolution are both sketched after this list.

Multi‑Version Concurrency Control (MVCC) : writers create new snapshots; readers see a consistent snapshot without blocking.

Optimistic locking : concurrent writes are validated against the current snapshot to guarantee atomicity.

Row‑level updates : V1 uses copy‑on‑write; V2 adds merge‑on‑read with position and equality deletes.
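A minimal sketch of schema evolution, a hidden transform‑based partition change, and a row‑level delete in Spark SQL, assuming Iceberg’s SQL extensions are enabled (the spark‑iceberg image used below configures them) and that the tip_bucket column is purely illustrative:

-- Schema evolution: add a column without rewriting existing data files
ALTER TABLE demo.nyc.greentaxis ADD COLUMNS (tip_bucket STRING);

-- Partition evolution with a hidden, transform-based partition field;
-- existing data keeps its old layout, new writes follow the new spec
ALTER TABLE demo.nyc.greentaxis ADD PARTITION FIELD days(lpep_pickup_datetime);

-- Row-level delete; on a format V2 table this can be executed as merge-on-read delete files
DELETE FROM demo.nyc.greentaxis WHERE passenger_count = 0;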

StarRocks × Iceberg Query Acceleration

Metadata caching : StarRocks caches Iceberg metadata and parallelizes manifest reads to reduce I/O.

Cost‑Based Optimizer (CBO) : uses Iceberg statistics (including column histograms) to generate efficient execution plans.

File format tuning : optimized Parquet/ORC readers lower scan volume and I/O.

Internal‑external table unification : data cache and smart materialized views hide differences between native and external tables, supporting query rewrite and incremental refresh (a materialized‑view sketch follows this list).

Native Iceberg support : StarRocks can read and write Iceberg tables via the external catalog without data migration.
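As a sketch of the materialized‑view path mentioned above: an asynchronous materialized view can be defined directly over the external Iceberg table, with the view itself living in the internal catalog. This assumes a recent StarRocks release with async materialized views on external catalogs, plus the iceberg catalog configured later in this tutorial:

-- Run from a database in the default (internal) catalog; the view reads the external Iceberg table.
-- Newer StarRocks releases pick a distribution automatically; older ones may need DISTRIBUTED BY.
CREATE MATERIALIZED VIEW trips_per_hour
REFRESH ASYNC EVERY (INTERVAL 1 HOUR)
AS
SELECT hour(lpep_pickup_datetime) AS hour_of_day,
       count(*) AS trips
FROM iceberg.nyc.greentaxis
GROUP BY hour_of_day;

Queries that aggregate trips by pickup hour can then be transparently rewritten against the view, while the scheduled refresh keeps it in sync with the Iceberg table.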

Quick‑Start Tutorial

Deploy the environment : use Docker Compose to start six containers (MinIO object storage, Spark‑Iceberg, Iceberg REST catalog, StarRocks FE/BE, etc.).

Download the Docker Compose file and the sample dataset :

mkdir iceberg
cd iceberg
curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/iceberg/docker-compose.yml
curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/iceberg/datasets/green_tripdata_2023-05.parquet

Start the containers : docker compose up -d

Copy the Parquet file into the Spark‑Iceberg container :

docker compose cp green_tripdata_2023-05.parquet spark-iceberg:/opt/spark/

Launch PySpark : docker compose exec -it spark-iceberg pyspark

Read the dataset into a DataFrame and verify :

# Read Parquet
df = spark.read.parquet("/opt/spark/green_tripdata_2023-05.parquet")
df.printSchema()
df.select(df.columns[:7]).show(3)

Create an Iceberg table and write the DataFrame : df.writeTo("demo.nyc.greentaxis").create()

Configure StarRocks to access the Iceberg catalog (run the following SQL in a client connected to the StarRocks FE) :

CREATE EXTERNAL CATALOG iceberg
PROPERTIES (
  "type"="iceberg",
  "iceberg.catalog.type"="rest",
  "iceberg.catalog.uri"="http://iceberg-rest:8181",
  "iceberg.catalog.warehouse"="warehouse",
  "aws.s3.access_key"="admin",
  "aws.s3.secret_key"="password",
  "aws.s3.endpoint"="http://minio:9000",
  "aws.s3.enable_path_style_access"="true",
  "client.factory"="com.starrocks.connector.iceberg.IcebergAwsClientFactory"
);
SHOW CATALOGS;
SET CATALOG iceberg;
SHOW DATABASES;

Query Iceberg data from StarRocks (switch to the nyc database created by Spark, then query the table) :

USE nyc;
SELECT lpep_pickup_datetime FROM greentaxis LIMIT 10;
SELECT COUNT(*) AS trips,
       hour(lpep_pickup_datetime) AS hour_of_day
FROM greentaxis
GROUP BY hour_of_day
ORDER BY trips DESC;
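Because the catalog supports writes as well (see the native Iceberg support note above), results can be written back to Iceberg from StarRocks. A hedged sketch, assuming StarRocks 3.1 or later; the hourly_trips table name is illustrative:

-- Materialize an aggregate into a new Iceberg table in the same catalog
CREATE TABLE iceberg.nyc.hourly_trips AS
SELECT hour(lpep_pickup_datetime) AS hour_of_day,
       count(*) AS trips
FROM greentaxis
GROUP BY hour_of_day;

-- Append fresh rows later with a plain INSERT
INSERT INTO iceberg.nyc.hourly_trips
SELECT hour(lpep_pickup_datetime) AS hour_of_day,
       count(*) AS trips
FROM greentaxis
GROUP BY hour_of_day;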

Diagram of Docker Compose Services

Figure: Docker Compose services diagram (the containers started by docker compose up -d).
Tags: data engineering, Docker, SQL, StarRocks, Apache Iceberg, lakehouse, PySpark
Written by StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to build an efficient, unified lakehouse architecture. It is widely used across industries worldwide, helping companies strengthen their data analytics capabilities.
