Unlocking Data Lake Power: Iceberg Architecture & StarRocks Acceleration
Apache Iceberg offers a modern, ACID‑compliant table format for data lakes, with features such as hidden partitioning and schema evolution. StarRocks addresses Iceberg's query‑latency challenges with high‑performance execution, metadata caching, and distributed planning, enabling seamless lake‑warehouse integration and real‑time analytics.
Iceberg Overview
Apache Iceberg is an open‑source table format designed for large‑scale analytical data in data lakes. It supports ACID transactions, hidden partitions, schema evolution, partition evolution, MVCC, and consistent reads, making it suitable for demanding big‑data workloads.
Challenges of Traditional Hive Tables
Row‑level updates are prohibitively expensive, leaving no support for near‑real‑time scenarios.
Missing ACID support before Hive 3.0, leading to concurrency issues.
Write operations lack atomicity, causing partial overwrites.
Schema changes require full table rebuild.
Partition scheme cannot be altered without recreating tables.
File‑level metadata is absent, forcing expensive filesystem scans.
Partition values cannot be derived automatically from data columns; users must compute and supply them explicitly.
Statistics updates are delayed, hurting optimizer decisions.
Compatibility issues with object stores such as Amazon S3.
Iceberg Architecture
Data Layer : Stores actual data files (Parquet, ORC, etc.).
Metadata Layer : Multi‑level metadata that tracks table schema and file indexes.
Catalog Layer : Pointers to metadata locations; implementations include HadoopCatalog (uses HDFS directories) and HiveCatalog (stores metadata in Hive Metastore).
Key Iceberg Features
Hidden Partition : Users can define time‑based partitions (e.g., days) that are transparent to queries. Example:
CREATE TABLE catalog.MyTable (..., ts TIMESTAMP) PARTITIONED BY (days(ts))
Schema Evolution : Add, drop, rename, or modify columns without rewriting data; each change is recorded in metadata with column IDs for accurate reads and writes.
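Schema changes like these are expressed as ordinary ALTER TABLE statements. A minimal sketch in Spark SQL syntax for Iceberg (table and column names are hypothetical):

```sql
-- Each statement below updates only table metadata; no data files are rewritten.
ALTER TABLE catalog.MyTable ADD COLUMN region STRING;
ALTER TABLE catalog.MyTable RENAME COLUMN region TO sales_region;
ALTER TABLE catalog.MyTable DROP COLUMN sales_region;
```

Because every column carries a stable ID in metadata, old data files written before a rename or drop are still read correctly.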
Partition Evolution : New data can adopt a new partition scheme while existing files retain the old scheme; metadata preserves historic partition schemas.
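Partition evolution is likewise a metadata‑only operation. A sketch in Spark SQL syntax for Iceberg, switching a table from daily to hourly partitioning (table name hypothetical):

```sql
-- Existing files keep the old daily scheme; new writes use the hourly scheme.
ALTER TABLE catalog.MyTable ADD PARTITION FIELD hours(ts);
ALTER TABLE catalog.MyTable DROP PARTITION FIELD days(ts);
```

Queries that filter on ts are planned against both schemes, so no backfill is required.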
MVCC (Multi‑Version Concurrency Control) : Guarantees snapshot isolation; delete/overwrite modify manifest files without deleting underlying data files.
Consistency : Optimistic locking handles concurrent writes; conflicts result in only one successful write.
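Because snapshots are retained under MVCC, past table states remain queryable via time travel. A sketch in Spark SQL syntax (snapshot ID and timestamp are hypothetical):

```sql
-- Read the table as of a specific snapshot.
SELECT * FROM catalog.MyTable VERSION AS OF 4348295937125;

-- Or as of a wall-clock time.
SELECT * FROM catalog.MyTable TIMESTAMP AS OF '2024-01-01 00:00:00';
```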
Row‑level Updates :
COW (Copy‑On‑Write) – V1 tables copy data and apply changes on a new copy.
MOR (Merge‑On‑Read) – V2 tables use position deletes and equality deletes, applying logical deletions at read time.
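The write strategy is controlled by table properties. A sketch, assuming Spark SQL syntax and the standard Iceberg property names, that upgrades a table to V2 and opts into merge‑on‑read:

```sql
ALTER TABLE catalog.MyTable SET TBLPROPERTIES (
    'format-version' = '2',                  -- V2 enables delete files
    'write.delete.mode' = 'merge-on-read',   -- deletes write position/equality delete files
    'write.update.mode' = 'merge-on-read'    -- updates avoid rewriting whole data files
);
```

COW favors read‑heavy workloads; MOR shifts the cost from write time to read time, which suits frequent small updates.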
Performance Challenges When Querying Iceberg
Iceberg records detailed file‑level metadata, so scanning large tables can be slow. Manifest files are often heavily compressed (e.g., an 8 MB manifest may expand to 300 MB in memory). Parsing these manifests can take ~1 second per file in C++, and in big‑metadata scenarios planning can take tens of minutes.
Even with multithreaded streaming, high concurrency can amplify GC pauses, making query latency unacceptable for real‑time analytics.
StarRocks as an Acceleration Engine for Iceberg
StarRocks can query external Iceberg tables via its External Catalog without data migration. It supports Iceberg V1 and V2 (including position and equality deletes) and can write back to Iceberg tables.
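Registering an Iceberg catalog in StarRocks is a one‑time DDL step; a minimal sketch, assuming a Hive Metastore backend (host, database, and table names are hypothetical):

```sql
-- Register an external Iceberg catalog backed by a Hive Metastore.
CREATE EXTERNAL CATALOG iceberg_catalog
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
);

-- Query the Iceberg table in place, without migrating data.
SELECT count(*) FROM iceberg_catalog.sales_db.orders;
```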
Metadata Retrieval and Parsing Optimizations
Metadata cache reduces repeated remote I/O.
From version 3.3, StarRocks introduces a distributed Job Plan that parallelizes manifest reading and filtering across compute nodes, returning a unified result to the front‑end.
Manifest Cache stores deserialized manifests, avoiding repeated decompression and decoding.
Cost‑Based Optimizer (CBO) Enhancements
Since version 3.2, StarRocks collects statistics for external tables (including Hive and Iceberg). Version 3.3 adds histogram statistics and struct‑type sub‑column stats, providing richer inputs for optimal plan generation.
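Statistics collection for an external table can also be triggered manually; a sketch using StarRocks' ANALYZE statement (catalog, database, and table names are hypothetical):

```sql
-- Collect statistics on an external Iceberg table for the CBO.
ANALYZE TABLE iceberg_catalog.sales_db.orders;
```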
File‑Format Specific Optimizations
Leverages Parquet and ORC footers and dictionary information to reduce scan volume.
Implements adaptive I/O merging to lower remote I/O overhead.
Extends connector framework to support Avro, SequenceFile, and RCFile.
Data Cache and Materialized Views
Data Cache stores hot data blocks locally (memory + disk) and serves subsequent queries directly from cache, eliminating remote I/O for repeated accesses.
Materialized Views pre‑compute complex queries and refresh incrementally based on partition changes, offering near‑real‑time performance without user‑managed pipelines.
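An asynchronously refreshed materialized view over an external Iceberg table can be sketched as follows (catalog, table, and column names are hypothetical):

```sql
-- Pre-aggregate orders by day; StarRocks refreshes changed partitions asynchronously.
CREATE MATERIALIZED VIEW orders_daily_mv
PARTITION BY dt
REFRESH ASYNC
AS SELECT dt, sum(amount) AS total_amount
   FROM iceberg_catalog.sales_db.orders
   GROUP BY dt;
```

Queries against the base table that match the aggregation can be transparently rewritten to hit the view.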
Real‑World Deployment: WeChat Lake‑Warehouse Integration
WeChat adopted a lake‑warehouse architecture using StarRocks + Iceberg to achieve:
Data freshness improved from hour‑/day‑level to minute‑level.
Query latency reduced from minutes to seconds for 80 % of large queries.
Seamless migration from Presto + Hive to StarRocks + Iceberg, cutting storage costs by over 65 % and halving development tasks.
The solution combines external table querying, distributed manifest processing, metadata caching, and materialized views to deliver a unified, high‑performance analytics platform.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
