How StarRocks and Iceberg Enable Federated Queries: A Practical Walkthrough
This article details Fresha's real‑world integration of StarRocks with Apache Iceberg, covering metadata planning, distributed execution, adaptive metadata retrieval, hot‑cold data layering, missing statistics handling, catalog configuration, and performance optimizations that together demonstrate how federated queries can be efficiently executed over data‑lake tables.
Data lakes are a key scenario for query federation; with compute‑storage separation, many tables are stored in object storage using Apache Iceberg.
Iceberg query performance depends on efficient metadata lookup and predicate push‑down/partition pruning. As file count grows, small files increase latency, so compaction is essential.
Compared with Trino, which reads Iceberg directly from object storage, StarRocks reduces remote I/O through a metadata cache, local execution, materialized views and compaction.
StarRocks also supports hot‑cold data layering: recent data is written to native StarRocks tables for real‑time analysis, while historical data stays in Iceberg and is accessed on demand, lowering storage cost and latency.
In Fresha’s practice, connecting an external Iceberg catalog to StarRocks gave two main benefits: (1) reuse of Snowflake‑exported Iceberg data models as a unified SQL entry point, and (2) reduced query latency for real‑time analytics via hot‑cold layering.
StarRocks & Iceberg Integration
StarRocks can query Iceberg tables as native tables. The query flow has two phases: metadata planning and distributed execution.
Metadata Planning
Read the current snapshot ID to locate the table metadata.
Read the manifest list and retrieve manifest files.
Prune files based on metadata, filtering out irrelevant data files.
Generate an execution plan and dispatch it to compute nodes (CN).
Distributed Execution
Each compute node reads assigned data files from remote storage.
StarRocks uses a vectorized columnar engine for reading.
Predicates that cannot be handled in metadata planning (e.g., non‑partition columns) are pushed down further to the storage layer.
Position‑delete and equality‑delete files are processed to satisfy Iceberg’s transactional consistency.
To improve efficiency, StarRocks caches Iceberg metadata in memory and on local disk with LRU eviction, and introduces Adaptive Metadata Retrieval: large queries distribute manifest reading, decompression and filtering across multiple CN nodes, while small queries reuse cached deserialized objects.
An asynchronous metadata refresh polls the metastore for catalog changes and updates cached metadata for frequently accessed tables; idle tables stop background tasks.
Missing Statistics Issue
Early Fresha tests showed performance anomalies and even CN segmentation faults on Iceberg tables exported from Snowflake because Parquet files lacked page index and null‑value‑count statistics. StarRocks relies on these statistics (configurable via enable_parquet_reader_page_index) for optimization; missing stats caused null‑pointer accesses.
The problem was solved by using Spark as the Iceberg writer, which generates Parquet files with page index by default, allowing StarRocks to keep its page‑index optimization enabled without performance loss.
Iceberg Catalog Configuration
CREATE EXTERNAL CATALOG lakekeeper
PROPERTIES (
"type" = "iceberg",
"aws.s3.region" = "us-east-1",
"aws.s3.endpoint" = "https://s3.us-east-1.amazonaws.com",
"aws.s3.path_style_access" = "false",
"aws.s3.use_aws_sdk_default_behavior" = "true"
); -- Iceberg catalog configuration
-- REST Catalog with vended credentials
"iceberg.catalog.type" = "rest",
"iceberg.catalog.uri" = "{{ ssm_param.lakekeeper_endpoint }}",
"iceberg.catalog.security" = "oauth2",
"iceberg.catalog.oauth2.scope" = "lakekeeper",
"iceberg.catalog.oauth2.server-uri" = "{{ ssm_param.keycloak_token_endpoint }}",
"iceberg.catalog.oauth2.credential" = "{{ ssm_param.client_id }}:{{ ssm_param.client_secret }}",
"iceberg.catalog.rest.nested-namespace-enabled" = "true",
"iceberg.catalog.warehouse" = "{{ environment }}",
-- Metadata cache settings
"enable_iceberg_metadata_cache" = "true",
"enable_iceberg_metadata_disk_cache" = "true",
"iceberg_metadata_cache_expiration_seconds" = "1800",
"iceberg_metadata_memory_cache_expiration_seconds" = "1800"
);After configuration, the external catalog can be queried directly, e.g.:
show databases in lakekeeper;
show tables in lakekeeper.remote_iceberg_data;
SELECT count(*) FROM snowflake__partners_reporting__location_all_time_metrics
WHERE provider_id = 604262;Fresha uses a self‑built tool NorthStar to visualize the execution plan. A sample EXPLAIN VERBOSE output shows the Iceberg scan node, predicate push‑down, manifest statistics, and data‑cache options.
StarRocks fully exploits Iceberg statistics for query optimization.
Irrelevant manifest files are automatically skipped.
Predicate push‑down reduces data transfer.
Early column projection (pruning) is applied.
Data cache is populated during query execution.
These optimizations improve Iceberg query performance, provided the Parquet files contain the required page index.
Other External Catalogs
Beyond Iceberg, StarRocks supports other open table formats and external data sources such as Elasticsearch and Apache Paimon.
Overall, StarRocks delivers the core capabilities needed for a lakehouse architecture: compute‑storage separation, unified access to heterogeneous data sources, hot‑cold data layering, and flexible caching to reduce remote scans and query cost.
This article extracts and translates the practice part of the original post. For background on query federation and design differences between Trino and StarRocks, see the original article: 👉 https://medium.com/fresha-data-engineering/jack-of-all-trades-query-federation-in-modern-olap-databases-8be0fdf2ade9
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
