Big Data 16 min read

How Alibaba Cloud EMR StarRocks Supercharges Data Lake Analytics with Advanced Optimizations

This article explains how Alibaba Cloud EMR StarRocks extends data lake analytics to support Hive, Iceberg, and Hudi, detailing its architecture, Iceberg integration, performance gains over Trino, IO merging, lazy materialization, intelligent caching, and elastic compute capabilities for faster, unified, and cost‑effective queries.

StarRocks
StarRocks
StarRocks
How Alibaba Cloud EMR StarRocks Supercharges Data Lake Analytics with Advanced Optimizations

Background and Goals

Since 2021, Alibaba Cloud EMR OLAP team and the StarRocks community have collaborated to enable StarRocks to analyze data stored not only in its native storage but also in external data lakes such as Apache Hive, Iceberg, and Hudi. The aim is to provide an ultra‑fast, unified analysis experience for enterprise workloads.

Overall Architecture

StarRocks runs on Alibaba Cloud EMR with OSS as the unified object storage for Parquet, ORC, CSV, etc. The Data Lake Framework (DLF) manages metadata and builds the lake. To overcome OSS‑HDFS performance gaps, EMR introduces the Jindo FS system for accelerated access.

In the data development layer, StarRocks uses fixed BE nodes and elastic Compute Nodes (CN) that share the same execution engine. CNs can be deployed on Kubernetes and scale dynamically via HPA.

StarRocks overall architecture
StarRocks overall architecture

Iceberg Integration

StarRocks adds an external table type IcebergTable on the FE side and extends the Thrift RPC protocol to pass execution‑plan information to BE, where an HDFS scanner reads the data. This enables direct querying of Iceberg tables without materializing them in StarRocks.

Iceberg implementation
Iceberg implementation

Performance Comparison

Benchmarks using TPCH and Trino show that StarRocks outperforms Trino by several times, especially on Hudi datasets. The performance advantage stems from StarRocks' vectorized execution engine and new optimizer rules tailored for data‑lake workloads.

Key Optimizations

Predicate Push‑Down : Filters such as col_a > x are pushed to the scan operator, reducing scanned data.

Partition Pruning : Unnecessary partitions are eliminated at the FE, limiting BE scans to relevant partitions.

IO Merging : ColumnReader merges small column reads and small row‑group reads, decreasing network round‑trips.

Lazy Materialization : Columns without predicates are read lazily as LazyColumn, avoiding unnecessary IO.

Intelligent Caching

To mitigate metadata‑list latency, StarRocks implements fine‑grained caching of Hive partition and file metadata on the FE. The cache updates via an event‑driven model, and statistics are also cached to speed up planning.

Metadata cache
Metadata cache

Elastic Compute

StarRocks introduces stateless Compute Nodes (CN) that share the same BE code but run without local storage. CNs can be autoscaled in Kubernetes, providing elastic capacity while keeping data in OSS or HDFS.

Elastic compute nodes
Elastic compute nodes

Resource Isolation

StarRocks uses ResourceGroup to enforce CPU, memory, and IO limits per user, query, or IP, enabling soft isolation for multi‑tenant workloads without dedicated hardware.

Future Roadmap

The roadmap focuses on four pillars for data‑lake analytics: Single Source of Truth, high performance (sub‑second latency), elasticity, and cost‑effectiveness. Planned enhancements include tiered caching (memory + local disk), automatic eviction of cold data, and native materialized views that provide transparent acceleration and simplify data‑lake‑to‑warehouse workflows.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performance optimizationStarRocksdata lakeIcebergEMRElastic Compute
StarRocks
Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.