
Building a High‑Performance, Cost‑Effective Cloud Lakehouse with StarRocks and EMR

This article explains how to design and implement a cloud‑native Lakehouse using StarRocks and Tencent Cloud EMR, covering core technical requirements, a five‑layer architecture, data ingestion with Iceberg/Hudi, performance techniques such as Z‑order clustering, cost control through elastic scaling, and the key product features of EMR StarRocks.


Lakehouse Core Objectives

Single source of truth – eliminate data duplication across analytical workloads.

Low storage cost – store massive datasets in cheap, durable object storage.

Unified architecture – reduce operational complexity and maintenance overhead.

Efficient compute & I/O – support incremental updates to avoid costly full‑table reprocessing in ETL.

Technical Foundations

Unified data format (Apache Iceberg or Apache Hudi) providing ACID transactions and multi‑version control.

Unified execution engine: Spark for batch ETL, StarRocks for low‑latency analytical queries.

Unified data management: metadata, quality, lineage, indexing, snapshot & savepoint handling.

Unified storage: cloud object storage (e.g., COS, CHDFS) with high durability and pay‑as‑you‑go pricing.
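These foundations come together when a table format is created directly on object storage. The sketch below shows a hypothetical Iceberg table on COS via Spark SQL; the catalog name "lake", schema, and bucket path are illustrative assumptions, not from the article:

```sql
-- Hypothetical names; assumes a Spark session with an Iceberg catalog
-- called "lake" whose warehouse lives on Tencent COS object storage.
CREATE TABLE lake.ods.user_events (
  event_id   BIGINT,
  user_id    BIGINT,
  event_time TIMESTAMP,
  payload    STRING
) USING iceberg
PARTITIONED BY (days(event_time))          -- Iceberg hidden-partition transform
LOCATION 'cosn://demo-bucket/warehouse/ods/user_events';
```

Because Iceberg tracks partitions through transforms rather than directory layout, the same table serves Spark batch ETL and StarRocks queries without duplicating data.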

Five‑Layer Cloud Lakehouse Architecture

Compute resource layer – CVM, bare‑metal or containers provide elastic compute capacity.

Cloud storage layer – object storage, cloud HDFS or file storage for cheap, durable data.

Data‑lake storage layer (table format) – Iceberg or Hudi tables optimized for cloud storage.

Unified compute engine layer – Spark handles batch ETL; StarRocks executes fast analytical queries.

Unified data‑management layer – Wedata for governance; EMR Lake Service for small‑file merging, snapshot cleanup, clustering and index management.

Data Ingestion & Table Formats

Three data categories are supported:

Transactional DB data (CRM, ERP, etc.) – ingest via EMR Sqoop or Spark.

Log data – ingest via EMRFlow, Message Service or DataInLong.

Time‑series / IoT data – ingest via EMR, Spark or Oceanus streams.

Iceberg and Hudi are the primary table formats. EMR Lake Service automates:

Small‑file merging.

Snapshot / savepoint cleanup.

Data clustering (Z‑order) and index creation.
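EMR Lake Service automates these jobs, but the same maintenance can be expressed with Iceberg's built‑in Spark procedures. A sketch, with placeholder catalog/table names and an illustrative retention timestamp:

```sql
-- Compact small files and re-cluster them with a Z-order sort
-- (Iceberg rewrite_data_files procedure; names are placeholders).
CALL lake.system.rewrite_data_files(
  table      => 'ods.user_events',
  strategy   => 'sort',
  sort_order => 'zorder(user_id, event_time)'
);

-- Remove snapshots older than the retention window to reclaim storage.
CALL lake.system.expire_snapshots(
  table      => 'ods.user_events',
  older_than => TIMESTAMP '2023-01-01 00:00:00'
);
```

Running compaction and snapshot expiry on a schedule is what keeps object‑storage file counts, and therefore query planning time, under control.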

Performance Optimizations

StarRocks applies several techniques to accelerate queries on cloud storage:

Partition pruning and predicate pushdown at the planner stage.

Cost‑Based Optimizer (CBO) using column statistics.

Vectorized execution engine with SIMD instructions.

Native ORC reader for efficient columnar reads.

Data is clustered using Z‑order (a multi‑dimensional space‑filling curve) to co‑locate related rows, dramatically reducing the number of scanned files. In a benchmark on a 1.78 TB dataset (≈210 M rows, 200 CPU cores), query time dropped from 1030 s to 29 s (≈35× speedup) and scanned data fell to ~40 % of the original volume.
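The benefit shows up on queries that filter on several clustered dimensions at once: file‑level min/max statistics let the planner skip most files. A hypothetical query against a table Z‑ordered on (order_date, amount); names are illustrative:

```sql
-- Both predicates hit Z-ordered columns, so most data files are pruned
-- before any I/O happens (table and columns are placeholders).
SELECT region, SUM(amount) AS total
FROM sales
WHERE order_date BETWEEN DATE '2023-06-01' AND DATE '2023-06-07'
  AND amount > 1000
GROUP BY region;
```

With random row placement the same predicates would touch nearly every file, since each file's min/max ranges span the whole domain of both columns.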

Performance comparison of Z‑order vs. random scan

StarRocks Architecture Details

StarRocks consists of two process types:

FE (Front‑End) – manages metadata, client connections, query planning and partition pruning. FE nodes operate in a leader‑follower model; the leader handles metadata writes while followers serve read requests.

BE (Back‑End) – stores columnar data and executes SQL. BE uses a vectorized engine, SIMD acceleration, and supports pluggable storage adapters (object storage, CHDFS, etc.).

Additional cloud‑native optimizations:

LocalCache – block‑level cache on BE nodes to reduce object‑storage latency.

Materialized Views – pre‑compute frequent result sets; FE rewrites queries to use the materialized view.
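A sketch of the materialized‑view feature using StarRocks' asynchronous materialized‑view DDL; the table, schedule, and aggregation are illustrative assumptions:

```sql
-- Pre-aggregate daily sales per region; the FE transparently rewrites
-- matching queries to read from this view (names are placeholders).
CREATE MATERIALIZED VIEW daily_region_sales
REFRESH ASYNC EVERY (INTERVAL 1 HOUR)
AS
SELECT
  region,
  date_trunc('day', order_date) AS sale_day,
  SUM(amount)                   AS total_amount
FROM sales
GROUP BY region, date_trunc('day', order_date);
```

Query rewrite means applications keep issuing the original aggregation query; only the FE's plan changes, which makes materialized views easy to add and drop without touching clients.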

StarRocks feature overview

Cost Control & Elastic Scaling

By separating storage (object storage) from compute (FE/BE or CN nodes), only a minimal set of BE nodes needs to retain intermediate results, while stateless compute nodes scale elastically on spot CVM instances or in containers.

EMR provides three elasticity modes:

Managed auto‑scaling – set min/max node counts; the system automatically adjusts capacity based on load.

Rule‑based auto‑scaling – define scaling rules (e.g., CPU usage, query queue length) or time‑based schedules.

Manual scaling with graceful shrinkage – manually add/remove nodes; EMR gracefully migrates data and avoids query failures.

EMR Lake Service Capabilities

Automated small‑file merging to avoid the “small file problem”.

Snapshot and savepoint management for point‑in‑time data recovery.

Clustering (Z‑order) and index creation to improve query pruning.

Migration tool to convert legacy Hive tables to Iceberg.
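The article does not detail how EMR's migration tool works internally; one plausible building block is Iceberg's own Spark migration procedures, sketched here with placeholder names:

```sql
-- Non-destructive first step: create an Iceberg table that references the
-- Hive table's existing files, leaving the original untouched for validation.
CALL lake.system.snapshot('hive_db.legacy_sales', 'lake.ods.legacy_sales_iceberg');

-- In-place conversion: replaces the Hive table with an Iceberg table
-- once the snapshot copy has been validated.
CALL lake.system.migrate('hive_db.legacy_sales');
```

Snapshotting before migrating gives a rollback path, which matters when downstream jobs still read the Hive table during the cutover.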

Iceberg Z‑Order Support in EMR

EMR’s Iceberg implementation extends Z‑order to primitive types (int, string, date, varchar) and complex types (map, struct). Users can create Z‑order indexes via SQL, e.g.:

CREATE TABLE sales (
  order_id BIGINT,
  region STRING,
  order_date DATE,
  amount DOUBLE
) USING iceberg
PARTITIONED BY (region)
ZORDER BY (order_date, amount);

Performance tests on 10 × 8‑core, 32 GB RAM instances with SSD disks show 10× query speedup compared with random scans and 2‑5× improvement over simple small‑file merging.

Z‑order clustering illustration

Savepoint Feature

Savepoint marks a set of files as immutable snapshots, preventing cleanup processes from deleting them. This enables point‑in‑time recovery and supports “stream‑batch” fusion: Spark, Hive and StarRocks can all read data as of a specific savepoint without changing the upstream pipeline.
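For Hudi tables, savepoints can also be managed directly through Hudi's Spark SQL procedures. A sketch; the table name and commit timestamp are placeholders:

```sql
-- Pin a specific commit so cleaning/compaction never deletes its files
-- (Hudi create_savepoint procedure; values are illustrative).
CALL create_savepoint(table => 'hudi_db.orders', commit_time => '20230601120000');

-- Inspect which commits are currently pinned for the table.
CALL show_savepoints(table => 'hudi_db.orders');
```

Because the pinned files are immutable, any engine that understands the table format can read the savepointed state, which is what enables the stream‑batch fusion described above.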

Savepoint workflow

EMR StarRocks Product Features

Cluster operations – create, expand, shrink, and upgrade clusters via a unified UI; supports version self‑upgrade.

Cluster management – centralized FE/BE configuration, custom config files, and multi‑cloud data source integration.

Process management – secure start/stop, health monitoring, and advanced operations.

APM & monitoring – end‑to‑end metrics, alerts, log search, and automated health checks.

EMR StarRocks feature overview
Tags: Big Data, cloud computing, StarRocks, Iceberg, lakehouse, Hudi, EMR
Written by StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
