How We Built a Scalable Lakehouse Architecture with StarRocks, Paimon, and Flink
This article details the evolution of RenliJia's data warehouse from a MaxCompute-centric setup to a modern lakehouse built on StarRocks, Paimon, Flink, and Fluss. It covers the design goals, the technical evaluation, the implementation of offline, OLAP, and real-time workloads, and the challenges and future plans that emerged along the way.
Author: Shiyu Yang (Thorne), Flink‑CDC Contributor, Senior Data Engineer at RenliJia.
1. Early Data Warehouse
Initially the internal data warehouse relied on MaxCompute for batch processing and DataWorks for ETL, with Flink VVR handling real‑time calculations. The system covered finance, operations, APP, CRM, GTM, and CS domains, but suffered from data silos, low freshness (T+1), costly repairs, and closed data formats.
2. Lakehouse Evolution
As new business lines demanded faster analytics, the existing stack could not meet the OLAP performance required by the HCM product. StarRocks was selected for its MPP architecture, Pipeline engine, CBO optimizer, and lakehouse design. Modeling used view tables, async materialized views, and generated columns to accelerate OLAP queries.
3. Final Lakehouse Shape
After a year of development, a sustainable, extensible lakehouse was built with open data formats, allowing any compute engine (Spark, Flink, StarRocks) to read the same data. The architecture supports seamless switching between batch, streaming, and incremental modes without data migration.
4. Lakehouse Evaluation
We compared several solutions:
MaxCompute – large‑scale batch but limited lake capabilities and tight cloud vendor lock‑in.
StarRocks – strong OLAP but still maturing for massive offline tasks.
Iceberg, Hudi, Delta Lake – solid lake formats but weaker streaming support.
Paimon – good for both batch and streaming, yet struggles with minute‑level freshness.
Multimodal data (AI) – requires additional support for vector search.
5. Technical Selection Criteria
Key factors were:
Support for both batch and streaming workloads.
Ability to handle diverse compute modes (partial column updates, aggregation tables, indexes).
Broad ecosystem compatibility (Flink, Spark, StarRocks).
Active open‑source community.
Data freshness (integration with Fluss to compensate for Paimon latency).
Multimodal data handling.
We chose Paimon as the core lake format, Fluss for low‑latency ingestion, and StarRocks for OLAP, all deployed via Alibaba Cloud’s serverless DLF service.
5.1 Offline Processing (DLF + MaxCompute)
SET odps.namespace.schema = true; -- Enable the three-layer data access model
INSERT OVERWRITE TABLE dwd_user_basic
SELECT corp_id, user_id, active, status, union_id, id,
       CAST(gmt_create AS DATETIME) AS gmt_create,
       CAST(gmt_modified AS DATETIME) AS gmt_modified
FROM paimon_catalog.xxdb_prod.user_basic_table
WHERE active = 1;

We bypassed the previous ETL jobs and read directly from the DLF data standard layer, leveraging MaxCompute's three-layer model to replace the ODS layer with minimal changes.
5.1.1 Aggregation Table (Agg Table)
To reduce the cost of periodic cumulative tables, we migrated the Agg Table logic to Paimon, cutting the cost of high-frequency event aggregation by 70%. Daily T+1 calculations now process only incremental data, and a Python node in DataWorks sleeps for ten minutes before triggering downstream scheduling, reserving time for Paimon compaction to ensure data consistency.
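As a sketch of the approach (table and column names are illustrative, and we assume a Paimon catalog is active in the Flink SQL session), an aggregation table lets Paimon merge rows that share a primary key using per-field aggregate functions:

CREATE TABLE event_agg (
    corp_id BIGINT,
    event_date STRING,
    pv BIGINT,
    click_cnt BIGINT,
    PRIMARY KEY (corp_id, event_date) NOT ENFORCED
) WITH (
    'merge-engine' = 'aggregation',              -- merge rows with the same key
    'fields.pv.aggregate-function' = 'sum',      -- accumulate counters on write
    'fields.click_cnt.aggregate-function' = 'sum'
);

The DataWorks Python node that holds back downstream scheduling is then as simple as: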
import time
# Reserve compaction delay for Paimon
time.sleep(60 * 10)
print("Sleep 10 minutes finished")5.2 OLAP Scenario (DLF + StarRocks)
The lakehouse enables seamless integration of multiple data sources without binding to a single engine. Early modeling used async materialized view tables, which introduced latency (up to 15 minutes). Switching to logical view tables reduced latency by querying directly against the ODS layer.
5.2.1 Modeling Process
View tables store only DQL statements, keeping the model stateless and low‑cost. However, queries on view tables still scan the underlying ODS tables, which can be large and affect performance.
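As a minimal sketch (names are illustrative, and we assume the Paimon catalog is already mounted in StarRocks), a logical view is nothing more than a stored query:

-- The view persists only this DQL statement, no data
CREATE VIEW dwd_user_basic_view AS
SELECT corp_id, user_id, active, status,
       CAST(gmt_create AS DATETIME) AS gmt_create
FROM paimon_catalog.xxdb_prod.user_basic_table
WHERE active = 1;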
5.2.2 Materialized Views / Generated Columns
We accelerated OLAP queries by creating materialized views refreshed every ten minutes and generated columns that extract frequently accessed JSON fields, achieving query speeds comparable to native columns.
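A sketch of the ten-minute async materialized view (illustrative names; we assume StarRocks' REFRESH ASYNC EVERY syntax over the Paimon external catalog):

CREATE MATERIALIZED VIEW mv_user_daily
REFRESH ASYNC EVERY (INTERVAL 10 MINUTE)
AS
SELECT corp_id, date_trunc('day', gmt_create) AS dt, COUNT(*) AS user_cnt
FROM paimon_catalog.xxdb_prod.user_basic_table
GROUP BY corp_id, date_trunc('day', gmt_create);

The generated column that pulls a hot JSON field into a real column is declared like this: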
ALTER TABLE ods_hrm_basic
ADD COLUMN dept_name STRING AS get_json_string(data, '$.dept_name') COMMENT 'Department Name';

SELECT corp_id, dept_name FROM ods_hrm_basic;

5.2.3 Transparent View Rewrite
StarRocks automatically rewrites queries to use materialized views when possible, providing cost‑effective acceleration without user intervention.
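For instance, the aggregation query below (reusing the hypothetical mv_user_daily from above) continues to target the base table, yet StarRocks can serve it from the materialized view when the rewrite applies:

SELECT corp_id, date_trunc('day', gmt_create) AS dt, COUNT(*) AS user_cnt
FROM paimon_catalog.xxdb_prod.user_basic_table
GROUP BY corp_id, date_trunc('day', gmt_create);
-- EXPLAIN should show a scan of mv_user_daily rather than the Paimon table.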
5.3 Real‑Time Computing (DLF + Flink + Fluss)
We adopted Flink-CDC-Yaml to capture changes from PolarDB/MySQL into Paimon, separating CDC transport from the downstream merging handled by DLF. The approach gives us lightweight development, schema change propagation, route handling for sharded tables, and exactly-once guarantees. A minimal pipeline definition looks like this (the sample below sinks to StarRocks; a Paimon sink is declared the same way with its own connector options):
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: "123456"
  tables: "app_db\\..*"
  server-id: "5400-5404"
  server-time-zone: UTC

sink:
  type: starrocks
  name: StarRocks Sink
  jdbc-url: "jdbc:mysql://127.0.0.1:9030"
  load-url: "127.0.0.1:8080"
  username: root
  password: ""
  table.create.properties.replication_num: 1

pipeline:
  name: Sync MySQL Database to StarRocks
  parallelism: 2

Fluss complements Paimon by providing minute-level freshness: Flink-CDC streams data into Fluss, which then writes it into Paimon. For ad-hoc queries, we use Alibaba Cloud EMR-Serverless-StarRocks with Union Read to query Fluss tables directly.
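As a rough sketch of the Fluss side (the catalog name, bootstrap address, and table schema below are illustrative assumptions, not our production setup), a low-latency table can be declared from Flink SQL and tiered into Paimon:

-- Register a Fluss catalog in Flink SQL (address is a placeholder)
CREATE CATALOG fluss_catalog WITH (
    'type' = 'fluss',
    'bootstrap.servers' = 'fluss-server:9123'
);

USE CATALOG fluss_catalog;

-- A primary-key table; enabling the datalake option asks Fluss to
-- tier its data into Paimon so batch engines see the same rows
CREATE TABLE user_events (
    corp_id BIGINT,
    user_id BIGINT,
    event_time TIMESTAMP(3),
    PRIMARY KEY (corp_id, user_id) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true'
);

Union Read then lets StarRocks answer one query from both layers: fresh rows still held in Fluss plus the compacted history already tiered into Paimon.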
6. Current Achievements and Issues
Key benefits:
Single source of truth eliminates data duplication.
Open data formats decouple storage from compute.
Unified data supports batch, streaming, and incremental workloads.
Scalable architecture accommodates multimodal AI data.
Freshness achieved through flexible compute mode switching.
Open problems include predicate push‑down limitations in MaxCompute, inconsistent mapper counts, uncertain DLF merge latency, DLF’s lack of consumer management for Paimon tables, and missing support for AGG tables and RoaringBitmap in Fluss.
7. Future Plans
We intend to replace Kafka with Fluss once its Kafka compatibility is stable, integrate Lance as an AI‑focused data format, and enable Fluss to support AGG tables and RoaringBitmap for low‑cost UV calculations. StarRocks’ compute‑storage separation will provide hard resource isolation for AI workloads, and we will continue to pursue incremental computation capabilities within StarRocks.
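For context on the RoaringBitmap point (a hedged sketch using StarRocks syntax; the events table and column names are hypothetical), this is the kind of low-cost UV computation we want Fluss to feed natively:

-- Aggregate-model table that merges user IDs into a bitmap per key
CREATE TABLE uv_agg (
    dt DATE,
    corp_id BIGINT,
    user_bitmap BITMAP BITMAP_UNION
) AGGREGATE KEY (dt, corp_id)
DISTRIBUTED BY HASH(corp_id);

-- Fold integer user IDs into the bitmap (events is hypothetical)
INSERT INTO uv_agg
SELECT dt, corp_id, TO_BITMAP(user_id) FROM events;

-- UV per day is a cheap bitmap cardinality, not a COUNT(DISTINCT)
SELECT dt, BITMAP_UNION_COUNT(user_bitmap) AS uv
FROM uv_agg GROUP BY dt;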