Mastering StarRocks & Apache Paimon: A Fast‑Track Lakehouse Guide
This guide provides a comprehensive overview of Apache Paimon’s architecture, key features, and advantages, explains how to integrate it with StarRocks for real‑time lakehouse analytics, and walks through a complete quick‑start setup including component installation, Flink and Kafka deployment, data ingestion, table creation, and query execution with time‑travel support.
Apache Paimon Overview
Apache Paimon is an ACID‑compliant data‑lake storage system that originated as a sub‑project of Apache Flink and graduated to a top‑level Apache project in 2024. It stores data using an LSM‑Tree layout combined with lake‑format files, enabling efficient real‑time updates and compaction while supporting both batch and streaming workloads.
Paimon Architecture & Key Features
Unified batch and streaming: A single storage format serves both processing paradigms.
Schema evolution: Table schemas can be altered without full data rewrites (see the sketch after this list).
ACID transactions: Guarantees atomicity, consistency, isolation, and durability.
Time travel: Historical snapshots can be queried for auditing or debugging.
Ecosystem integration: Native connectors for Flink, Spark, and Hive.
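As a quick illustration of schema evolution, a minimal Flink SQL sketch, assuming Paimon's documented ALTER TABLE syntax; the orders table here is hypothetical:
-- Add a nullable column; existing data files are not rewritten.
ALTER TABLE orders ADD (discount DECIMAL(5, 2));
-- Rename a column; old files are resolved through schema metadata.
ALTER TABLE orders RENAME order_amount TO amount;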
Table Models
Primary Key Table: Supports insert, update, and delete; rows with the same primary key are merged automatically (see the DDL sketch after this list).
Append Table: No primary key; behaves like a log-type table where duplicate rows are retained.
Append Queue: A special Append Table that enforces strict ordering per bucket, similar to a Kafka partition, suitable for pipeline and streaming use cases.
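A minimal Flink SQL sketch of the first two models (table and column names are illustrative, and the statements are assumed to run inside a Paimon catalog):
-- Primary Key table: rows with the same order_id are merged (last write wins by default).
CREATE TABLE orders_pk (
    order_id BIGINT,
    amount DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
);
-- Append table: no primary key; every row is retained, like a log.
CREATE TABLE orders_log (
    order_id BIGINT,
    amount DECIMAL(10, 2)
);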
Time Travel
Implemented via snapshot files. Users can query a table as of a specific snapshot ID using the OPTIONS('scan.snapshot-id' = 'N') hint.
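Continuing with the hypothetical orders_pk table from the sketch above, valid snapshot IDs can first be listed from the table's snapshots system table before pinning a query to one:
-- List committed snapshots and their commit times.
SELECT snapshot_id, commit_time FROM orders_pk$snapshots;
-- Then read the table as of a specific snapshot.
SELECT * FROM orders_pk /*+ OPTIONS('scan.snapshot-id' = '2') */;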
Compaction Strategy
Paimon's LSM tree adopts a compaction approach modeled on RocksDB-style universal compaction. The two classic strategies it draws on:
Leveled compaction (RocksDB's default), which keeps read and space amplification low at the cost of more write work.
Size-tiered compaction, which merges sorted runs of similar size to reduce write amplification.
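Compaction behavior is tunable per table. A hedged sketch, assuming Paimon's sorted-run trigger options keep their documented names, and reusing the hypothetical orders_pk table:
-- Allow more sorted runs before compaction triggers: cheaper writes, costlier reads.
ALTER TABLE orders_pk SET ('num-sorted-run.compaction-trigger' = '10');
-- Writers pause once this many sorted runs accumulate, bounding read amplification.
ALTER TABLE orders_pk SET ('num-sorted-run.stop-trigger' = '20');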
StarRocks × Paimon Integration
Supports HDFS and object storage (S3/OSS).
Metadata management via Hive Metastore (HMS) and Alibaba Cloud DLF.
Query both Primary Key and Append‑Only tables.
Access to Paimon system tables (e.g., read‑optimized, snapshots).
Cross‑format joins between Paimon and other lake formats.
Cross‑engine joins between Paimon tables and StarRocks internal tables.
Data cache acceleration.
Materialized view creation on Paimon tables for transparent query acceleration (see the sketch after this list).
Deletion-vector support, which keeps queries fast on tables with row-level deletes.
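A sketch of the materialized-view path (the refresh interval and view name are illustrative; the catalog matches the one created later in this guide):
-- Async materialized view in StarRocks that pre-aggregates a Paimon table.
CREATE MATERIALIZED VIEW sales_by_region_mv
REFRESH ASYNC EVERY (INTERVAL 10 MINUTE)
AS
SELECT region, SUM(total_sales) AS region_sales
FROM paimon_catalog_oss.`default`.hourly_regional_sales
GROUP BY region;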
Quick‑Start Setup
Component Versions
StarRocks: 3.3.0
Flink: 1.19.1
Paimon: 0.8.2
Kafka: 3.7.0
Download & Install
wget "http://mirrors.cloud.aliyuncs.com/apache/flink/flink-1.19.1/flink-1.19.1-bin-scala_2.12.tgz"
tar -xf flink-1.19.1-bin-scala_2.12.tgz
wget "https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-flink-1.19/0.8.2/paimon-flink-1.19-0.8.2.jar"
wget "https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-oss/0.8.2/paimon-oss-0.8.2.jar"
wget "https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.7.5-10.0/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar"
wget "https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/3.2.0-1.18/flink-sql-connector-kafka-3.2.0-1.18.jar"
wget "https://github.com/StarRocks/starrocks-connector-for-apache-flink/releases/download/v1.2.9/flink-connector-starrocks-1.2.9_flink-1.18.jar"
cp paimon-flink-1.19-0.8.2.jar paimon-oss-0.8.2.jar flink-shaded-hadoop-2-uber-2.7.5-10.0.jar flink-connector-starrocks-1.2.9_flink-1.18.jar flink-sql-connector-kafka-3.2.0-1.18.jar flink-1.19.1/lib/
Start Flink Cluster
cd flink-1.19.1
# Edit conf/config.yaml and set numberOfTaskSlots: 10
./bin/start-cluster.sh
Kafka Deployment
wget "http://mirrors.cloud.aliyuncs.com/apache/kafka/3.7.0/kafka_2.12-3.7.0.tgz"
tar -xf kafka_2.12-3.7.0.tgz
cd kafka_2.12-3.7.0
./bin/zookeeper-server-start.sh -daemon ./config/zookeeper.properties
./bin/kafka-server-start.sh -daemon ./config/server.properties
Demo Data Generation
./bin/kafka-topics.sh --create --topic order-details --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
from kafka import KafkaProducer  # pip install kafka-python
import time, json, random
from datetime import datetime, timedelta

# Generate one synthetic order every 3 seconds and publish it to Kafka.
start_time = datetime(2024, 7, 24, 15, 0, 0)
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
try:
    while True:
        order_id = random.randint(1, 10000)
        user_id = random.randint(1, 8)
        order_amount = round(random.uniform(10.0, 1000.0), 2)
        # Spread order times across a one-hour window after start_time.
        random_time = start_time + timedelta(seconds=random.randint(0, 3600))
        data = {
            "order_id": order_id,
            "user_id": user_id,
            "order_amount": order_amount,
            "order_time": random_time.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3],
        }
        producer.send('order-details', value=json.dumps(data).encode('utf-8'))
        time.sleep(3)
finally:
    # Flush and close the producer on interrupt.
    producer.close()
Create Tables & Load Data
CREATE TABLE `users` (
`user_id` BIGINT NOT NULL,
`region` VARCHAR(65533) NULL
) ENGINE=OLAP PRIMARY KEY(`user_id`) DISTRIBUTED BY HASH(`user_id`);
INSERT INTO users VALUES (1,'BeiJing'),(2,'TianJin'),(3,'XiAn'),(4,'ShenZhen'),(5,'BeiJing'),(6,'BeiJing'),(7,'ShenZhen'),(8,'ShenZhen');
Then, in the Flink SQL client, create a Paimon catalog backed by OSS:
CREATE CATALOG my_catalog_oss WITH (
'type' = 'paimon',
'warehouse' = 'oss://starrocks-public/dba/jingdan/paimon',
'fs.oss.endpoint' = 'oss-cn-zhangjiakou-internal.aliyuncs.com',
'fs.oss.accessKeyId' = 'ak',
'fs.oss.accessKeySecret' = 'sk'
);
USE CATALOG my_catalog_oss;
CREATE TABLE hourly_regional_sales (
event_time TIMESTAMP(3),
region STRING,
total_sales DECIMAL(10,2)
);
USE CATALOG default_catalog;
CREATE TABLE orders_kafka (
order_id BIGINT,
user_id BIGINT,
order_amount DECIMAL(10,2),
order_time TIMESTAMP(3),
WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'order-details',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'order-consumer',
'format' = 'json',
'scan.startup.mode' = 'latest-offset'
);
CREATE TABLE users_starrocks (
user_id BIGINT,
region STRING
) WITH (
'connector'='starrocks',
'scan-url'='172.26.92.154:8030',
'jdbc-url'='jdbc:mysql://172.26.92.154:9030',
'username'='root',
'password'='xxx',
'database-name'='jd',
'table-name'='users'
);
SET 'execution.checkpointing.interval' = '10 s';
INSERT INTO my_catalog_oss.`default`.hourly_regional_sales
SELECT TUMBLE_START(order_time, INTERVAL '5' MINUTE) AS event_time,
u.region,
CAST(SUM(o.order_amount) AS DECIMAL(10,2)) AS total_sales
FROM default_catalog.`default_database`.orders_kafka AS o
JOIN default_catalog.`default_database`.users_starrocks AS u ON o.user_id = u.user_id
GROUP BY TUMBLE(order_time, INTERVAL '5' MINUTE), u.region;
Query & Time-Travel
SELECT * FROM my_catalog_oss.`default`.hourly_regional_sales;
SET 'execution.runtime-mode' = 'batch';
SELECT * FROM hourly_regional_sales /*+ OPTIONS('scan.snapshot-id' = '2') */;
SELECT * FROM hourly_regional_sales /*+ OPTIONS('incremental-between' = '5,10') */;
Create StarRocks Paimon Catalog
CREATE EXTERNAL CATALOG paimon_catalog_oss
PROPERTIES (
"type" = "paimon",
"paimon.catalog.type" = "filesystem",
"paimon.catalog.warehouse" = "oss://starrocks-public/dba/jingdan/paimon",
"aliyun.oss.access_key" = "ak",
"aliyun.oss.secret_key" = "sk",
"aliyun.oss.endpoint" = "oss-cn-zhangjiakou-internal.aliyuncs.com"
);
SET CATALOG paimon_catalog_oss;
USE `default`;
SELECT * FROM hourly_regional_sales;
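Cross-engine joins work the same way. A minimal sketch, assuming the StarRocks-internal users table created earlier lives in database jd:
-- Join the Paimon table with a StarRocks internal table in one statement.
SELECT s.event_time, s.region, s.total_sales
FROM paimon_catalog_oss.`default`.hourly_regional_sales AS s
JOIN (SELECT DISTINCT region FROM default_catalog.jd.users) AS u
  ON s.region = u.region;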
StarRocks
StarRocks is an open-source project under the Linux Foundation focused on building a high-performance, scalable analytical database that lets enterprises adopt an efficient, unified lakehouse architecture. It is widely used across industries worldwide to strengthen companies' data analytics capabilities.