Mastering StarRocks & Apache Paimon: A Fast‑Track Lakehouse Guide
This guide provides a comprehensive overview of Apache Paimon’s architecture, key features, and advantages, explains how to integrate it with StarRocks for real‑time lakehouse analytics, and walks through a complete quick‑start setup including component installation, Flink and Kafka deployment, data ingestion, table creation, and query execution with time‑travel support.
Apache Paimon Overview
Apache Paimon is an ACID‑compliant data‑lake storage system that originated as a sub‑project of Apache Flink and graduated to a top‑level Apache project in 2024. It stores data using an LSM‑Tree layout combined with lake‑format files, enabling efficient real‑time updates and compaction while supporting both batch and streaming workloads.
Paimon Architecture & Key Features
Unified batch and streaming: A single storage format serves both processing paradigms.
Schema evolution: Table schemas can be altered without full data rewrites (see the sketch after this list).
ACID transactions: Guarantees atomicity, consistency, isolation, and durability.
Time travel: Historical snapshots can be queried for auditing or debugging.
Ecosystem integration: Native connectors for Flink, Spark, and Hive.
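As a quick illustration of schema evolution, a minimal Flink SQL sketch, assuming Paimon's documented ALTER TABLE syntax; the orders table here is hypothetical:
-- Add a nullable column; existing data files are not rewritten.
ALTER TABLE orders ADD (discount DECIMAL(5, 2));
-- Rename a column; old files are resolved through schema metadata.
ALTER TABLE orders RENAME order_amount TO amount;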
Table Models
Primary Key Table: Supports insert, update, and delete; rows with the same primary key are merged automatically (see the DDL sketch after this list).
Append Table: No primary key; behaves like a log-type table where duplicate rows are retained.
Append Queue: A special Append Table that enforces strict ordering per bucket, similar to a Kafka partition, suitable for pipeline and streaming use cases.
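A minimal Flink SQL sketch of the first two models (table and column names are illustrative, and the statements are assumed to run inside a Paimon catalog):
-- Primary Key table: rows with the same order_id are merged (last write wins by default).
CREATE TABLE orders_pk (
    order_id BIGINT,
    amount DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
);
-- Append table: no primary key; every row is retained, like a log.
CREATE TABLE orders_log (
    order_id BIGINT,
    amount DECIMAL(10, 2)
);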
Time Travel
Implemented via snapshot files. Users can query a table as of a specific snapshot ID using the OPTIONS('scan.snapshot-id' = 'N') hint.
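Continuing with the hypothetical orders_pk table from the sketch above, valid snapshot IDs can first be listed from the table's snapshots system table before pinning a query to one:
-- List committed snapshots and their commit times.
SELECT snapshot_id, commit_time FROM orders_pk$snapshots;
-- Then read the table as of a specific snapshot.
SELECT * FROM orders_pk /*+ OPTIONS('scan.snapshot-id' = '2') */;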
Compaction Strategy
Paimon's LSM tree adopts a compaction approach modeled on RocksDB-style universal compaction. The two classic strategies it draws on:
Leveled compaction (RocksDB's default), which keeps read and space amplification low at the cost of more write work.
Size-tiered compaction, which merges sorted runs of similar size to reduce write amplification.
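Compaction behavior is tunable per table. A hedged sketch, assuming Paimon's sorted-run trigger options keep their documented names, and reusing the hypothetical orders_pk table:
-- Allow more sorted runs before compaction triggers: cheaper writes, costlier reads.
ALTER TABLE orders_pk SET ('num-sorted-run.compaction-trigger' = '10');
-- Writers pause once this many sorted runs accumulate, bounding read amplification.
ALTER TABLE orders_pk SET ('num-sorted-run.stop-trigger' = '20');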
StarRocks × Paimon Integration
Supports HDFS and object storage (S3/OSS).
Metadata management via Hive Metastore (HMS) and Alibaba Cloud DLF.
Query both Primary Key and Append‑Only tables.
Access to Paimon system tables (e.g., read‑optimized, snapshots).
Cross‑format joins between Paimon and other lake formats.
Cross‑engine joins between Paimon tables and StarRocks internal tables.
Data cache acceleration.
Materialized view creation on Paimon tables for transparent query acceleration (see the sketch after this list).
Deletion-vector support, which keeps queries fast on tables with row-level deletes.
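A sketch of the materialized-view path (the refresh interval and view name are illustrative; the catalog matches the one created later in this guide):
-- Async materialized view in StarRocks that pre-aggregates a Paimon table.
CREATE MATERIALIZED VIEW sales_by_region_mv
REFRESH ASYNC EVERY (INTERVAL 10 MINUTE)
AS
SELECT region, SUM(total_sales) AS region_sales
FROM paimon_catalog_oss.`default`.hourly_regional_sales
GROUP BY region;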
Quick‑Start Setup
Component Versions
StarRocks: 3.3.0
Flink: 1.19.1
Paimon: 0.8.2
Kafka: 3.7.0
Download & Install
wget "http://mirrors.cloud.aliyuncs.com/apache/flink/flink-1.19.1/flink-1.19.1-bin-scala_2.12.tgz"
tar -xf flink-1.19.1-bin-scala_2.12.tgz
wget "https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-flink-1.19/0.8.2/paimon-flink-1.19-0.8.2.jar"
wget "https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-oss/0.8.2/paimon-oss-0.8.2.jar"
wget "https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.7.5-10.0/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar"
wget "https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/3.2.0-1.18/flink-sql-connector-kafka-3.2.0-1.18.jar"
wget "https://github.com/StarRocks/starrocks-connector-for-apache-flink/releases/download/v1.2.9/flink-connector-starrocks-1.2.9_flink-1.18.jar"
cp paimon-flink-1.19-0.8.2.jar paimon-oss-0.8.2.jar flink-shaded-hadoop-2-uber-2.7.5-10.0.jar flink-connector-starrocks-1.2.9_flink-1.18.jar flink-sql-connector-kafka-3.2.0-1.18.jar flink-1.19.1/lib/
Start Flink Cluster
cd flink-1.19.1
# Edit conf/config.yaml and set numberOfTaskSlots: 10
./bin/start-cluster.sh
Kafka Deployment
wget "http://mirrors.cloud.aliyuncs.com/apache/kafka/3.7.0/kafka_2.12-3.7.0.tgz"
tar -xf kafka_2.12-3.7.0.tgz
cd kafka_2.12-3.7.0
./bin/zookeeper-server-start.sh -daemon ./config/zookeeper.properties
./bin/kafka-server-start.sh -daemon ./config/server.properties
Demo Data Generation
./bin/kafka-topics.sh --create --topic order-details --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
from kafka import KafkaProducer  # pip install kafka-python
import time, json, random
from datetime import datetime, timedelta

# Generate one synthetic order every 3 seconds and publish it to Kafka.
start_time = datetime(2024, 7, 24, 15, 0, 0)
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
try:
    while True:
        order_id = random.randint(1, 10000)
        user_id = random.randint(1, 8)
        order_amount = round(random.uniform(10.0, 1000.0), 2)
        # Spread order times across a one-hour window after start_time.
        random_time = start_time + timedelta(seconds=random.randint(0, 3600))
        data = {
            "order_id": order_id,
            "user_id": user_id,
            "order_amount": order_amount,
            "order_time": random_time.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3],
        }
        producer.send('order-details', value=json.dumps(data).encode('utf-8'))
        time.sleep(3)
finally:
    # Flush and close the producer on interrupt.
    producer.close()
Create Tables & Load Data
CREATE TABLE `users` (
`user_id` BIGINT NOT NULL,
`region` VARCHAR(65533) NULL
) ENGINE=OLAP PRIMARY KEY(`user_id`) DISTRIBUTED BY HASH(`user_id`);
INSERT INTO users VALUES (1,'BeiJing'),(2,'TianJin'),(3,'XiAn'),(4,'ShenZhen'),(5,'BeiJing'),(6,'BeiJing'),(7,'ShenZhen'),(8,'ShenZhen');
Then, in the Flink SQL client, create a Paimon catalog backed by OSS:
CREATE CATALOG my_catalog_oss WITH (
'type' = 'paimon',
'warehouse' = 'oss://starrocks-public/dba/jingdan/paimon',
'fs.oss.endpoint' = 'oss-cn-zhangjiakou-internal.aliyuncs.com',
'fs.oss.accessKeyId' = 'ak',
'fs.oss.accessKeySecret' = 'sk'
);
USE CATALOG my_catalog_oss;
CREATE TABLE hourly_regional_sales (
event_time TIMESTAMP(3),
region STRING,
total_sales DECIMAL(10,2)
);
USE CATALOG default_catalog;
CREATE TABLE orders_kafka (
order_id BIGINT,
user_id BIGINT,
order_amount DECIMAL(10,2),
order_time TIMESTAMP(3),
WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'order-details',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'order-consumer',
'format' = 'json',
'scan.startup.mode' = 'latest-offset'
);
CREATE TABLE users_starrocks (
user_id BIGINT,
region STRING
) WITH (
'connector'='starrocks',
'scan-url'='172.26.92.154:8030',
'jdbc-url'='jdbc:mysql://172.26.92.154:9030',
'username'='root',
'password'='xxx',
'database-name'='jd',
'table-name'='users'
);
SET 'execution.checkpointing.interval' = '10 s';
INSERT INTO my_catalog_oss.`default`.hourly_regional_sales
SELECT TUMBLE_START(order_time, INTERVAL '5' MINUTE) AS event_time,
u.region,
CAST(SUM(o.order_amount) AS DECIMAL(10,2)) AS total_sales
FROM default_catalog.`default_database`.orders_kafka AS o
JOIN default_catalog.`default_database`.users_starrocks AS u ON o.user_id = u.user_id
GROUP BY TUMBLE(order_time, INTERVAL '5' MINUTE), u.region;
Query & Time-Travel
SELECT * FROM my_catalog_oss.`default`.hourly_regional_sales;
SET 'execution.runtime-mode' = 'batch';
SELECT * FROM hourly_regional_sales /*+ OPTIONS('scan.snapshot-id' = '2') */;
SELECT * FROM hourly_regional_sales /*+ OPTIONS('incremental-between' = '5,10') */;
Create StarRocks Paimon Catalog
CREATE EXTERNAL CATALOG paimon_catalog_oss
PROPERTIES (
"type" = "paimon",
"paimon.catalog.type" = "filesystem",
"paimon.catalog.warehouse" = "oss://starrocks-public/dba/jingdan/paimon",
"aliyun.oss.access_key" = "ak",
"aliyun.oss.secret_key" = "sk",
"aliyun.oss.endpoint" = "oss-cn-zhangjiakou-internal.aliyuncs.com"
);
SET CATALOG paimon_catalog_oss;
USE `default`;
SELECT * FROM hourly_regional_sales;
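Cross-engine joins work the same way. A minimal sketch, assuming the StarRocks-internal users table created earlier lives in database jd:
-- Join the Paimon table with a StarRocks internal table in one statement.
SELECT s.event_time, s.region, s.total_sales
FROM paimon_catalog_oss.`default`.hourly_regional_sales AS s
JOIN (SELECT DISTINCT region FROM default_catalog.jd.users) AS u
  ON s.region = u.region;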
StarRocks
StarRocks is an open-source project under the Linux Foundation focused on building a high-performance, scalable analytical database that lets enterprises adopt an efficient, unified lakehouse architecture. It is widely used across industries worldwide to strengthen companies' data analytics capabilities.