
How Paimon + StarRocks Accelerated Double‑11 OLAP, Cutting Refresh Time by 80%

This article explains how Taotian Group unified real‑time and offline data by using Paimon as lake storage and StarRocks for high‑performance OLAP. The move eliminated costly sync pipelines, cut refresh time by about 80%, and saves nearly ten million yuan annually. It also details the architecture, cluster safeguards, configuration tweaks, monitoring, and future roadmap for large‑scale promotional events.


Background and Motivation

During major promotional events such as Double 11, Taotian Group experiences a sudden surge in OLAP query traffic, demanding high stability, low cost, and robust governance for both real‑time and batch data products. Traditional multi‑storage, multi‑link pipelines with back‑filling become bottlenecks in development efficiency, cost, and latency.

Current Data Architecture

Data flows from the DWD layer through two main streams:

Real‑time data is stored in TT (similar to Kafka).

Offline data resides in ODPS.

Flink stream‑batch jobs continuously consume TT for real‑time processing and ODPS for batch processing, joining various dimension tables (e.g., category, merchant hierarchy) and finally writing results to Holo tables in the ADS layer for downstream services.

In pure offline scenarios, ODPS reads and writes to ADS tables, optionally loading data into Holo for query acceleration. Query latency requirements dictate whether Holo (millisecond‑level) or MC, i.e., MaxCompute/ODPS (hundreds of milliseconds to seconds), is used.

Business Pain Points

Fragmented storage: real‑time data in TT, offline data in ODPS, and occasional Holo copies increase cost and complexity.

Complex development pipeline: data must traverse multiple storage media, extending end‑to‑end latency.

Shuffle‑heavy OLAP queries: performance suffers, especially when joining large dimension tables.

High back‑fill cost: any dimension‑table change forces expensive back‑filling.

Downgraded freshness: near‑real‑time requirements (e.g., cross‑day UV) are often relaxed to hourly granularity due to cost and state size.

Core Strategy

1) Architecture Simplification and Efficiency

Unify storage by persisting both real‑time and offline data into Paimon lake tables.

StarRocks directly queries Paimon, removing sync pipelines and duplicate storage.
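Removing the sync pipeline typically comes down to registering Paimon as an external catalog in StarRocks, after which lake tables are queryable in place. A minimal sketch, assuming a filesystem‑backed Paimon warehouse (the catalog name and warehouse path are illustrative):

```sql
-- Register a Paimon catalog; the warehouse path is a placeholder.
CREATE EXTERNAL CATALOG paimon_catalog
PROPERTIES (
    "type" = "paimon",
    "paimon.catalog.type" = "filesystem",
    "paimon.catalog.warehouse" = "oss://example-bucket/paimon_warehouse"
);

-- Paimon tables are then queryable directly, with no ingestion step:
-- SELECT count(*) FROM paimon_catalog.dwd_db.orders;
```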

2) Lowering Usage Barriers

Paimon provides explicit schema, enabling analysts to self‑serve near‑real‑time data via StarRocks without heavy engineering effort.

Paimon + StarRocks Practice

After unifying storage, Paimon holds minute‑level real‑time data and historical partitions. Fluss (a second‑level real‑time stream) feeds Paimon every 3 minutes by default, configurable per user. Flink jobs read both current‑day and historical Paimon data, writing results back to ADS and DWD layers.

Paimon’s partial‑update capability allows building wide tables that store multiple object states (e.g., order status) in a single row, simplifying downstream reads.
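Partial‑update is declared as the table's merge engine; each writer then supplies only the columns it owns, and Paimon merges them into one row per primary key. A sketch in Flink SQL, with illustrative table and column names:

```sql
-- Wide order table; the pay and ship pipelines each update only their own columns.
CREATE TABLE orders_wide (
    order_id    BIGINT,
    pay_status  STRING,
    ship_status STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'merge-engine' = 'partial-update',
    -- downstream streaming reads require a changelog producer, e.g. lookup
    'changelog-producer' = 'lookup'
);
```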

StarRocks reads from either the ADS layer (point‑lookup) or the DWD intermediate layer (online computation). Even heavy queries remain within second‑level latency, and StarRocks can join Paimon dimension tables on‑the‑fly.

Cluster‑Side Safeguards

Data‑cache window of 180 seconds to deduplicate identical queries.

Global query timeout of 30 seconds; slow SQLs are terminated.

Warehouse isolation based on business importance (default, heavy‑protect, BI‑dedicated).
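In open‑source StarRocks, resource groups provide a comparable isolation mechanism to warehouses. A sketch, where the group names, classifiers, and limits are illustrative rather than the article's actual configuration:

```sql
-- Heavily protected group for promotion-critical dashboards (names are assumptions).
CREATE RESOURCE GROUP rg_heavy_protect
TO (user = 'promo_dashboard')
WITH ('cpu_core_limit' = '32', 'mem_limit' = '40%');

-- Dedicated group for BI analysts.
CREATE RESOURCE GROUP rg_bi
TO (user = 'bi_analyst')
WITH ('cpu_core_limit' = '16', 'mem_limit' = '20%');
```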

Initial cluster parameters (examples):

set global cbo_cte_reuse_rate=0;                -- disable CTE reuse in the cost‑based optimizer
set global query_timeout=30;                    -- terminate queries running longer than 30 s
set global new_planner_optimize_timeout=10000;  -- planner optimization budget, in ms
set global pipeline_dop=8;                      -- per‑query pipeline degree of parallelism
set global scan_paimon_partition_num_limit=100; -- cap Paimon partitions scanned per query

Monitoring and Alerting

Key metrics (CPU, memory, availability, queue length, query latency, failure counts) are tracked per BE/CN and FE. Alerts trigger when thresholds (e.g., CPU > 70%, queue length > 2000) are exceeded. Audit logs are stored in StarRocks internal tables, enabling real‑time SQL execution monitoring and downstream metadata dashboards.

Example audit query:

select * from _starrocks_audit_db_.starrocks_audit_tbl;
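The same audit table can back slow‑query dashboards. A sketch, assuming the default column names from the StarRocks audit‑loader plugin (queryTime is in milliseconds):

```sql
-- Top 20 slowest statements in the last hour (columns per audit-loader defaults).
SELECT user, queryTime, stmt
FROM _starrocks_audit_db_.starrocks_audit_tbl
WHERE `timestamp` >= now() - INTERVAL 1 HOUR
  AND queryTime > 10000          -- slower than 10 s
ORDER BY queryTime DESC
LIMIT 20;
```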

Performance Optimizations Discovered in Stress Tests

Partition pruning failures – enforce proper partition filters or date functions; avoid comparisons with sub‑query results.
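The pattern is to keep the partition‑column predicate a constant or a deterministic date expression; a sub‑query on the right‑hand side blocks pruning because its value is unknown at plan time. Table and column names below are illustrative:

```sql
-- Pruning works: constant predicate on the partition column.
SELECT count(*) FROM dwd_orders WHERE dt = '2024-11-11';

-- Pruning also works with a date function folded at plan time.
SELECT count(*) FROM dwd_orders
WHERE dt = date_format(current_date(), '%Y-%m-%d');

-- Pruning fails: the partition value comes from a sub-query,
-- so every partition must be scanned.
SELECT count(*) FROM dwd_orders
WHERE dt = (SELECT max(dt) FROM dwd_orders);
```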

Excessive small files in Paimon – enable sorting via clustering columns and use branch tables; set sink parallelism hint: /*+ OPTIONS('sink.parallelism' = '64') */.
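In Flink SQL the hint attaches to the sink table of the INSERT; combined with clustering columns, it keeps file counts manageable. A sketch with illustrative table names:

```sql
-- Raise sink parallelism for the Paimon write; table names are placeholders.
INSERT INTO dwd_orders_paimon /*+ OPTIONS('sink.parallelism' = '64') */
SELECT order_id, user_id, gmv, dt
FROM src_orders;
```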

Missing broadcast for small dimension tables – add [broadcast] hint to force map‑join, reducing latency from tens of seconds to ~3 seconds.
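In StarRocks the hint goes after the JOIN keyword, sending the small dimension table to every node instead of shuffling the large fact table. Illustrative names:

```sql
-- Force a broadcast join on the small category dimension.
SELECT d.category_name, sum(f.gmv) AS gmv
FROM ads_fact f
JOIN [broadcast] dim_category d ON f.category_id = d.category_id
GROUP BY d.category_name;
```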

Cross‑region access latency – co‑locate StarRocks and Paimon storage in the same region.

Deletion‑vector for primary‑key tables – enable 'deletion-vectors.enabled'='true' to skip logically deleted rows.
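The property can be set at table creation or toggled afterwards; with it enabled, readers consult the deletion vector and skip logically deleted rows rather than merging at read time. A sketch (table name is illustrative):

```sql
-- Enable deletion vectors on an existing Paimon primary-key table.
ALTER TABLE orders_pk SET ('deletion-vectors.enabled' = 'true');
```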

Results and Future Plans

Data pipeline simplified; storage cost and complexity reduced.

Data usage barrier lowered; analysts can self‑serve near‑real‑time data.

Back‑fill time cut by ~80%, saving ~10 million CNY annually.

Cross‑day real‑time UV computed cheaply using RoaringBitmap, meeting promotional decision‑making needs.
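Cross‑day UV with RoaringBitmap typically relies on StarRocks' BITMAP aggregate type: per‑day bitmaps stay compact, and a union across days yields the exact distinct count without holding raw user IDs in state. A sketch with illustrative names:

```sql
-- Daily user bitmaps, merged on load via BITMAP_UNION.
CREATE TABLE uv_daily (
    dt    DATE,
    users BITMAP BITMAP_UNION
) AGGREGATE KEY (dt)
DISTRIBUTED BY HASH (dt);

INSERT INTO uv_daily
SELECT dt, to_bitmap(user_id) FROM dwd_orders;

-- Exact cross-day UV over the whole Double 11 window.
SELECT bitmap_union_count(users) AS uv
FROM uv_daily
WHERE dt BETWEEN '2024-11-01' AND '2024-11-11';
```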

Future directions include stronger automatic materialization in StarRocks, richer metadata capabilities, improved scheduler CPU balancing, and direct Fluss ingestion for sub‑second latency.

Key Images

Current Data Architecture
New Data Architecture
Cluster Cache Strategy
Alert Rules
Core Metric Dashboard
Promotion Safeguard
Partition Pruning Issue
Small File Issue
Tags: big data, real-time analytics, StarRocks, performance tuning, OLAP, Paimon, data architecture
Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
