Big Data 19 min read

How RedBI Boosted Query Speed 3× with StarRocks & Iceberg Lakehouse

The article details how Xiaohongshu's RedBI self‑service analytics platform transformed its architecture by integrating StarRocks and Iceberg, replacing ClickHouse‑based storage with Parquet, introducing DataCache, Z‑Order sorting and intelligent key selection, achieving a three‑fold P90 query speed improvement, sub‑10‑second latency, and halving storage consumption.

StarRocks

Feb 20, 2025

Background and Original Architecture

RedBI is Xiaohongshu's visual, ad‑hoc query tool designed for analysts. The original system relied on a custom ClickHouse compute‑storage separation solution, using Hive2CK ETL jobs to move data from Hive to ClickHouse and caching hot data locally. This design suffered from three main pain points: degrading query performance as index strategies became stale, limited Join capabilities due to ClickHouse’s bucket‑node coupling, and inflexible data access that could impact online services.

Lakehouse Architecture Upgrade

To address these issues, the platform migrated to a lakehouse model:

Replaced Hive2CK with Hive2Iceberg , storing data in Parquet format.

Adopted StarRocks as the OLAP engine, leveraging its elastic scaling and fast response.

Enabled DataCache to cache hot data on BE nodes, eliminating repeated remote I/O.

Introduced Z‑Order sorting and an intelligent sorting‑key selection mechanism to keep hot columns optimally ordered.

These changes reduced actual storage space by roughly 50% while keeping row counts unchanged.

Performance Gains

In a POC using the top‑10 hottest datasets, the new lakehouse achieved a 3× improvement in P90 query latency , stabilising response times under 10 seconds. DataCache contributed an additional ≈20% P90 latency reduction and cut network bandwidth usage by 2‑3×.

Key Technical Enhancements

Audit‑log plugin extension : added Iceberg ScanReport fields (e.g., nonPartFilterCols, resultDataFiles, dataInBytes) to capture per‑query column usage.

Intelligent sort‑column selection : combined audit‑log data with table statistics to estimate user habits, automatically choosing columns with high NDV, frequency (>15% of queries), sufficient file count, and top‑3 query rank.

Async Z‑Order rewrite : DLF triggers rewrite_zorder_dataFiles tasks after Hive2Iceberg jobs finish; Iceberg’s atomic commit ensures zero impact on analysts.

Flexible JoinKey strategy : introduced configurable JoinKey with Broadcast and Shuffle joins, supporting up to four dimension tables per fact table to avoid data skew while remaining independent of bucket count.

Data Skipping via Iceberg min‑max indexes : Iceberg provides per‑column min‑max metadata, enabling predicate push‑down at the DataFile level. Example index fields include:

DataFile.RECORD_COUNT, DataFile.FILE_SIZE, DataFile.COLUMN_SIZES, DataFile.VALUE_COUNTS, DataFile.NULL_VALUE_COUNTS, DataFile.NAN_VALUE_COUNTS, DataFile.LOWER_BOUNDS, DataFile.UPPER_BOUNDS, ...

Z‑Order Sorting Explained

Z‑Order (space‑filling curve) interleaves binary representations of multi‑dimensional coordinates to produce a linear index, improving range‑query performance. For a 2‑D point (x=3 → 011, y=5 → 101), the Z‑Order key becomes 0 1 1 0 1 1.

Sorting‑Column Selection Rules

Unique value count (NDV) ≥ 15.

Frequency proportion ≥ 0.15 of total table queries.

Partition must contain > 10 files.

Column must rank within the top‑3 most‑used columns.

Overall query share must be significant; low‑share columns are excluded.

Large‑Query Optimization

StarRocks now supports EXPLAIN ESTIMATE for Iceberg tables. If the estimated data volume exceeds a threshold, the query is automatically routed to a smaller StarRocks cluster or a Spark cluster, preventing resource exhaustion and stabilising CPU utilisation and P90 latency.

Project Benefits

Since the upgrade, P90 response time has improved three‑fold, consistently staying below 10 seconds. Storage efficiency doubled thanks to Parquet‑based Iceberg tables. The platform now offers greater self‑service flexibility, faster joins, and robust scaling.

Future Roadmap

The team plans to explore a near‑real‑time lakehouse by combining StarRocks with Paimon and to design optimisations for primary‑key‑driven lake analytics scenarios.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

StarRocks Iceberg Self‑service analytics Lakehouse Z-Order DataCache

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.