How RedBI Boosted Query Speed 3× with StarRocks & Iceberg Lakehouse
The article details how Xiaohongshu's RedBI self‑service analytics platform transformed its architecture by integrating StarRocks and Iceberg, replacing ClickHouse‑based storage with Parquet, introducing DataCache, Z‑Order sorting and intelligent key selection, achieving a three‑fold P90 query speed improvement, sub‑10‑second latency, and halving storage consumption.
Background and Original Architecture
RedBI is Xiaohongshu's visual, ad‑hoc query tool designed for analysts. The original system relied on a custom ClickHouse compute‑storage separation solution, using Hive2CK ETL jobs to move data from Hive to ClickHouse and caching hot data locally. This design suffered from three main pain points: degrading query performance as index strategies became stale, limited Join capabilities due to ClickHouse’s bucket‑node coupling, and inflexible data access that could impact online services.
Lakehouse Architecture Upgrade
To address these issues, the platform migrated to a lakehouse model:
Replaced Hive2CK with Hive2Iceberg , storing data in Parquet format.
Adopted StarRocks as the OLAP engine, leveraging its elastic scaling and fast response.
Enabled DataCache to cache hot data on BE nodes, eliminating repeated remote I/O.
Introduced Z‑Order sorting and an intelligent sorting‑key selection mechanism to keep hot columns optimally ordered.
These changes reduced actual storage space by roughly 50% while keeping row counts unchanged.
Performance Gains
In a POC using the top‑10 hottest datasets, the new lakehouse achieved a 3× improvement in P90 query latency , stabilising response times under 10 seconds. DataCache contributed an additional ≈20% P90 latency reduction and cut network bandwidth usage by 2‑3×.
Key Technical Enhancements
Audit‑log plugin extension : added Iceberg ScanReport fields (e.g., nonPartFilterCols, resultDataFiles, dataInBytes) to capture per‑query column usage.
Intelligent sort‑column selection : combined audit‑log data with table statistics to estimate user habits, automatically choosing columns with high NDV, frequency (>15% of queries), sufficient file count, and top‑3 query rank.
Async Z‑Order rewrite : DLF triggers rewrite_zorder_dataFiles tasks after Hive2Iceberg jobs finish; Iceberg’s atomic commit ensures zero impact on analysts.
Flexible JoinKey strategy : introduced configurable JoinKey with Broadcast and Shuffle joins, supporting up to four dimension tables per fact table to avoid data skew while remaining independent of bucket count.
Data Skipping via Iceberg min‑max indexes : Iceberg provides per‑column min‑max metadata, enabling predicate push‑down at the DataFile level. Example index fields include:
DataFile.RECORD_COUNT, DataFile.FILE_SIZE, DataFile.COLUMN_SIZES, DataFile.VALUE_COUNTS, DataFile.NULL_VALUE_COUNTS, DataFile.NAN_VALUE_COUNTS, DataFile.LOWER_BOUNDS, DataFile.UPPER_BOUNDS, ...Z‑Order Sorting Explained
Z‑Order (space‑filling curve) interleaves binary representations of multi‑dimensional coordinates to produce a linear index, improving range‑query performance. For a 2‑D point (x=3 → 011, y=5 → 101), the Z‑Order key becomes 0 1 1 0 1 1.
Sorting‑Column Selection Rules
Unique value count (NDV) ≥ 15.
Frequency proportion ≥ 0.15 of total table queries.
Partition must contain > 10 files.
Column must rank within the top‑3 most‑used columns.
Overall query share must be significant; low‑share columns are excluded.
Large‑Query Optimization
StarRocks now supports EXPLAIN ESTIMATE for Iceberg tables. If the estimated data volume exceeds a threshold, the query is automatically routed to a smaller StarRocks cluster or a Spark cluster, preventing resource exhaustion and stabilising CPU utilisation and P90 latency.
Project Benefits
Since the upgrade, P90 response time has improved three‑fold, consistently staying below 10 seconds. Storage efficiency doubled thanks to Parquet‑based Iceberg tables. The platform now offers greater self‑service flexibility, faster joins, and robust scaling.
Future Roadmap
The team plans to explore a near‑real‑time lakehouse by combining StarRocks with Paimon and to design optimisations for primary‑key‑driven lake analytics scenarios.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
