How Xiaohongshu Scaled Real‑Time Analytics with StarRocks: 6‑7× Faster Queries and 35% Cost Savings
Xiaohongshu’s OLAP team migrated from Presto to StarRocks, growing to 30 clusters, boosting query speed 6‑7×, cutting average latency to ~200 ms, and achieving up to 35% cost savings through gray‑scale migration and AWS Spot‑based elastic scaling.
Background and Scale
Over the past two years Xiaohongshu’s OLAP platform has adopted StarRocks as a core engine. The deployment grew from a few clusters to 30 clusters with a total of 30,000 CPU cores. Daily ingestion reaches the hundred‑billion‑row level, and the system processes over 100 million queries per day. Individual clusters sustain peak QPS of 2‑3 k with an average latency of ~200 ms.
Data Platform Architecture
Storage layer: cloud object storage (e.g., S3, OSS).
Table‑format layer: Hive and Iceberg metadata.
Data processing layer: offline pipelines built on Spark; real‑time pipelines built on Flink.
Query layer: multiple engines (Presto, StarRocks, ClickHouse, etc.) accessed via a gateway service (Kyuubi).
Application layer: real‑time dashboards, self‑service analytics, ad‑hoc analysis.
Real‑Time Analytics with StarRocks
StarRocks’ compute‑storage integrated model now powers reporting for ads, community, and live‑commerce. All major business domains are served directly from StarRocks without an intermediate data sync step.
Problems with the Presto‑Centric Stack
Architectural complexity: maintaining three separate query engines (Presto, ClickHouse, StarRocks) increased operational overhead.
Performance tuning on Presto was difficult and often required manual configuration.
Presto’s single‑coordinator (master‑slave) architecture introduced a single‑point‑of‑failure risk.
Ad‑hoc workloads required full data sync to StarRocks, which was impractical for the petabyte‑scale lake.
Migration Rationale
StarRocks delivers a higher performance‑to‑cost ratio, especially for low‑latency, high‑concurrency queries.
Replacing Presto simplifies the tech stack, reducing O&M effort.
Migration steps are straightforward:
Define an external catalog in StarRocks that points to Hive/Iceberg tables.
Port existing Presto Java UDFs to StarRocks (StarRocks supports Java UDFs out of the box).
Enable Trino dialect compatibility in StarRocks 3.0 to minimize user‑query changes.
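The setup steps above can be sketched as the statements a migration script would issue against StarRocks. The catalog name and metastore URI below are illustrative placeholders, not Xiaohongshu's actual configuration; check the StarRocks documentation for the exact properties your deployment needs.

```python
# Sketch of the StarRocks-side setup for querying lake tables in place.
# Catalog name and metastore URI are hypothetical examples.

CREATE_CATALOG = """
CREATE EXTERNAL CATALOG hive_catalog
PROPERTIES (
    "type" = "hive",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
)
""".strip()

# StarRocks 3.x exposes Trino dialect compatibility via a session variable,
# which minimizes rewrites of existing Presto/Trino queries.
ENABLE_TRINO_DIALECT = "SET sql_dialect = 'trino'"


def migration_statements():
    """Return the setup statements in the order they would be executed."""
    return [CREATE_CATALOG, ENABLE_TRINO_DIALECT]
```

A real migration script would execute these over the MySQL protocol that StarRocks' FE speaks, then port any remaining Java UDFs.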
Validation Dimensions
Correctness: Tested from StarRocks 3.0 to 3.1; the community quickly resolved early bugs.
Stability: Version 3.1 ran continuously for more than a week under 10‑30 concurrent queries, outperforming Presto.
Performance: Executed 3,000 production queries at 10, 20, and 30 concurrency levels. 96% of queries were faster, with an average speedup of more than 4× across all concurrency levels.
Compatibility: After syntax enhancements, Presto‑compatible syntax coverage reached 90% (CTAS added in 3.2), with overall query coverage around 85%.
Gray‑Scale Migration Mechanism
The original Presto cluster was split and a new StarRocks cluster was added. Query routing is performed by Kyuubi:
When a query arrives, Kyuubi parses the SQL and checks whether the syntax is supported by StarRocks.
If unsupported, the query is sent to Presto.
If supported, the query is dispatched to StarRocks or Presto based on a configurable gray‑scale ratio (e.g., 0 % → 100 % over a month). The ratio can be adjusted in real time without service interruption.
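The routing logic above can be sketched as a single decision function. This is a minimal illustration of the gray-scale mechanism, not Kyuubi's actual implementation; the syntax-support check is passed in as a callable because in practice it would invoke a SQL parser.

```python
import random


def route_query(sql: str, starrocks_ratio: float, supports) -> str:
    """Decide which engine serves a query, mimicking the gateway's gray-scale routing.

    sql:              the incoming query text.
    starrocks_ratio:  fraction of eligible traffic sent to StarRocks (0.0-1.0),
                      adjustable at runtime to ramp from 0% to 100%.
    supports:         callable reporting whether StarRocks can parse the SQL.
    """
    if not supports(sql):
        # Unsupported syntax always falls back to Presto.
        return "presto"
    # Eligible queries are split by the configurable gray-scale ratio.
    if random.random() < starrocks_ratio:
        return "starrocks"
    return "presto"
```

Because the ratio is read on every call, raising it from 0% toward 100% shifts traffic gradually without restarting the gateway.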
Elastic Scaling with AWS Spot Instances
StarRocks compute nodes (CN) are containerized. An automation framework performs the following steps:
Detect scaling need (CPU or query‑concurrency threshold).
Request Spot instances from AWS via the EC2 API.
Deploy the CN container on the acquired instance using a predefined script.
Register the new CN with the Front‑End (FE) service via the StarRocks REST registration API.
On scale‑down, deregister the CN from FE and release the Spot instance back to AWS.
The entire scale‑in/out cycle completes within 2 minutes. During low‑traffic windows (00:00‑08:00) up to 90% of CN nodes are returned to Spot, achieving a ~35% cost reduction while keeping query latency stable.
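The sizing decision behind this loop can be sketched as a pure function: given the hour and current load, compute the desired CN count, then let the automation framework reconcile toward it with Spot requests and FE registration. The thresholds and the 25% scale-out step below are illustrative assumptions, not Xiaohongshu's actual values.

```python
def target_cn_count(hour: int, cpu_util: float, base_count: int,
                    cpu_high: float = 0.8, night_keep: float = 0.1) -> int:
    """Compute the desired CN count for one control-loop tick.

    hour:        current hour of day (0-23).
    cpu_util:    current average CPU utilization (0.0-1.0).
    base_count:  current number of CN nodes.
    cpu_high:    utilization threshold that triggers scale-out (assumed).
    night_keep:  fraction of CNs kept during the low-traffic window (assumed).
    """
    if 0 <= hour < 8:
        # Low-traffic window: release up to 90% of CNs back to Spot.
        return max(1, int(base_count * night_keep))
    if cpu_util > cpu_high:
        # Scale out by 25% (illustrative step size) when CPU runs hot.
        return base_count + max(1, base_count // 4)
    return base_count
```

The surrounding framework would then request or release Spot instances until the live CN count matches the target, registering and deregistering nodes with the FE as described above.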
Future Plans
Short‑term : Expand lake‑warehouse capabilities (e.g., broader Iceberg support), improve Java UDF ergonomics, and add more Trino‑compatible syntax.
Long‑term : Pursue full compute‑storage separation and a unified lake‑warehouse architecture. This includes:
Materialized view generation from Spark to produce StarRocks‑native formats for accelerated queries.
Dynamic, query‑driven creation of materialized views to automatically cover hot data.
Integration of Spark‑generated StarRocks files into the lake, enabling true read‑write separation.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.