How StarRocks Boosted Query Performance 2‑3× for a 1TB‑Daily Data Platform
The Qunhe Technology data team replaced its legacy Hadoop and Presto clusters with the StarRocks MPP database. Queries now run up to three times faster, and the platform serves billion‑row tables with sub‑second latency for both real‑time and analytical workloads while taking in roughly 1 TB of new data per day.
Background
The data team required a high‑performance big‑data platform to support daily BI, commercial data products and real‑time analytics for a 3D design SaaS platform that generates roughly 400,000 design schemes per day. Incremental data volume approaches 1 TB per day.
Challenges
Rapid data growth leading to massive incremental tables.
Numerous offline and real‑time ETL jobs performing simple aggregations, causing an explosion of aggregated tables and high operational cost.
Medium‑scale queries needed sub‑200 ms response times.
Real‑time use cases (user profiling, DMP, monitoring) required low‑latency inserts and updates.
Engine Evaluation and Selection
Benchmarks of open‑source OLAP engines (Impala, Druid, ClickHouse, StarRocks) showed that StarRocks best satisfied the requirements thanks to its MPP architecture, native storage, primary‑key updates, materialized‑view support, and ability to handle both high‑concurrency ad‑hoc queries and high‑throughput workloads.
Architecture & Implementation
StarRocks was deployed on a 10‑node physical‑machine cluster, replacing an 8‑node cloud‑DB cluster and an 8‑node Presto cluster. Data ingestion is handled as follows:
Offline data resides in ODPS and is batch‑synced to StarRocks via DataX using the StarRocksWriter plugin, which leverages Stream Load.
Real‑time data streams from Apache Kafka and is written to StarRocks either through Routine Load or through Flink CDC with the flink-connector-starrocks plugin, which, like StarRocksWriter, is built on Stream Load.
StarRocks provides a MySQL‑compatible SQL interface for rapid development of data‑driven features.
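Both the DataX writer and the Flink connector end up issuing Stream Load requests, so it can help to see roughly what one looks like. The sketch below only builds the HTTP PUT request and does not send it. The host name, database, table and label are hypothetical, and port 8030 is assumed to be the default FE HTTP port. A real client would also need to follow the redirect from the FE to a BE node, which is left out here.

```python
import base64
import urllib.request


def build_stream_load_request(fe_host, database, table, label, rows,
                              user="root", password="", http_port=8030):
    """Build (but do not send) a Stream Load HTTP PUT request carrying CSV rows."""
    url = f"http://{fe_host}:{http_port}/api/{database}/{table}/_stream_load"
    # Serialize each row as one comma-separated line.
    body = "\n".join(",".join(str(v) for v in row) for row in rows).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        "label": label,            # unique per batch, so a retried batch is not loaded twice
        "format": "csv",
        "column_separator": ",",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


# Hypothetical host, database and table names, for illustration only.
req = build_stream_load_request(
    "fe.example.internal", "demo_db", "design_events",
    "batch-0001", [(1, "a"), (2, "b")],
)
print(req.get_method(), req.full_url)
```

Keeping the label unique per micro-batch is what lets a loader retry safely after a timeout.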
Performance Results
Online query P95 latency dropped to the millisecond level; analytical query P95 latency remained within seconds.
Aggregation and deduplication queries over billion‑row detail tables complete in approximately 500 ms.
Colocate Join optimization enables multi‑table joins on tens of millions of rows to finish within seconds.
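The idea behind Colocate Join can be illustrated with a small simulation: when two tables are bucketed on the join key with the same hash function and bucket count, matching rows always land in the same bucket, so each bucket can be joined locally without moving data between nodes. This is a conceptual sketch only, not StarRocks code. The table contents, the bucket count and the use of CRC32 as the hash are assumptions made for illustration.

```python
from collections import defaultdict
from zlib import crc32

NUM_BUCKETS = 4  # assumption: both tables share the same bucket count


def bucket_of(key):
    """Map a join key to a bucket; both tables must use the same function."""
    return crc32(str(key).encode("utf-8")) % NUM_BUCKETS


def distribute(rows, key_index=0):
    """Place each row in the bucket chosen by its join key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key_index])].append(row)
    return buckets


def colocated_join(left, right):
    """Hash-join bucket by bucket; rows never cross bucket boundaries."""
    left_buckets, right_buckets = distribute(left), distribute(right)
    joined = []
    for b in range(NUM_BUCKETS):
        lookup = defaultdict(list)
        for r in right_buckets.get(b, []):
            lookup[r[0]].append(r)
        for l in left_buckets.get(b, []):
            for r in lookup.get(l[0], []):
                joined.append(l + r[1:])
    return joined


# Hypothetical tables keyed by design id.
designs = [(1, "kitchen"), (2, "bedroom"), (3, "office")]
renders = [(1, "4k"), (3, "1080p"), (3, "4k"), (4, "720p")]
result = sorted(colocated_join(designs, renders))
print(result)
```

The per-bucket result matches an ordinary join over the full tables, which is why the shuffle step can be skipped.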
Real‑time Capabilities
Stream Load provides second‑to‑minute micro‑batch ingestion.
Routine Load achieves sub‑second latency for Kafka‑to‑StarRocks pipelines.
Dynamic partitioning supports efficient data lifecycle management.
SQL online serving with MySQL‑compatible syntax simplifies data‑driven feature development.
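Dynamic partitioning can be pictured as a sliding window of partitions around the current date: partitions ahead of today are created in advance, and partitions that fall behind the window are dropped. The sketch below is a simplified model of that behavior for daily partitions, assuming a negative start offset, a positive end offset and a name made of a prefix plus the date. The function names and sample values are hypothetical, and the actual scheduler in StarRocks may differ in detail.

```python
from datetime import date, timedelta


def window_partitions(today, start, end, prefix="p"):
    """Daily partition names covering today+start through today+end, inclusive."""
    return [
        f"{prefix}{(today + timedelta(days=offset)).strftime('%Y%m%d')}"
        for offset in range(start, end + 1)
    ]


def plan_partitions(existing, today, start, end, prefix="p"):
    """Return (partitions to create, partitions to drop) for the current window."""
    wanted = window_partitions(today, start, end, prefix)
    to_create = [name for name in wanted if name not in existing]
    # Names share a prefix and a yyyymmdd suffix, so string order is date order.
    to_drop = sorted(name for name in existing if name < wanted[0])
    return to_create, to_drop


# Keep two days of history and pre-create one day ahead (illustrative values).
existing = {"p20240106", "p20240107", "p20240108", "p20240109"}
to_create, to_drop = plan_partitions(existing, date(2024, 1, 10), start=-2, end=1)
print(to_create, to_drop)
```

With a window like this, retention becomes a table setting rather than a scheduled cleanup job.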
Future Plans
Migrate tables from the duplicate‑key model to the aggregate model and materialized views to further reduce warehouse maintenance.
Expand StarRocks usage to user‑profile updates, behavior‑path analysis, and other real‑time data‑application scenarios.
Adopt a multi‑cloud architecture for flexible data‑warehouse migration, cost control and higher availability.
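To show why moving to an aggregate model can cut down on hand-built summary tables, the sketch below imitates what such a table does at load time: rows that share the same key columns are merged, with each value column combined by its declared function (SUM and MAX here). It is a conceptual illustration, not StarRocks code, and the column names and sample rows are hypothetical.

```python
def aggregate_rows(rows):
    """Merge rows of (day, user_id, design_count, last_active_hour).

    Key columns: (day, user_id).
    Value columns: design_count is summed, last_active_hour keeps the maximum.
    """
    merged = {}
    for day, user_id, design_count, last_active_hour in rows:
        key = (day, user_id)
        if key in merged:
            total, latest = merged[key]
            merged[key] = (total + design_count, max(latest, last_active_hour))
        else:
            merged[key] = (design_count, last_active_hour)
    return merged


# Hypothetical detail rows arriving from two separate loads.
detail_rows = [
    ("2024-01-10", 7, 2, 9),
    ("2024-01-10", 7, 3, 14),
    ("2024-01-10", 8, 1, 11),
]
summary = aggregate_rows(detail_rows)
print(summary)
```

Because the merge happens inside the storage layer, the simple aggregation ETL jobs described under Challenges no longer need their own output tables.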
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
