StarRocks at Shopee: Practical Use Cases and Performance Analysis
Shopee's deployment of StarRocks across DataService, DataGo, and DataStudio shows how its vectorized engine, cost-based optimizer, data cache, and materialized views speed up queries on Hive data: 10 to 20,000 times faster with materialized views on an external catalog, and 2 to 10 times faster than Presto on Hive, with about 60% less CPU in the DataGo tests.
This article introduces StarRocks (SR), a SQL query engine that provides data warehouse-level performance on data lakehouses. StarRocks offers powerful features including a vectorized execution engine, cost-based optimizer, data caching, and materialized views with transparent query rewriting capabilities. It supports direct querying of popular data lake table formats such as Hive, Iceberg, Delta Lake, and Hudi through its built-in catalog functionality.
At Shopee, StarRocks has been deployed across three major scenarios:
1. DataService - Building a Low-Cost, High-Speed External Data Lake: Using StarRocks materialized views (MVs) on an external Hive catalog to improve query performance. The solution extracts the CTE portion of the user's SQL and turns it into a StarRocks asynchronous materialized view statement. This approach speeds queries up by 10 to 20,000 times without requiring users to maintain real-time or near-real-time write pipelines from Hive to OLAP engines.
2. DataGo - Accelerating Table Joins and Insight Extraction: Leveraging StarRocks' multi-table JOIN capabilities with its cost-based optimizer specifically designed for vectorized execution. Performance testing shows StarRocks on Hive outperforms Presto on Hive by 3 to 10 times, while saving 60% CPU resources. The StarRocks cluster uses 10GB memory cache and 400GB disk cache, achieving high cache hit rates for business queries with relatively fixed patterns.
3. DataStudio - Replacing Presto for Faster Query Speed and Resource Savings: Using StarRocks on Hive as the query execution engine. With identical compute resources (400 cores + 27,000GB memory), StarRocks delivers 2 to 3 times better performance than Presto. The optimization is particularly effective for complex queries with nested subqueries and joins, which are typical characteristics of ad-hoc queries.
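The CTE-to-materialized-view step from the DataService scenario can be sketched roughly as follows. This is a minimal illustration, not Shopee's implementation: the helper names (`extract_first_cte`, `build_async_mv`), the `mv_` prefix, the sample table `hive_catalog.sales.orders`, and the 60-minute refresh interval are all assumptions made for the example. It handles only a single leading CTE and ignores parentheses inside string literals. The generated DDL follows the general shape of StarRocks' `CREATE MATERIALIZED VIEW ... REFRESH ASYNC` statement and should be checked against the documentation for the version in use.

```python
import re


def extract_first_cte(sql: str) -> tuple[str, str]:
    """Return (cte_name, cte_body) for the first CTE of a WITH query."""
    match = re.match(r"\s*WITH\s+(\w+)\s+AS\s*\(", sql, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("query does not start with a CTE")
    start = match.end()  # index just after the CTE's opening parenthesis
    depth = 1
    # Walk forward until the parenthesis that opened the CTE is closed.
    for i in range(start, len(sql)):
        if sql[i] == "(":
            depth += 1
        elif sql[i] == ")":
            depth -= 1
            if depth == 0:
                return match.group(1), sql[start:i].strip()
    raise ValueError("unbalanced parentheses in CTE")


def build_async_mv(mv_name: str, cte_body: str, refresh_minutes: int = 60) -> str:
    """Wrap a CTE body in an asynchronous materialized view statement."""
    # Hypothetical refresh policy; the source does not state the interval.
    return (
        f"CREATE MATERIALIZED VIEW {mv_name}\n"
        f"REFRESH ASYNC EVERY (INTERVAL {refresh_minutes} MINUTE)\n"
        f"AS {cte_body}"
    )


# Hypothetical user query against a Hive external catalog.
user_sql = """
WITH daily_orders AS (
    SELECT shop_id, COUNT(*) AS order_cnt
    FROM hive_catalog.sales.orders
    GROUP BY shop_id
)
SELECT shop_id FROM daily_orders WHERE order_cnt > 100
"""

name, body = extract_first_cte(user_sql)
print(build_async_mv("mv_" + name, body))
```

Once such a view exists and has been refreshed, the transparent query rewriting described above is what would let the original user query benefit without being rewritten by hand.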
The article provides detailed benchmark results comparing StarRocks and Presto, including p90 and p99 latency measurements under various concurrency levels (1, 5, 20, 40, and 50 threads). The conclusion emphasizes that StarRocks is ideal for data analytics requirements, offering high-speed external data lake queries, optimized multi-table JOIN performance, and superior execution speed compared to Presto.
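For readers who want to reproduce this kind of measurement, the sketch below shows one way to collect p90 and p99 latencies at the concurrency levels listed above. It is an illustrative harness, not the one used in the article: `run_query` is a placeholder callable (here a short sleep) standing in for a real query against StarRocks or Presto, `queries_per_thread` is an assumed parameter, and the percentile uses the nearest-rank definition, which may differ from the method behind the published numbers.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def timed_call(run_query):
    """Run one query and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start


def benchmark(run_query, threads, queries_per_thread=10):
    """Issue queries from `threads` concurrent workers and summarize latency."""
    total = threads * queries_per_thread
    with ThreadPoolExecutor(max_workers=threads) as pool:
        latencies = list(pool.map(lambda _: timed_call(run_query), range(total)))
    return {"p90": percentile(latencies, 90), "p99": percentile(latencies, 99)}


if __name__ == "__main__":
    # Placeholder workload; replace with a real client call to the engine under test.
    def fake_query():
        time.sleep(0.01)

    for level in (1, 5, 20, 40, 50):
        stats = benchmark(fake_query, threads=level, queries_per_thread=4)
        print(f"threads={level} p90={stats['p90']:.4f}s p99={stats['p99']:.4f}s")
```

Reporting tail percentiles rather than averages matters here because a few slow queries under high concurrency can dominate the user experience while barely moving the mean.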
Shopee Tech Team
How to innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team will explore cutting‑edge technology concepts and applications with you.