Cutting Costs 40% and Halving Query Latency: Our ClickHouse‑to‑StarRocks Migration
Facing high costs and scaling limits with ClickHouse, we migrated a 4000‑core, 500TB OLAP workload to StarRocks, achieving 40% cost reduction, 50% storage savings, and up to 30× query speedups through storage‑compute separation, materialized‑view rewrites, and extensive performance tuning.
Background
DeWu’s smart‑operation platform relies on OLAP engines for customer service, risk control, supply chain, advertising, and A/B testing. Historically it operated five different engines—Alibaba Cloud Hologres, ADB, ClickHouse, and two self‑hosted ClickHouse/StarRocks clusters—leading to high operational overhead.
ClickHouse Challenges at Scale
Even though ClickHouse offers excellent single‑node performance, its distributed architecture struggled with horizontal scalability, metadata management, data consistency, and join performance. The migrated ClickHouse cluster served over 4000 CPU cores, stored more than 500 TB, and suffered from:
Opaque materialized‑view rewriting (no automatic view substitution).
Lack of bulk‑load capability, causing inefficient offline imports.
Vertical scaling ceiling and week‑long downtime for horizontal resharding.
To mitigate these issues, DeWu built a smaller standby cluster for hot data, but this added >50% extra cost.
Migration to StarRocks
In 2024, DeWu initiated a proof‑of‑concept to replace ClickHouse with a self‑hosted StarRocks cluster that supports storage‑compute separation. The migration involved:
Deploying a 4000‑core StarRocks cluster with compute nodes (CN) and remote object storage (OSS) for data.
Gradually routing DSL‑translated queries to StarRocks via the ONE DSL platform, enabling gray‑scale rollout and instant rollback.
Developing a ClickHouse‑to‑StarRocks DDL conversion tool to automate table and materialized‑view definitions.
Key architectural change: data resides in OSS (Distributed Storage) while compute nodes cache only hot data, allowing a single‑replica mode that reduces storage cost by two‑thirds.
Cost and Performance Benefits
After migration:
Overall infrastructure cost dropped 40% (AP engine) and storage cost fell to one‑sixth of the original.
Query latency halved: P95 of the main dashboard fell from 8.5 s to 4.3 s; high‑load queries dropped from 24.38 s to 11.94 s.
Cluster‑wide outage time shrank from 15 minutes to 30 seconds.
Storage‑compute separation also yielded a 2/3 reduction in offline import time and a 40%+ overall cost decrease.
Materialized‑View Enhancements
StarRocks introduced transparent materialized‑view rewriting, eliminating the need for external code to manage view creation and query rewriting. Additional optimizations include:
Limiting queries that span more than eight days on large tables.
Forcing rewrite mode (materialized_view_rewrite_mode='force') to hit views for multiple count‑distinct aggregations.
Automatic recommendation of materialized views based on FE‑captured SQL patterns, with statistics on field usage to suggest optimal sort keys.
Performance gains from view‑related fixes:
Improved partition‑field pruning (PR #46786).
Optimized implicit type conversions, delivering 20× speedup (PR #50168).
Enhanced date/datetime handling, achieving 30× speedup (PR #50643).
Reduced RPC calls to StarOS, cutting query time by >3 seconds (PR #46913).
Offline Import Improvements
Instead of ClickHouse’s single‑node insert, StarRocks loads data via OSS broker load, allowing controlled concurrency and reducing cluster resource consumption. Benchmarks show a two‑thirds reduction in import latency.
brand_id IN (SELECT brand_id FROM olap_znyy.st_itg_spu_info_all WHERE brand_level_name IN ('A级品牌','B级品牌','C级品牌'))
-- rewritten to
brand_id IN (SELECT DISTINCT brand_id FROM olap_znyy.st_itg_spu_info_all WHERE brand_level_name IN ('A级品牌','B级品牌','C级品牌'))Compatibility Layer
StarRocks added a ClickHouse dialect (sql_dialect=clickhouse) that maps 72 common ClickHouse functions to StarRocks equivalents, preserving query semantics without user changes. It also implements ClickHouse‑style ArrayJoin via unnest and supports alias reuse in SELECT clauses.
Operational and Observability Enhancements
New APIs were introduced to expose resource watermarks, SQL complexity factors, materialized‑view metadata, and asynchronous query cancellation (custom_query_id). These aid monitoring and troubleshooting at scale.
Results and Conclusion
The migration met all cost and performance targets, providing a scalable foundation for future growth. DeWu contributed 28 PRs (24 merged) to the StarRocks open‑source project, fixing over 40 issues and adding 10+ features, reinforcing the community partnership.
Overall, the case demonstrates that storage‑compute separation, transparent materialized‑view rewriting, and careful query tuning can dramatically reduce both operational expense and latency for large‑scale OLAP workloads.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
