How StarRocks’ Compute‑Storage Separation Cut Costs 46% and Boosted Performance
This article details a Chinese tech company's migration of its internal big‑data analytics platform to StarRocks’ compute‑storage separation architecture, describing the original multi‑component setup, the pain points encountered, the evaluation methodology, performance and cost benchmarks, operational optimizations, migration steps, and future roadmap.
Background
The internal big‑data platform supports four business scenarios: user profiling, reporting, experimentation, and service monitoring. Data flow: Kafka → Flink → OSS (object storage) → Hive/Spark for further processing → multiple OLAP engines (Trino, ClickHouse, StarRocks). Over time the component count grew, leading to high maintenance overhead and operational complexity.
Pain Points
High maintenance cost: three OLAP systems require different query languages, monitoring tools, and skill sets.
Long latency: end‑to‑end ODS processing takes ~3 hours; occasional re‑runs increase it further.
Stability issues: Trino experiences memory‑overflow failures (~10% query failure rate) under heavy concurrency; ClickHouse suffers CPU saturation.
StarRocks Compute‑Storage Separation Evaluation
StarRocks 3.0 introduced a compute‑storage separation mode. The evaluation focused on five dimensions:
Query efficiency – no noticeable slowdown compared with the integrated mode.
Cost reduction – moving data to object storage dramatically lowers storage expense.
Seamless replacement – ability to replace Trino and ClickHouse without major refactoring.
Operational simplicity – reduced DevOps workload via Kubernetes Operator and built‑in monitoring.
Community activity – active open‑source community for timely issue resolution.
Performance Comparison
Two clusters of comparable raw resources were benchmarked:
ClickHouse: 1 node × 96 CPU × 384 GB RAM
StarRocks: 6 nodes × 16 CPU × 64 GB RAM per node
Workloads:
Single‑table query (200 GB, 130 M rows) – count aggregation.
Multi‑table join query – self‑join followed by count.
Results: StarRocks matched ClickHouse on the single‑table benchmark and was approximately three times faster on the join benchmark.
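The two benchmark workloads can be sketched roughly as follows; the table and column names are illustrative, not taken from the original benchmark:

```sql
-- Single-table workload: count aggregation over the ~130 M-row table.
SELECT event_type, COUNT(*) AS cnt
FROM events
GROUP BY event_type;

-- Multi-table workload: a self-join followed by a count.
SELECT COUNT(*)
FROM events a
JOIN events b ON a.user_id = b.user_id
WHERE a.event_type = 'click'
  AND b.event_type = 'purchase';
```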
Cost Reduction
Storage cost comparison (per TB per day): standard cloud disk ≈ $7, object storage ≈ $0.48, i.e. roughly 1/15 of the original expense (7 / 0.48 ≈ 14.6). By replacing ClickHouse and Trino with StarRocks and moving data to OSS, overall platform cost dropped by 46%.
Usability and Operational Enhancements
StarRocks provides a Kubernetes Operator for one‑click cluster deployment and automatic failover. Built‑in monitoring integrates with Prometheus and Grafana, exposing FE/BE metrics and I/O statistics.
Local disk cache: each compute node is equipped with two 200 GB SSDs. Hot data is cached locally; LRU eviction prevents out‑of‑space errors.
Data bucketing was tuned to 1‑3 GB per bucket to limit the number of tablets and avoid FE memory pressure.
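As a sketch of that bucketing guideline, assume a hypothetical ~200 GB table: targeting ~2 GB per bucket yields on the order of 100 buckets, which keeps the tablet count, and hence FE metadata pressure, modest. All names here are illustrative:

```sql
-- ~200 GB of data / ~2 GB per bucket ≈ 100 buckets (within the 1-3 GB guideline).
CREATE TABLE events (
    dt         DATE,
    user_id    BIGINT,
    event_type VARCHAR(32),
    payload    STRING
)
DUPLICATE KEY(dt, user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 100;
```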
Query Optimization
Materialized views: in a real‑time analysis scenario, raw detail queries took ~30 s. After creating an asynchronous materialized view for pre‑aggregation, query latency dropped to ~3 s (≈10× speed‑up).
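A minimal sketch of such an asynchronous, pre‑aggregating materialized view; the table, view, and column names are hypothetical and the refresh interval is illustrative:

```sql
-- Async MV that pre-aggregates the detail table; StarRocks can transparently
-- rewrite matching aggregate queries against it.
CREATE MATERIALIZED VIEW mv_event_stats
REFRESH ASYNC EVERY (INTERVAL 10 MINUTE)
AS
SELECT dt, event_type,
       COUNT(*) AS pv,
       COUNT(DISTINCT user_id) AS uv
FROM events
GROUP BY dt, event_type;
```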
Aggregation model: previously, Spark pre‑aggregated data into result tables before loading them into StarRocks, causing data duplication. Switching to StarRocks’ native aggregation model eliminated the extra storage and preserved query performance.
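A sketch of what a native aggregation-model table looks like, with hypothetical names: StarRocks merges rows sharing the same AGGREGATE KEY at load and compaction time, so the separate Spark pre‑aggregation job becomes unnecessary:

```sql
-- Rows with the same (dt, event_type) are rolled up automatically:
-- pv columns are summed, uv sketches are HLL-unioned.
CREATE TABLE event_stats (
    dt         DATE,
    event_type VARCHAR(32),
    pv         BIGINT SUM DEFAULT "0",
    uv         HLL HLL_UNION
)
AGGREGATE KEY(dt, event_type)
DISTRIBUTED BY HASH(event_type) BUCKETS 16;
```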
Cache configuration: all tables were created with ENABLE_LOCAL_DISK_CACHE = true. With six BE nodes each holding two 200 GB SSDs, the majority of hot data resides on local disks, and automatic LRU eviction manages space contention.
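A sketch of such a table definition, using the property name as described above; note that the exact property differs across StarRocks versions (for example, newer shared‑data releases document "datacache.enable"), so check the documentation for the version in use:

```sql
-- Enable the local disk cache for hot data in shared-data (lake) mode;
-- property name per the setup described above, may vary by version.
CREATE TABLE events_cached (
    dt         DATE,
    user_id    BIGINT,
    event_type VARCHAR(32)
)
DUPLICATE KEY(dt, user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 32
PROPERTIES (
    "enable_local_disk_cache" = "true"
);
```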
Monitoring and Stability
Prometheus‑Grafana dashboards cover FE/BE CPU, memory, I/O, and compaction metrics.
Compaction status can be queried via SQL, enabling early detection of slow compaction or resource bottlenecks.
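One way to inspect compaction status via SQL, assuming a recent shared‑data release (the exact interface varies by StarRocks version):

```sql
-- Lists per-partition compaction jobs, scores, and progress.
SHOW PROC '/compactions';
```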
Data Migration Strategy
Because a one‑click migration tool is not yet available, data were exported from the existing clusters to OSS and then ingested into the new StarRocks cluster using Broker Load. To reduce I/O contention between load jobs and online queries, concurrency was limited and imports were batched. Over 80% of production data have been migrated successfully.
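A hedged sketch of such a Broker Load job pulling exported files from OSS; the label, paths, credentials, and timeout are placeholders:

```sql
-- Batch-import one exported dataset from OSS into the new cluster.
LOAD LABEL analytics.migrate_events_batch01
(
    DATA INFILE("oss://bucket/export/events/*")
    INTO TABLE events
    FORMAT AS "parquet"
)
WITH BROKER
(
    "fs.oss.accessKeyId"     = "<access_key>",
    "fs.oss.accessKeySecret" = "<secret_key>",
    "fs.oss.endpoint"        = "<endpoint>"
)
PROPERTIES
(
    "timeout" = "14400"
);
```

Submitting such jobs in small batches, rather than all at once, is what kept load I/O from contending with online queries.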
Future Plans
Short‑term
Move ODS processing into Flink and write directly to a lake (e.g., Apache Hudi or Iceberg), eliminating the 2‑3 hour ODS latency.
Long‑term
Adopt a lake‑house architecture where StarRocks serves as the sole compute layer, achieving:
Resource isolation per business line.
Elimination of duplicate storage (single OSS layer).
Elastic scaling of compute nodes via Kubernetes.
Overall, the migration to StarRocks compute‑storage separation delivered comparable query performance, up to three‑fold speed‑up on join workloads, 46% cost reduction, and a simpler operational model, positioning the platform for scalable, cost‑effective analytics.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.