How StarRocks’ Compute‑Storage Separation Cut Costs 46% and Boosted Performance
This article details a Chinese tech company's migration of its internal big‑data analytics platform to StarRocks’ compute‑storage separation architecture, describing the original multi‑component setup, the pain points encountered, the evaluation methodology, performance and cost benchmarks, operational optimizations, migration steps, and future roadmap.
Background
The internal big‑data platform supports four business scenarios: user profiling, reporting, experimentation, and service monitoring. Data flow: Kafka → Flink → OSS (object storage) → Hive/Spark for further processing → multiple OLAP engines (Trino, ClickHouse, StarRocks). Over time the component count grew, leading to high maintenance overhead and operational complexity.
Pain Points
High maintenance cost: three OLAP systems require different query languages, monitoring tools, and skill sets.
Long latency: end‑to‑end ODS processing takes ~3 hours; occasional re‑runs increase it further.
Stability issues: Trino experiences memory‑overflow failures (~10% query failure rate) under heavy concurrency; ClickHouse suffers CPU saturation.
StarRocks Compute‑Storage Separation Evaluation
StarRocks 3.0 introduced a compute‑storage separation mode. The evaluation focused on five dimensions:
Query efficiency – no noticeable slowdown compared with the integrated mode.
Cost reduction – moving data to object storage dramatically lowers storage expense.
Seamless replacement – ability to replace Trino and ClickHouse without major refactoring.
Operational simplicity – reduced DevOps workload via Kubernetes Operator and built‑in monitoring.
Community activity – active open‑source community for timely issue resolution.
Performance Comparison
Two clusters of comparable raw resources were benchmarked:
ClickHouse: 1 node × 96 CPU × 384 GB RAM
StarRocks: 6 nodes × 16 CPU × 64 GB RAM per node
Workloads:
Single‑table query (200 GB, 130 M rows) – count aggregation.
Multi‑table join query – self‑join followed by count.
Results: StarRocks matched ClickHouse on the single‑table benchmark and was approximately three times faster on the join benchmark.
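The two benchmark workloads can be sketched roughly as follows; the table and column names are illustrative, not taken from the original benchmark:

```sql
-- Single-table workload: count aggregation over the ~130 M-row table.
SELECT event_type, COUNT(*) AS cnt
FROM events
GROUP BY event_type;

-- Multi-table workload: a self-join followed by a count.
SELECT COUNT(*)
FROM events a
JOIN events b ON a.user_id = b.user_id
WHERE a.event_type = 'click'
  AND b.event_type = 'purchase';
```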
Cost Reduction
Storage cost comparison (per TB per day): standard cloud disk ≈ $7, object storage ≈ $0.48, i.e. roughly 1/15 of the original expense (7 / 0.48 ≈ 14.6). By replacing ClickHouse and Trino with StarRocks and moving data to OSS, overall platform cost dropped by 46%.
Usability and Operational Enhancements
StarRocks provides a Kubernetes Operator for one‑click cluster deployment and automatic failover. Built‑in monitoring integrates with Prometheus and Grafana, exposing FE/BE metrics and I/O statistics.
Local disk cache: each compute node is equipped with two 200 GB SSDs. Hot data is cached locally; LRU eviction prevents out‑of‑space errors.
Data bucketing was tuned to 1‑3 GB per bucket to limit the number of tablets and avoid FE memory pressure.
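As a sketch of that bucketing guideline, assume a hypothetical ~200 GB table: targeting ~2 GB per bucket yields on the order of 100 buckets, which keeps the tablet count, and hence FE metadata pressure, modest. All names here are illustrative:

```sql
-- ~200 GB of data / ~2 GB per bucket ≈ 100 buckets (within the 1-3 GB guideline).
CREATE TABLE events (
    dt         DATE,
    user_id    BIGINT,
    event_type VARCHAR(32),
    payload    STRING
)
DUPLICATE KEY(dt, user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 100;
```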
Query Optimization
Materialized views: in a real‑time analysis scenario, raw detail queries took ~30 s. After creating an asynchronous materialized view for pre‑aggregation, query latency dropped to ~3 s (≈10× speed‑up).
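A minimal sketch of such an asynchronous, pre‑aggregating materialized view; the table, view, and column names are hypothetical and the refresh interval is illustrative:

```sql
-- Async MV that pre-aggregates the detail table; StarRocks can transparently
-- rewrite matching aggregate queries against it.
CREATE MATERIALIZED VIEW mv_event_stats
REFRESH ASYNC EVERY (INTERVAL 10 MINUTE)
AS
SELECT dt, event_type,
       COUNT(*) AS pv,
       COUNT(DISTINCT user_id) AS uv
FROM events
GROUP BY dt, event_type;
```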
Aggregation model: previously, Spark pre‑aggregated data into result tables before loading them into StarRocks, causing data duplication. Switching to StarRocks’ native aggregation model eliminated the extra storage and preserved query performance.
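A sketch of what a native aggregation-model table looks like, with hypothetical names: StarRocks merges rows sharing the same AGGREGATE KEY at load and compaction time, so the separate Spark pre‑aggregation job becomes unnecessary:

```sql
-- Rows with the same (dt, event_type) are rolled up automatically:
-- pv columns are summed, uv sketches are HLL-unioned.
CREATE TABLE event_stats (
    dt         DATE,
    event_type VARCHAR(32),
    pv         BIGINT SUM DEFAULT "0",
    uv         HLL HLL_UNION
)
AGGREGATE KEY(dt, event_type)
DISTRIBUTED BY HASH(event_type) BUCKETS 16;
```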
Cache configuration: all tables were created with ENABLE_LOCAL_DISK_CACHE = true. With six BE nodes each holding two 200 GB SSDs, the majority of hot data resides on local disks, and automatic LRU eviction manages space contention.
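A sketch of such a table definition, using the property name as described above; note that the exact property differs across StarRocks versions (for example, newer shared‑data releases document "datacache.enable"), so check the documentation for the version in use:

```sql
-- Enable the local disk cache for hot data in shared-data (lake) mode;
-- property name per the setup described above, may vary by version.
CREATE TABLE events_cached (
    dt         DATE,
    user_id    BIGINT,
    event_type VARCHAR(32)
)
DUPLICATE KEY(dt, user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 32
PROPERTIES (
    "enable_local_disk_cache" = "true"
);
```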
Monitoring and Stability
Prometheus‑Grafana dashboards cover FE/BE CPU, memory, I/O, and compaction metrics.
Compaction status can be queried via SQL, enabling early detection of slow compaction or resource bottlenecks.
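One way to inspect compaction status via SQL, assuming a recent shared‑data release (the exact interface varies by StarRocks version):

```sql
-- Lists per-partition compaction jobs, scores, and progress.
SHOW PROC '/compactions';
```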
Data Migration Strategy
Because a one‑click migration tool is not yet available, data were exported from the existing clusters to OSS and then ingested into the new StarRocks cluster using Broker Load. To reduce I/O contention between load jobs and online queries, concurrency was limited and imports were batched. Over 80% of production data have been migrated successfully.
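A hedged sketch of such a Broker Load job pulling exported files from OSS; the label, paths, credentials, and timeout are placeholders:

```sql
-- Batch-import one exported dataset from OSS into the new cluster.
LOAD LABEL analytics.migrate_events_batch01
(
    DATA INFILE("oss://bucket/export/events/*")
    INTO TABLE events
    FORMAT AS "parquet"
)
WITH BROKER
(
    "fs.oss.accessKeyId"     = "<access_key>",
    "fs.oss.accessKeySecret" = "<secret_key>",
    "fs.oss.endpoint"        = "<endpoint>"
)
PROPERTIES
(
    "timeout" = "14400"
);
```

Submitting such jobs in small batches, rather than all at once, is what kept load I/O from contending with online queries.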
Future Plans
Short‑term
Move ODS processing into Flink and write directly to a lake (e.g., Apache Hudi or Iceberg), eliminating the 2‑3 hour ODS latency.
Long‑term
Adopt a lake‑house architecture where StarRocks serves as the sole compute layer, achieving:
Resource isolation per business line.
Elimination of duplicate storage (single OSS layer).
Elastic scaling of compute nodes via Kubernetes.
Overall, the migration to StarRocks compute‑storage separation delivered comparable query performance, up to three‑fold speed‑up on join workloads, 46% cost reduction, and a simpler operational model, positioning the platform for scalable, cost‑effective analytics.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.