
How StarRocks Boosted Mango TV’s Data Platform Performance by Over 10×

Mango TV replaced its fragmented EMR‑Hive‑Kudu‑Presto stack with a unified StarRocks lakehouse, simplifying architecture, cutting operational costs, and achieving more than a ten‑fold increase in query speed while supporting real‑time analytics, materialized views, bitmap indexing, and store‑compute separation.


Original Architecture and Pain Points

The legacy smart‑operation platform used an EMR cloud cluster: Hive stored historical data, Kudu stored real‑time data, and Presto served as a federated query engine. As business complexity grew, the architecture became fragmented, maintenance costs rose, and query performance degraded, especially under high concurrency.

Selection Criteria for a New Engine

High stability: Must guarantee high availability before optimizing performance.

Simple architecture, low maintenance: Avoid adding heavy components such as Iceberg or Hudi.

Efficient multi‑table joins: Join performance should surpass ClickHouse for complex queries.

Federated query support: Ability to query across many Hive tables without a massive migration effort.

Overall query efficiency: Reduce the long query latencies reported by product and operations teams.

Self‑contained ecosystem: Minimal reliance on external services such as Hadoop or ZooKeeper.

Storage‑compute separation: Leverage object storage to lower storage costs and enable elastic scaling.

Adopting StarRocks

After a Q1 2023 evaluation of multiple engines, StarRocks was chosen for its stability, high query speed, and native support for storage‑compute separation, making it the core engine of the new lakehouse architecture.

Implementation Details

Data flow:

Behavioral logs are collected via Flume, sent to OSS and Kafka. After cleaning, OSS data is loaded into Hive for offline analysis.

Kafka streams are processed by Flink, then written to Kudu for real‑time access.

Business MySQL data is synchronized to Kudu through an in‑house data sync platform.

Presto previously federated queries across Hive, Kudu, and MySQL.

With StarRocks, the pipeline changed to:

Flink cleans behavior data and writes it to a Kafka topic, which is then ingested into StarRocks via Routine Load (a sketch of such a job follows after this list).

Business data is loaded into StarRocks using Stream Load for full loads and Flink‑Canal for incremental binlog updates.
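A minimal sketch of what such a Routine Load job could look like is shown below; the database, table, topic, and column names are illustrative assumptions rather than Mango TV's actual configuration.

-- Hypothetical job: continuously consume cleaned behavior events from Kafka
-- and append them to a StarRocks table named `event`.
CREATE ROUTINE LOAD example_db.load_behavior_events ON event
COLUMNS TERMINATED BY ",",
COLUMNS (did, `date`, event_type, play_duration)
PROPERTIES (
    "desired_concurrent_number" = "3",
    "max_error_number" = "1000"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "behavior_cleaned",
    "property.kafka_default_offsets" = "OFFSET_END"
);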

Key technical configurations:

#1 Import data into the auto‑increment ID table
-- Ingest via Routine Load with "partial_update" = "true" (partial column update),
-- so only `did` is loaded and the AUTO_INCREMENT `id` is generated automatically

#2 Table definition for auto_user
CREATE TABLE `auto_user` (
  `did` varchar(64) NOT NULL COMMENT "",
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT "auto-increment id"
) ENGINE=OLAP
PRIMARY KEY(`did`)
COMMENT "用户自增ID表"
DISTRIBUTED BY HASH(`did`) BUCKETS 6;

#3 Bitmap table for daily active users
CREATE TABLE `active_user` (
  `date` int(11) NOT NULL COMMENT "",
  `bm_ids` bitmap BITMAP_UNION NOT NULL COMMENT ""
) ENGINE=OLAP
AGGREGATE KEY(`date`)
COMMENT "日活跃用户bitmap表"
DISTRIBUTED BY HASH(`date`) BUCKETS 3;

#4 Insert data into bitmap table
INSERT INTO active_user SELECT `date`, to_bitmap(b.id) FROM event a LEFT JOIN auto_user b ON a.did=b.did WHERE `date`=20230501;

#5 Query intersected bitmap for a week
SELECT bitmap_count(bitmap_intersect(bm_ids)) FROM active_user WHERE `date`>=20230501 AND `date`<=20230507;

Practical Experience

Real‑time sync of MySQL data: Used the primary key model with persistent indexes to avoid memory overflow (OOM) on BE nodes.
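As an illustration of that setting, the persistent index is enabled per table at creation time; the table and column names below are assumptions for the sketch, not the original schema.

-- Hypothetical primary key table for MySQL-synced business data.
-- "enable_persistent_index" keeps the primary key index on disk rather than
-- entirely in BE memory, which avoids OOM on large, frequently updated tables.
CREATE TABLE `order_rt` (
  `order_id` bigint NOT NULL COMMENT "",
  `status` int NOT NULL COMMENT "",
  `amount` decimal(18, 2) NOT NULL COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(`order_id`)
DISTRIBUTED BY HASH(`order_id`) BUCKETS 6
PROPERTIES ("enable_persistent_index" = "true");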

Materialized views: Cut long‑running queries (tens of seconds to minutes) down to sub‑second latency without changing the queries themselves.
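A single-table materialized view along the lines of the hypothetical sketch below is enough for StarRocks to rewrite matching aggregate queries transparently; the table and columns are assumed for illustration.

-- Hypothetical synchronous materialized view that pre-aggregates playback
-- duration per day and channel; queries on `event` with a matching GROUP BY
-- are rewritten to read from it automatically, with no SQL changes.
CREATE MATERIALIZED VIEW event_daily_duration AS
SELECT `date`, channel, SUM(play_duration)
FROM event
GROUP BY `date`, channel;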

Aggregation models: Ingested raw data into aggregate tables for millisecond‑level real‑time statistics.
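The pattern looks roughly like the sketch below (metric columns are assumptions): rows are rolled up at ingestion time, so dashboards read pre-aggregated data instead of raw events.

-- Hypothetical aggregate table: play counts and durations are summed on write,
-- keeping real-time reads at millisecond latency.
CREATE TABLE `play_stats_agg` (
  `date` int NOT NULL COMMENT "",
  `channel` varchar(32) NOT NULL COMMENT "",
  `play_cnt` bigint SUM NOT NULL DEFAULT "0" COMMENT "",
  `play_duration` bigint SUM NOT NULL DEFAULT "0" COMMENT ""
) ENGINE=OLAP
AGGREGATE KEY(`date`, `channel`)
DISTRIBUTED BY HASH(`date`) BUCKETS 3;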

Bitmap indexing: Enabled second‑level selection over massive user‑tag sets; string UUIDs were converted to numeric IDs so bitmap functions could be applied.

Query cache: StarRocks' per‑tablet query cache reduced query time by over 50% compared with the previous custom cache layer.
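Assuming a StarRocks version that ships the query cache (2.5 or later), turning it on is a variable setting rather than a separate caching service:

-- Enable StarRocks' built-in query cache (per-tablet reuse of partial
-- aggregation results) for all new sessions.
SET GLOBAL enable_query_cache = true;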

Compression format: Adopted Zstd after benchmarking; it offered ~20% better compression ratio with only ~5% query/write overhead.
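Compression is declared per table; a minimal sketch with an assumed table name and columns:

-- Hypothetical table using zstd compression instead of the default LZ4,
-- trading a little CPU for a smaller on-disk footprint.
CREATE TABLE `event_zstd` (
  `date` int NOT NULL COMMENT "",
  `did` varchar(64) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`date`, `did`)
DISTRIBUTED BY HASH(`did`) BUCKETS 6
PROPERTIES ("compression" = "ZSTD");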

Compaction tuning: Adjusted LSM‑tree compaction parameters to balance write throughput and query latency.

Performance Testing

Benchmarks using the Star Schema Benchmark (SSB) and real‑world multi‑table join queries showed consistent performance gains over the previous Presto‑based solution, with query times reduced by an order of magnitude.

Future Plans

StarRocks 3.0 introduced native storage‑compute separation, positioning it as a cloud‑native lakehouse solution. The team plans to build a cloud‑native data lake‑warehouse on top of StarRocks, leveraging its elastic scaling, cost‑effective object storage, and high availability.

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
