How We Built a Fast, Flexible OLAP Platform with StarRocks at YY
This article details how YY's data team replaced ClickHouse with StarRocks to create a unified, high‑performance OLAP platform, covering architecture design, engine selection, performance benchmarks, resource isolation, monitoring, user‑friendly tooling, and future roadmap for large‑scale analytics.
Background
Global expansion of a social‑media platform introduced heterogeneous user behavior, payment channels, logistics, and compliance requirements. Existing OLAP infrastructure could not satisfy the need for flexible, low‑cost analytics across multiple regions, prompting a redesign of the analytics stack.
Data Platform Architecture
The platform covers the full data lifecycle—from event collection to downstream analytics—and is organized into five logical layers:
Data integration (ingestion, routing)
Storage (HDFS, object storage)
Computation (batch and streaming)
Analysis (OLAP engine)
Application (BI, self‑service analytics)
The OLAP system sits in the analysis layer and serves ad‑hoc queries, multi‑dimensional analysis, BI reports, and user‑profile generation.
OLAP Engine Requirements
Prioritize flexibility over raw performance.
Single‑engine solution that can handle batch and streaming ingestion.
Low operational cost with easy horizontal scaling.
Compatibility with the broader big‑data ecosystem (Hadoop, Kafka, Flink, DataX, etc.).
Support ROLAP/MOLAP, wide‑table/star‑schema models, PB‑scale data, sub‑second latency.
Write modes: Append, Overwrite, Upsert, Delete.
High QPS and ability to query Hadoop‑resident data.
Evaluation of Candidates
Three typical OLAP architectures were examined:
Pre‑compute (Apache Kylin, Apache Druid) – excellent query speed but limited flexibility.
MPP (Presto, Apache Impala, SparkSQL) – flexible but generally minute‑level latency.
Index‑based (Elasticsearch, ClickHouse) – strong single‑table performance, poor multi‑table joins.
Because a single architecture could not meet all requirements, a hybrid engine combining pre‑compute, MPP, and indexing capabilities was sought. Among mature options, Apache Doris and StarRocks were considered; StarRocks was chosen for its active community and rapid feature delivery.
Benchmark Results
Test environment: 12 billion rows, three servers each with 32 CPU cores, 128 GB RAM, SSD storage, running TPCH queries.
StarRocks achieved 4‑10× higher throughput than ClickHouse on typical workloads. Key differences:
Disaster recovery: StarRocks provides automatic data balancing, online scaling, and built‑in backup/restore; ClickHouse requires manual procedures.
Flexibility: StarRocks handles both single‑table and multi‑table queries efficiently; ClickHouse struggles with joins.
Operations: StarRocks has a simpler architecture without a Zookeeper dependency, reducing maintenance complexity.
Post‑Migration OLAP State
After migration the OLAP layer became lightweight and tightly integrated with upstream and downstream services. StarRocks supports multiple ingestion paths:
HTTP stream load
Broker load from HDFS
Routing load from messaging systems (e.g., Kafka)
Flink connector
DataX
External tables (Hive)
All queries use the MySQL JDBC protocol, enabling seamless BI integration. The platform now serves hundreds of terabytes, dozens of business units, and millions of daily queries with a 99th‑percentile latency of ~200 ms.
Resource Isolation
High‑cost BI dashboards caused resource contention, degrading latency‑sensitive queries. Starting with StarRocks 2.2, resource groups with configurable CPU/memory quotas were introduced. A controlled experiment compared concurrent users with and without resource groups, demonstrating that large queries can be isolated from small, latency‑critical queries. Note that CPU quotas are soft limits; performance still degrades when the cluster is fully saturated.
Monitoring and Alerting
Metrics are collected via Prometheus + Grafana from Front‑End (FE) and Back‑End (BE) nodes. An audit‑log pipeline parses audit.log, extracts the execution plan, and aggregates CPU, memory, row count, and scanned data size to proactively detect slow queries.
Operational Practices
The team follows a “no internal fork” policy, deploying community releases and cherry‑picking critical fixes. Contributions include issue reporting, pull‑requests, and co‑development of resource‑isolation features with the StarRocks community.
Future Plans
Implement table‑creation audit to enforce proper partitioning and bucketing.
Analyze query patterns and create materialized views for hot reports.
Optimize multi‑table joins using colocation joins to reduce network latency.
Collaborate with the StarRocks community on further resource‑isolation enhancements.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
