Unifying OLAP with StarRocks on Alibaba Cloud EMR: Waterdrop’s Real‑World Journey
This article details Waterdrop's transition from ClickHouse and TiDB to a unified StarRocks‑based OLAP platform on Alibaba Cloud EMR, covering architecture evolution, performance benchmarks, best‑practice recommendations, and future plans for large‑scale real‑time analytics.
Abstract
Han Yuanyuan, a data development engineer in Waterdrop's big data department, shares practical experience of running StarRocks on Alibaba Cloud EMR.
Company Overview
Waterdrop was founded in 2016, offering services such as Waterdrop Crowdfunding and an insurance marketplace. It went public on May 7, 2021, and aims to provide affordable health‑care solutions for billions of families.
From July 2016 to the end of 2022, the platform attracted about 430 million donors, helped over 2.77 million patients, raised 56.9 billion CNY in medical funds, and offered 755 insurance products.
StarRocks Overview
Development timeline:
2018 – Introduced ClickHouse for monitoring and user‑behavior analysis.
2020 – Adopted TiDB for OLAP analysis and reporting.
2021 – Deployed a self‑built StarRocks v1.17.8 cluster for OLAP.
Feb 2022 – Upgraded to StarRocks v1.19.5 for reporting.
Oct 2022 – Migrated the self‑built cluster to Alibaba Cloud EMR StarRocks (v2.3.2) and moved all TiDB services to StarRocks.
Mar 2023 – Joined the Alibaba Cloud EMR Serverless StarRocks public test and applied new features to business scenarios.
Current Situation & Technology Selection
Waterdrop evaluated ClickHouse, TiDB, and StarRocks on four criteria: concurrency, materialized‑view support, join capability, and real‑time write performance. StarRocks excelled in all four, while ClickHouse lagged in concurrency and joins, and TiDB lacked materialized‑view support.
Consequently, StarRocks was chosen as the unified OLAP engine, and TiDB services were migrated.
Scenario 1: Report Platform OLAP Engine Unification
The reporting platform originally used TiDB and later added StarRocks. Running both meant more components to maintain and higher cost, and the platform was still exposed to TiDB's concurrency limits. After all data was migrated to StarRocks, performance tests showed:
Queries that finished within 400 ms on TiDB completed within 200 ms on StarRocks.
Queries taking 400 ms to 1.5 s on TiDB finished in 184 ms to 300 ms on StarRocks.
Queries taking 1.5 s to 4 s on TiDB completed in 198 ms to 500 ms on StarRocks.
The unified architecture reduced operational cost by 58% and improved overall performance by 40%.
Scenario 2: Financial Reconciliation System
The finance push-account system, originally powered by TiDB alone, needs near-real-time, highly consistent data and complex multi-table joins over tables with billions of rows. With writes still going to TiDB and the data then synchronized to StarRocks, per-scenario processing time dropped from ~30 minutes to ~30 seconds, a 60-fold speedup, while TiDB's compute load decreased by 70%.
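The 60-fold figure follows directly from the two timings. A quick check, using only the approximate numbers quoted above:

```python
# Sanity check of the quoted speedup, using only the article's figures.
before_seconds = 30 * 60  # ~30 minutes per scenario before the change
after_seconds = 30        # ~30 seconds after synchronizing to StarRocks

speedup = before_seconds / after_seconds
print(f"speedup: {speedup:.0f}x")
```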
Best Practices
Table Design
Partition tables by time fields and bucket by frequently queried columns.
Use integer columns as sort keys for GROUP BY and filtering.
Apply dynamic partitions for large detail tables to manage data expiration.
Choose precise data types (e.g., INT instead of BIGINT, appropriate string lengths).
Place numeric columns before string columns.
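The table-design points above can be combined in a single DDL statement. The sketch below renders one from Python. The table, its columns, the bucket count, and the retention window are all hypothetical, and the exact DDL syntax should be checked against the StarRocks version in use.

```python
def render_detail_table_ddl(table, bucket_column, buckets=16, retain_days=30):
    """Render a CREATE TABLE statement for a hypothetical detail table.

    Illustrates: partition by a time field, bucket by a frequently queried
    column, integer column in the sort key, numeric columns before string
    columns, and dynamic partitions that expire old data.
    """
    return f"""CREATE TABLE {table} (
    event_date DATE NOT NULL,
    user_id    INT NOT NULL,
    amount     DECIMAL(18, 2),
    channel    VARCHAR(32)
)
DUPLICATE KEY (event_date, user_id)
PARTITION BY RANGE (event_date) ()
DISTRIBUTED BY HASH ({bucket_column}) BUCKETS {buckets}
PROPERTIES (
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-{retain_days}",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "{buckets}"
);"""


# Hypothetical table name; partitions older than 30 days are dropped.
ddl = render_detail_table_ddl("ods_events", bucket_column="user_id")
print(ddl)
```

Here `dynamic_partition.start` sets how many days of history are kept and `dynamic_partition.end` how many future partitions are created ahead of time.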
Data Synchronization
Offline loads use BrokerLoad and SparkLoad.
Real‑time ingestion employs Flink‑CDC and a custom Galaxy platform.
Control write frequency to reduce merge overhead and ensure stability.
For partial-column updates, use replace_if_not_null on UniqueKey-style tables and the partial-update feature on PrimaryKey tables.
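One way to control write frequency is to buffer updates on the client and flush them in batches. The sketch below is not Waterdrop's implementation: `MicroBatcher` and `flush_fn` are hypothetical names, and `flush_fn` stands in for whatever actually writes a batch to StarRocks. While buffering, it merges updates to the same key so that a null never overwrites an existing value, which mirrors the idea behind replace_if_not_null.

```python
import time


class MicroBatcher:
    """Buffer row updates by key and flush them in batches (hypothetical helper)."""

    def __init__(self, flush_fn, max_rows=1000, max_age_seconds=5.0,
                 clock=time.monotonic):
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.rows = {}           # key -> merged row
        self.first_write = None  # time of the oldest buffered write

    def add(self, key, row):
        merged = self.rows.setdefault(key, {})
        for column, value in row.items():
            # A None never overwrites a value that is already buffered.
            if value is not None or column not in merged:
                merged[column] = value
        if self.first_write is None:
            self.first_write = self.clock()
        if self._should_flush():
            self.flush()

    def _should_flush(self):
        too_many = len(self.rows) >= self.max_rows
        too_old = self.clock() - self.first_write >= self.max_age_seconds
        return too_many or too_old

    def flush(self):
        if self.rows:
            self.flush_fn(list(self.rows.values()))
        self.rows = {}
        self.first_write = None


# Usage: three updates touching two keys become a single write.
batches = []
batcher = MicroBatcher(batches.append, max_rows=2, max_age_seconds=60)
batcher.add(1, {"id": 1, "name": "a", "city": None})
batcher.add(1, {"id": 1, "name": None, "city": "Beijing"})
batcher.add(2, {"id": 2, "name": "b", "city": None})
print(batches)
```

The age limit is only checked when a new row arrives, so a production version would also need a timer or a final `flush()` call on shutdown.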
Operations & Monitoring
Put Layer-4 load balancing in front of the Frontend (FE) nodes to ensure high availability.
Tune cluster parameters such as parallel_fragment_exec_instance_num and exec_mem_limit for better query concurrency and memory usage.
Monitor with Prometheus + Grafana; track slow and large SQLs for timely alerts.
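Tracking slow and large SQL comes down to classifying each finished query against thresholds. The sketch below shows the idea. The record fields and both thresholds are assumptions for illustration, not StarRocks audit-log field names or Waterdrop's actual limits.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    # Hypothetical fields; a real audit log uses its own names.
    sql: str
    latency_ms: int
    scan_bytes: int


SLOW_MS = 5_000               # assumed threshold for a "slow" query
LARGE_BYTES = 10 * 1024 ** 3  # assumed threshold for a "large" scan (10 GiB)


def classify(record):
    """Return the alert labels that apply to one finished query."""
    labels = []
    if record.latency_ms >= SLOW_MS:
        labels.append("slow")
    if record.scan_bytes >= LARGE_BYTES:
        labels.append("large")
    return labels


records = [
    QueryRecord("SELECT 1", latency_ms=12, scan_bytes=0),
    QueryRecord("SELECT * FROM big_table", latency_ms=8_000,
                scan_bytes=50 * 1024 ** 3),
]
flagged = [(r.sql, classify(r)) for r in records if classify(r)]
print(flagged)
```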
Permissions & Resources
Separate accounts to avoid resource contention and simplify monitoring.
Define resource groups per business scenario for query isolation.
Centralize DDL permissions to enhance security.
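Per-scenario resource groups can be generated from a small template. The sketch below renders a `CREATE RESOURCE GROUP` statement. The group name, user, and limits are hypothetical, and the available properties vary between StarRocks versions, so the statement should be checked against the documentation for the deployed release.

```python
def render_resource_group(name, user, cpu_core_limit, mem_limit_pct,
                          concurrency_limit):
    """Render a CREATE RESOURCE GROUP statement bound to one account."""
    return (
        f"CREATE RESOURCE GROUP {name}\n"
        f"TO (user='{user}')\n"
        "WITH (\n"
        f'    "cpu_core_limit" = "{cpu_core_limit}",\n'
        f'    "mem_limit" = "{mem_limit_pct}%",\n'
        f'    "concurrency_limit" = "{concurrency_limit}",\n'
        '    "type" = "normal"\n'
        ");"
    )


# Hypothetical reporting workload with its own account and limits.
statement = render_resource_group(
    "rg_report", user="report_user",
    cpu_core_limit=8, mem_limit_pct=30, concurrency_limit=20,
)
print(statement)
```

Binding each group to a dedicated account ties the isolation rules to the separate accounts recommended above.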
Data Management & Quality
Analyze query logs regularly for table lifecycle management.
Perform T+1 offline data quality checks and hourly/daily real‑time validation.
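A common building block for such checks is a row-count comparison between the source system and StarRocks. The sketch below is one possible form, with a hypothetical function name and tolerance. The counts would come from queries against each system, which are omitted here.

```python
def check_row_counts(source_count, target_count, tolerance=0.0):
    """Compare row counts and return (ok, relative_difference).

    `tolerance` is the allowed relative difference: 0.0 fits a strict
    T+1 offline check, while a small positive value allows for rows
    still in flight during real-time validation.
    """
    if source_count == 0:
        return target_count == 0, 0.0 if target_count == 0 else 1.0
    diff = abs(source_count - target_count) / source_count
    return diff <= tolerance, diff


print(check_row_counts(1000, 1000))
print(check_row_counts(1000, 990, tolerance=0.005))
```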
Current Issues
Missing AUTO_INCREMENT and CURRENT_TIMESTAMP support.
String length limits for certain fields.
Log format not optimal for error analysis.
Real-time writes need explicit write-frequency control.
Time fields lack millisecond precision.
CPU isolation is incomplete; row‑level permission control is unavailable.
Future Planning
Three focus areas for 2023‑2024:
User Profile: Transition from HBase + ES to StarRocks for massive wide tables (1000 billion+ rows) with frequent column updates.
Monitoring & Alerts: Achieve minute-level or sub-minute real-time monitoring using StarRocks.
User Behavior Analysis: Replace ClickHouse for funnel, retention, and path analysis, handling multi-table joins on tables exceeding 1000 billion rows.
Key milestones include expanding StarRocks usage in H1 2023, upgrading to version 2.5+ in July 2023, real‑time ingestion of click‑stream and binlog data by October 2023, and completing OLAP engine unification by the end of 2023.
Acknowledgements
Thanks to the Alibaba Cloud StarRocks team for technical support, and to the community for continuous contributions.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
