
Practical Experience of StarRocks Materialized Views at Didi

This article presents Didi's practical experience with StarRocks materialized views, covering the evolution of its OLAP architecture, the challenges of previous engines, the adoption of StarRocks, the design of materialized view acceleration for real‑time dashboards, and future optimization directions.

DataFunTalk

Didi's OLAP system originally relied on Druid for real-time monitoring and Kylin for offline query acceleration. By 2020, as business workloads grew in complexity and scale, these engines fell short on performance, stability, and usability, and became costly to maintain.

To address these issues, Didi introduced ClickHouse, an open-source columnar OLAP engine, and later adopted StarRocks, an MPP database with vectorized execution, a pipeline execution engine, a cost-based optimizer (CBO), and intelligent materialized views. StarRocks supports real-time updates through its primary-key tables as well as high-concurrency analytics, and it quickly gained traction in the open-source community.

StarRocks offers several advantages: a simple distributed architecture without external dependencies, strong query performance through columnar storage and vectorization, higher query concurrency (QPS), flexible data models (detail, aggregate, and primary-key tables), and native lakehouse capabilities that allow federated queries over Hive, Iceberg, and Hudi.

As of May 2023, Didi operates over 30 StarRocks clusters storing more than 300 TB of data, serving over 4 million daily queries across more than 1,500 tables for virtually all business lines.

On the platform side, Didi ingests data into StarRocks with Spark Load for batch loads and the Flink-StarRocks connector for real-time streaming. An internal cloud-native operations platform enables one-click cluster provisioning within an hour.

Engine-wise, Didi employs containerization, resource isolation, and dual‑path mechanisms to meet diverse stability requirements, offering dedicated physical, container, and shared clusters. Monitoring, alerting, and query analysis tools are exposed to users for performance tuning.

The second part details how materialized views accelerate real‑time dashboards. Didi's core dashboard monitors over 20 metrics (calls, GMV, etc.) with high‑precision distinct counting, flexible dimension filtering, and massive concurrent queries during peak events.

StarRocks materialized views are built in three layers: an ODS layer storing raw detail data, a DWD layer with synchronous materialized views for incremental aggregation, and an ADS layer with asynchronous materialized views refreshed on a schedule. The asynchronous views act as persistent query caches, dramatically reducing latency.
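The difference between the synchronous and asynchronous layers can be illustrated with a small in-memory model. This is only a sketch of the idea, not StarRocks code: the class names, the sample rows, and the `refresh()` method are hypothetical, and a real deployment would express each layer as a table or materialized view.

```python
from collections import defaultdict

# ODS layer: raw detail rows (hypothetical order events).
ods_rows = [
    {"minute": "10:00", "city": "beijing", "gmv": 30.0},
    {"minute": "10:00", "city": "beijing", "gmv": 12.5},
    {"minute": "10:00", "city": "shanghai", "gmv": 20.0},
    {"minute": "10:01", "city": "beijing", "gmv": 8.0},
]


class SyncRollup:
    """DWD layer stand-in: aggregates are updated as each row is ingested."""

    def __init__(self):
        self.totals = defaultdict(lambda: {"calls": 0, "gmv": 0.0})

    def ingest(self, row):
        cell = self.totals[(row["minute"], row["city"])]
        cell["calls"] += 1
        cell["gmv"] += row["gmv"]


class AsyncSnapshot:
    """ADS layer stand-in: the result changes only when refresh() runs."""

    def __init__(self, source):
        self.source = source
        self.by_city = {}

    def refresh(self):
        result = defaultdict(lambda: {"calls": 0, "gmv": 0.0})
        for (_minute, city), cell in self.source.totals.items():
            result[city]["calls"] += cell["calls"]
            result[city]["gmv"] += cell["gmv"]
        self.by_city = {city: dict(cell) for city, cell in result.items()}


dwd = SyncRollup()
ads = AsyncSnapshot(dwd)

for row in ods_rows[:3]:
    dwd.ingest(row)
ads.refresh()  # a scheduled refresh picks up the first three rows

dwd.ingest(ods_rows[3])  # the DWD aggregate reflects the new row at once
stale_calls = ads.by_city["beijing"]["calls"]  # ADS still serves the old snapshot
ads.refresh()
fresh_calls = ads.by_city["beijing"]["calls"]
```

Between refreshes the ADS snapshot answers from precomputed results, which is what makes it behave like a persistent cache; the cost is that it lags the detail data until the next refresh.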

To handle high-cardinality distinct counts, Didi leverages StarRocks' BITMAP and HyperLogLog types. Because BITMAP operates on integer values, Didi implements a global dictionary that maps string keys to auto-increment IDs using partial updates and the dict_mapping function. This enables efficient bitmap aggregation and reduces storage overhead.
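The dictionary-plus-bitmap idea can be sketched in a few lines. This is an illustration under stated assumptions rather than Didi's implementation: `GlobalDictionary`, `bitmap_count`, and the sample events are hypothetical, and a plain Python integer stands in for a compressed bitmap.

```python
class GlobalDictionary:
    """Maps string keys to dense, auto-incrementing integer IDs."""

    def __init__(self):
        self._ids = {}
        self._next_id = 1

    def id_for(self, key):
        # Assign the next ID the first time a key is seen; reuse it afterwards.
        if key not in self._ids:
            self._ids[key] = self._next_id
            self._next_id += 1
        return self._ids[key]


def bitmap_count(bitmap):
    """Number of set bits, i.e. the number of distinct IDs in the bitmap."""
    return bin(bitmap).count("1")


# Hypothetical (minute, user key) events; user-c appears twice in one minute.
events = [
    ("10:00", "user-a"),
    ("10:00", "user-b"),
    ("10:01", "user-a"),
    ("10:01", "user-c"),
    ("10:01", "user-c"),
]

dictionary = GlobalDictionary()
per_minute = {}
for minute, user in events:
    bit = 1 << dictionary.id_for(user)
    per_minute[minute] = per_minute.get(minute, 0) | bit

minute_counts = {m: bitmap_count(b) for m, b in per_minute.items()}

# Rolling up to a coarser grain is a bitmap union, not a sum of counts.
hour_bitmap = 0
for bitmap in per_minute.values():
    hour_bitmap |= bitmap
hour_distinct = bitmap_count(hour_bitmap)
```

Each minute has two distinct users, yet the hour has three rather than four, because user-a appears in both minutes. Storing the bitmap instead of the count is what lets a pre-aggregated layer still answer exact distinct counts at a coarser grain.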

View creation is optimized by distinguishing between additive and non-additive dimensions, cutting the number of views required from 2ⁿ to 2⁽ⁿ⁻ᵐ⁾, where n is the total number of dimensions and m is the count of additive dimensions.
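A quick check of that arithmetic, reading the statement as "only subsets of the n − m non-additive dimensions each need their own view". The dimension names and the split between additive and non-additive are hypothetical, chosen only to make the counts concrete.

```python
from itertools import combinations


def dimension_subsets(dims):
    """All subsets of dims, from the empty set up to the full set."""
    return [
        combo
        for size in range(len(dims) + 1)
        for combo in combinations(dims, size)
    ]


# Hypothetical dashboard dimensions (n = 5).
all_dims = ["city", "product_line", "channel", "order_type", "user_tier"]
# Assumed additive dimensions (m = 3); they are not enumerated separately.
additive_dims = ["city", "product_line", "channel"]
non_additive_dims = [d for d in all_dims if d not in additive_dims]

naive_views = dimension_subsets(all_dims)             # 2**5 combinations
reduced_views = dimension_subsets(non_additive_dims)  # 2**(5 - 3) combinations
```

With five dimensions of which three are additive, the count drops from 32 candidate views to 4, and the saving grows exponentially with m.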

StarRocks also provides transparent view acceleration, automatically rewriting queries to hit the appropriate materialized view without user‑level SQL changes.
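Conceptually, transparent acceleration means picking a precomputed view that can answer the query and falling back to the base table otherwise. The toy matcher below only illustrates that idea; StarRocks' actual rewriter works on query plans and is considerably more involved. The view names, dimension sets, and `choose_view` function are all hypothetical.

```python
def choose_view(query_dims, views):
    """Return the smallest view whose dimensions cover the query, or None.

    query_dims: dimensions the query filters or groups on.
    views: mapping of view name to the dimensions that view retains.
    """
    needed = set(query_dims)
    candidates = [
        (name, dims) for name, dims in views.items() if needed <= set(dims)
    ]
    if not candidates:
        return None  # no view qualifies; the query runs on the base table
    # Prefer the view with the fewest dimensions, assumed to be the smallest.
    return min(candidates, key=lambda item: (len(item[1]), item[0]))[0]


views = {
    "mv_city": ("city",),
    "mv_city_channel": ("city", "channel"),
    "mv_city_channel_type": ("city", "channel", "order_type"),
}

by_city = choose_view({"city"}, views)
by_channel = choose_view({"channel"}, views)
by_city_and_type = choose_view({"city", "order_type"}, views)
unmatched = choose_view({"user_tier"}, views)
```

The user's SQL never names a view; the selection happens underneath, which is why dashboards can be accelerated without changing the queries they issue.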

The final section summarizes results: query latency reduced by 80%, resource consumption cut by 95%, and QPS capacity increased by an order of magnitude. Remaining challenges include complex data pipelines, maintenance overhead, and the eventual consistency of asynchronous views. Future work focuses on improving BITMAP performance, reducing optimizer overhead, and automating materialized view creation based on query patterns.
