How Doris Powers Meituan’s Real‑Time Data Warehouse: ROLAP vs MOLAP Lessons
This article examines Meituan’s data warehouse evolution, detailing the limitations of MOLAP with Kylin, the adoption of Doris‑driven ROLAP using MPP technology, and the practical optimizations—such as join predicate pushdown, concurrent execution, colocate join, and bitmap aggregation—that improve real‑time analytics and reduce costs.
Background and Motivation
Meituan’s food‑delivery platform processes billions of rows of data daily. Traditional data‑warehouse stacks use Hadoop/Spark for batch processing and MySQL or Kylin (MOLAP) for interactive queries. Pre‑computed MOLAP cubes become costly and inflexible when dimensions such as merchant “business circles” change frequently, requiring full recomputation of historical data.
Challenges in Data Production
Daily full refresh of historical data eliminates incremental benefits.
Back‑filling >1 billion rows per day consumes >3 hours of compute and >1 TB of storage.
Only ~20 % of historical data is actually queried, yet the system must support half‑year retrospectives.
MOLAP cannot answer detailed row‑level queries.
Solution: MPP‑Based ROLAP Engine (Apache Doris)
To avoid costly pre‑computation, Meituan adopted a “compute‑on‑demand” approach powered by an MPP engine. Doris combines Google Mesa data‑model concepts, Apache Impala’s query engine, and ORC storage, providing both aggregated and detailed query capabilities.
Dual‑Engine Architecture
Meituan runs Kylin (MOLAP) and Doris (ROLAP) side‑by‑side. Queries are routed based on workload: stable, incremental analyses use Kylin; flexible, ad‑hoc or detailed analyses use Doris.
Key Technical Trade‑offs
MOLAP disadvantages
Complex application‑layer models and heavy pre‑processing.
Intricate Kylin configuration and pruning strategies.
Inability to query detailed data directly; requires sync to a DBMS.
High production cost due to extensive pre‑computation.
ROLAP advantages (Doris)
Simplified star‑schema at a stable granularity (e.g., merchant level).
View‑based encapsulation reduces data redundancy and operational overhead.
Supports both aggregated and row‑level queries.
Lightweight, standardized models lower production cost.
Doris Overview and Capabilities
Frontend (FE) handles SQL parsing, planning, optimization, and metadata.
Backend (BE) executes queries and stores data in ORC format.
MySQL‑compatible protocol and standard SQL.
Rollup tables with intelligent routing.
Robust join strategies, including colocate join.
Online schema changes, range/hash secondary partitioning.
High‑concurrency point queries and high‑throughput ad‑hoc analytics.
Support for both batch and real‑time ingestion.
Performance Optimizations in Doris
1. Join Predicate Push‑Down (Transitive Closure)
select * from t1 join t2 on t1.id = t2.id where t1.id = 1;Doris infers t2.id = 1 and pushes the predicate to the scan of t2, dramatically reducing scanned partitions.
2. Concurrent Execution Instances
Increasing per‑node parallelism (e.g., 5 instances per operator) improves latency 3‑5× for large scans.
3. Colocate Join
Data is partitioned by join key so that joins execute locally without network shuffle. Implementation steps include persisting colocate table metadata, shifting hash‑join granularity from server‑level to bucket‑level, and validating join conditions.
4. Bitmap‑Based Precise Distinct Counting
During ingestion Doris stores bitmap aggregates for dimension columns. Queries that require distinct counts can read the bitmap directly, avoiding full scans and reducing CPU/IO load.
Real‑World Impact in Meituan’s Warehouse
A 20‑node BE + 3‑node FE Doris cluster delivers:
Millisecond‑level response for dozens of analytical products.
Second‑level latency for multi‑billion‑row joins when colocate join is used.
Sub‑second merchant‑level drill‑down queries.
7‑day trend analysis completed in 2‑3 seconds; scalability limited only by cluster size.
Improved reliability, availability, and scalability over a year of production use.
Near‑Real‑Time Scenarios
For marketing activities requiring sub‑minute freshness, Meituan built a Lambda‑style pipeline: Kafka streams feed Doris, providing 10‑15 minute “near‑real‑time” data for daily and weekly comparisons.
Summary and Reflections
Meituan’s experience shows that no single data‑production model fits all use cases. An MPP‑driven ROLAP approach with Doris efficiently handles aggregated and detailed queries, changing dimensions, and near‑real‑time workloads, while MOLAP (Kylin) remains valuable for stable, incremental analyses. The open‑source Doris engine continues to evolve, offering a compelling alternative to Kylin, Druid, and Elasticsearch for large‑scale analytics.
References:
https://github.com/apache/incubator-doris
https://blog.bcmeng.com/post/apache-kylin-vs-baidu-palo.html
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
