Comprehensive Comparison of Apache Kylin and Apache Doris: Architecture, Data Models, Storage, Query, and Operations
This article provides an in‑depth technical comparison of Apache Kylin and Apache Doris, covering their system architectures, aggregation and detail data models, storage engines, data import processes, query execution, deduplication, metadata handling, performance, high availability, maintainability, usability, schema‑change capabilities, features, and community ecosystems.
1. System Architecture
Kylin follows a pre-computation (cube) model that trades space for time to accelerate fixed-pattern OLAP queries. Its JobServer builds cubes with MapReduce or Spark, and its QueryServer parses SQL and scans HBase. Doris is an MPP OLAP system that draws on Google Mesa (data model), Apache Impala (query engine), and ORCFile (storage format). It consists of a Frontend (FE) for query planning and a Backend (BE) for execution and storage.
2. Data Models
2.1 Kylin Aggregation Model
Kylin separates dimension columns from metric columns and aggregates the metrics with functions such as SUM, COUNT, MIN, MAX, and distinct count. Each aggregated record is stored in HBase with the cuboid ID plus the dimension values as the row key and the metrics as the value.
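The layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain dictionaries stand in for HBase, a bitmask over the dimension list stands in for the cuboid ID, and the function name `build_cuboids` and the sample rows are hypothetical. Kylin's actual encoding and cuboid pruning are not shown.

```python
from collections import defaultdict
from itertools import combinations


def build_cuboids(rows, dims, metric):
    """Aggregate SUM and COUNT of `metric` for every subset of `dims`.

    Returns {cuboid_id: {dimension_values: (sum, count)}}, where cuboid_id
    is a bitmask over `dims` (bit i set means dims[i] is kept).
    """
    cuboids = {}
    for size in range(len(dims) + 1):
        for kept in combinations(range(len(dims)), size):
            cuboid_id = sum(1 << i for i in kept)
            agg = defaultdict(lambda: [0, 0])
            for row in rows:
                # Row key part: the values of the dimensions this cuboid keeps.
                key = tuple(row[dims[i]] for i in kept)
                agg[key][0] += row[metric]
                agg[key][1] += 1
            cuboids[cuboid_id] = {k: tuple(v) for k, v in agg.items()}
    return cuboids


rows = [
    {"city": "Beijing", "day": "2024-01-01", "sales": 10},
    {"city": "Beijing", "day": "2024-01-02", "sales": 5},
    {"city": "Shanghai", "day": "2024-01-01", "sales": 7},
]
cuboids = build_cuboids(rows, ["city", "day"], "sales")
```

With two dimensions the sketch produces four cuboids. A query that groups only by city can then be answered from cuboid 1 without touching the detail rows.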
2.2 Doris Aggregation Model
Doris adopts a similar model in which dimensions are called Keys and metrics are called Values. It adds a special Replace function that keeps only the latest value for a given key and so enables point updates, although columns aggregated with Replace cannot be pre-aggregated.
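A minimal Python sketch of this Keys/Values behaviour follows. It assumes an in-memory dictionary as the table and rows merged in load order; the function name `aggregate_load` and the sample columns are hypothetical, and this is not Doris's implementation.

```python
def aggregate_load(table, batch, agg_funcs):
    """Merge `batch` into `table`, combining rows that share the same key.

    table:     dict mapping a key tuple to a list of value columns
    batch:     iterable of (key_tuple, value_list), applied in load order
    agg_funcs: one of "SUM", "MIN", "MAX", "REPLACE" per value column
    """
    combine = {
        "SUM": lambda old, new: old + new,
        "MIN": min,
        "MAX": max,
        "REPLACE": lambda old, new: new,  # the most recently loaded value wins
    }
    for key, values in batch:
        if key not in table:
            table[key] = list(values)
        else:
            current = table[key]
            for i, func in enumerate(agg_funcs):
                current[i] = combine[func](current[i], values[i])
    return table


funcs = ["SUM", "REPLACE"]  # value columns: click count, account status
table = {}
aggregate_load(table, [(("u1", "2024-01-01"), [3, "free"]),
                       (("u2", "2024-01-01"), [1, "free"])], funcs)
aggregate_load(table, [(("u1", "2024-01-01"), [4, "paid"])], funcs)
```

After the second load, the row for `u1` holds a summed count of 7 and the replaced status `paid`. Correcting a record this way only requires appending one new row.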
2.3 Cuboid vs. RollUp
Kylin Cuboids and Doris RollUp tables both act as materialized views or indexes, and the system selects the one to use automatically during query execution.
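The idea of automatic selection can be sketched as follows. The rule used here, picking the candidate with the fewest rows among those that contain every column the query needs, is an assumption made for illustration; the source does not spell out the actual selection rules of either system, and the names `choose_rollup` and `rollups` are hypothetical.

```python
def choose_rollup(rollups, needed_columns):
    """Return the name of the smallest rollup that covers `needed_columns`.

    rollups: dict mapping a name to (set_of_columns, row_count)
    """
    needed = set(needed_columns)
    candidates = [
        (row_count, name)
        for name, (columns, row_count) in rollups.items()
        if needed <= columns  # the rollup must contain every needed column
    ]
    if not candidates:
        raise LookupError("no rollup covers the requested columns")
    return min(candidates)[1]


rollups = {
    "base": ({"city", "day", "user", "sales"}, 1_000_000),
    "city_day": ({"city", "day", "sales"}, 5_000),
    "city": ({"city", "sales"}, 40),
}
```

For example, `choose_rollup(rollups, ["city", "sales"])` returns `"city"`, while a query that also needs `user` falls back to `"base"`.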
2.4 Doris Detail Model
Doris also provides a non‑aggregated detail model that requires specifying sort columns; data is partitioned by date and bucketed, enabling efficient range scans.
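As a rough illustration of date partitioning combined with bucketing, the sketch below routes a row to a (partition, bucket) pair. The CRC32 hash, the column names, and the function name `route_row` are assumptions chosen for the example, not details taken from the source.

```python
import zlib


def route_row(row, partition_col, bucket_cols, num_buckets):
    """Return (partition, bucket) for a row.

    The partition is the value of the date column; the bucket is a stable
    hash of the bucketing columns modulo the bucket count.
    """
    partition = row[partition_col]
    bucket_key = "|".join(str(row[c]) for c in bucket_cols).encode("utf-8")
    bucket = zlib.crc32(bucket_key) % num_buckets
    return partition, bucket


row = {"dt": "2024-01-01", "user_id": 42, "event": "click"}
partition, bucket = route_row(row, "dt", ["user_id"], 8)
```

A query with a filter on `dt` only has to read the matching partitions, and rows with the same `user_id` always land in the same bucket.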
3. Storage Engines
Kylin stores Cuboid data in HBase; each Segment maps to an HBase table, which is split into Regions and HFiles. Doris uses a columnar storage format inspired by ORC, with tablets as the smallest physical unit, secondary partitioning, and prefix indexes for fast key lookups.
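A prefix index over sorted data can be sketched as a sparse index that records the leading key of each block and is searched with binary search. This is a simplified model with hypothetical names (`SparsePrefixIndex`, `block_size`); the granularity and key encoding Doris actually uses are not described in the source.

```python
from bisect import bisect_left, bisect_right


class SparsePrefixIndex:
    """Sparse index over rows that are already sorted by their leading keys."""

    def __init__(self, sorted_rows, key_len, block_size):
        self.rows = sorted_rows
        self.key_len = key_len
        self.block_size = block_size
        # One index entry per block: the key prefix of the block's first row.
        self.block_keys = [
            tuple(sorted_rows[i][:key_len])
            for i in range(0, len(sorted_rows), block_size)
        ]

    def lookup(self, prefix):
        """Return all rows whose leading keys equal `prefix`."""
        prefix = tuple(prefix)
        # Matches may start in the last block whose first key is < prefix.
        first = max(bisect_left(self.block_keys, prefix) - 1, 0)
        # Blocks whose first key is > prefix cannot contain a match.
        last = bisect_right(self.block_keys, prefix)
        start = first * self.block_size
        end = min(last * self.block_size, len(self.rows))
        return [r for r in self.rows[start:end]
                if tuple(r[:self.key_len]) == prefix]


rows = [(1, "a", 10), (1, "b", 20), (2, "a", 30), (2, "c", 40),
        (3, "a", 50), (3, "b", 60), (3, "c", 70)]
index = SparsePrefixIndex(rows, key_len=1, block_size=2)
```

Only the blocks that can contain the key are scanned, so a lookup such as `index.lookup((2,))` reads the first two blocks and skips the rest.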
4. Data Import
Kylin's import pipeline includes building a wide Hive table, dictionary construction, multi‑level Cuboid generation, HFile creation, loading into HBase, and metadata updates. Doris separates ETL (type/format validation, tablet splitting, sorting, aggregation) from LOADING (tablet data pull, format conversion, index generation) followed by metadata refresh.
5. Query Processing
Kylin follows a scatter-gather model: SQL is parsed, optimized, and compiled to code, then HBase scans run with optional coprocessor aggregation before the partial results are merged into the final result. Doris uses an Impala-based MPP engine: the FE generates a single-node plan and then splits it into PlanFragments connected by ExchangeNodes, placing them so as to minimize data movement, and the BE nodes execute operators such as Scan, Join, and Aggregation.
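Both designs rely on aggregating partially where the data lives and merging the partial results afterwards. The sketch below shows that pattern for SUM, COUNT, and AVG in plain Python; lists stand in for the data held by each node, and the function names `partial_aggregate` and `merge_partials` are hypothetical.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Runs on each node: SUM and COUNT per group over local rows only."""
    acc = defaultdict(lambda: [0, 0])
    for group, value in rows:
        acc[group][0] += value
        acc[group][1] += 1
    return dict(acc)


def merge_partials(partials):
    """Runs on the coordinator: combine partial states and finalize AVG."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for group, (s, c) in partial.items():
            total[group][0] += s
            total[group][1] += c
    return {g: (s, c, s / c) for g, (s, c) in total.items()}


node_data = [
    [("a", 1), ("b", 2)],  # rows held by the first node
    [("a", 3)],            # rows held by the second node
]
result = merge_partials(partial_aggregate(rows) for rows in node_data)
```

Only the small per-group states cross the network, not the raw rows. AVG is carried as a (sum, count) pair because averages of averages would be wrong.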
6. Precise Distinct Counting
Kylin implements pre‑computed distinct counting using global dictionaries and RoaringBitmap. Doris performs on‑the‑fly distinct counting in two phases, illustrated by the following SQL example:
SELECT a, COUNT(DISTINCT b, c), MIN(d), COUNT(*) FROM T GROUP BY a
7. Metadata Management
Kylin stores metadata as JSON rows in HBase, enabling horizontal scaling but requiring HBase even with a pluggable storage architecture. Doris keeps metadata in memory, offering fast access with limited scalability.
8. Performance
Kylin's speed stems from pre-computed cubes, which reduce most queries to a scan plus a filter. Doris benefits from in-memory metadata, pre-aggregated roll-up tables, MPP execution, vectorized processing, columnar storage, and prefix indexes.
9. High Availability
Kylin achieves HA for JobServer via ZooKeeper and for QueryServer via load balancers, but overall HA depends on the underlying Hadoop ecosystem. Doris provides HA for FE using a Paxos‑like protocol (BDB‑JE) and replicates tablets across BE nodes.
10. Maintainability
Kylin deployment requires a full Hadoop stack (HDFS, HBase, Hive, Spark, Yarn, ZooKeeper). Doris only needs FE and BE components. Operational complexity is higher for Kylin due to many dependent services.
11. Usability
Kylin offers HTTP, JDBC, and ODBC interfaces; Doris uses the MySQL protocol, allowing existing MySQL tools to connect directly. Learning Kylin involves understanding cuboids, dimensions, row‑key design, and Hadoop job logs, whereas Doris requires grasping aggregation vs. detail models, prefix indexes, and roll‑up tables.
12. Schema Change
Kylin requires a full data rebuild for any cube schema change. Doris supports online schema changes in three modes: direct (full reload), sorted (data is re-sorted), and linked (metadata-only change, e.g., adding columns).
13. Features & Community
Both systems support roll‑up tables; Kylin can emulate detail queries by building a base cuboid with all columns. Doris’s community is nascent (mainly Baidu), while Kylin has a mature, China‑driven open‑source ecosystem.
14. Conclusion
The article objectively contrasts Kylin and Doris across architecture, data modeling, storage, ingestion, query, deduplication, metadata, performance, HA, maintainability, usability, schema evolution, and community, providing a foundation for selecting the appropriate OLAP solution based on specific requirements.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.