Comparative Analysis of Kylin, Druid, and ClickHouse OLAP Engines
This article examines the data models and index structures of the three leading OLAP engines—Kylin, Druid, and ClickHouse—offering technical explanations, code examples, and a concise comparison to help readers understand their strengths, limitations, and suitable use cases.
Introduction Kylin, Druid, and ClickHouse are currently the most popular OLAP engines. This article analyzes their core technologies from the perspectives of data model and index structure, and provides a brief comparison. Readers are expected to have some prior knowledge of these engines.
Kylin Data Model
Kylin transforms a two‑dimensional Hive table into a Cube and then stores the Cube in HBase, which involves two conversions.
The first conversion is the traditional cube operation. A Cube consists of CuboIds; each CuboId represents a fixed‑column data set, e.g., the combination of dimensions "A" and "B" corresponds to the following SQL query:
select A, B, sum(M), sum(N) from table group by A, BThe second conversion stores the Cube data in HBase. During this step, CuboId and dimension information are serialized into the rowkey, while metric columns form column families. The data is pre‑aggregated, and the following image shows how the Cube is stored in HBase.
Kylin Index Structure
Since Kylin stores data in HBase, its index is essentially HBase's index, a simplified B+‑tree. HFile indexes are clustered by rowkey and typically have two or three levels, with default node size of 64 KB. Lookup proceeds through the tree, using binary search within each node because the data inside nodes is ordered by rowkey.
Kylin Summary : Suitable for aggregation queries; pre‑aggregation makes Kylin extremely fast for group‑by queries, often scanning only one row. Query performance depends heavily on whether the CuboId is hit, leading to variability. HBase indexing resembles MySQL composite indexes, and the ordering of dimensions in the rowkey greatly impacts performance, so business experts should be involved in table design.
Druid Data Model
Druid’s model is simpler: it also performs pre‑aggregation, but instead of building cubes it groups by all dimensions directly, as illustrated below.
Druid Index Structure
Druid uses a custom column‑oriented storage format. Each column is stored in a separate logical (physical) file with start offsets. For dimension columns, Druid builds bitmap indexes: a dictionary of unique values is created, and for each value a bitmap of length equal to the total number of rows marks which rows contain that value. The bitmap acts as a reverse index, while the dictionary provides a forward index.
Example query:
select site, sum(pv) from xx where date=2020-01-01 and city='bj' group by siteBinary search the dictionary of the city column to find the bitmap for ‘bj’.
Intersect this bitmap with the bitmaps of other dimension values as needed.
The resulting bitmap points to the rows that satisfy the filter, allowing fast aggregation on the pv column.
Druid Summary : Optimized for aggregation queries but less suitable for scenarios with extremely high‑cardinality dimensions. It stores fully aggregated data (similar to Kylin’s base CuboId) and creates indexes for every dimension, resulting in consistently fast queries without the large performance swings seen in Kylin.
ClickHouse Index Structure (MergeTree Engine)
ClickHouse stores data in ordinary two‑dimensional tables, so the focus is on its index. ClickHouse also uses a column‑oriented layout where each column has its own file. The index works by selecting a subset of columns as primary (e.g., date and city) and sorting the entire data file by these columns, similar to a MySQL composite index. Every 8192 rows a “mark” is written, recording the offset of the block; the mark number is used to locate data within the file.
Illustration (table with columns date, city, action):
Example query:
select count(distinct action) where date=toDate(2020-01-01) and city='bj'Binary search primary.idx to obtain the set of marks (data blocks) that may contain matching rows.
Within those blocks, locate matching rows in the date and city columns and intersect the row‑id sets.
Convert the row‑ids to mark numbers and offsets, then read the corresponding portions of the action column file.
Perform the final aggregation.
Reverse lookup (using the filter) works efficiently only when the query predicates match the leftmost prefix of the sorting key; otherwise, large portions of the table may need to be scanned, causing significant performance variability.
ClickHouse Summary : The MergeTree family supports both detail‑level and aggregated workloads. Its index behaves like a MySQL composite index—highly efficient when the query prefix matches the sorting key, but potentially scanning the whole table otherwise. Table design therefore also requires business expertise, similar to Kylin.
Overall Summary
Kylin and Druid excel in aggregation scenarios; ClickHouse handles both detail and aggregation workloads.
Query efficiency ranking for aggregation: Kylin > Druid > ClickHouse.
Both Kylin and ClickHouse need business experts to design tables.
Kylin and ClickHouse may exhibit large performance fluctuations depending on query patterns.
ClickHouse has the most advanced vectorized execution; Druid offers limited vectorization; Kylin currently lacks vectorized computation.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
