Design Principles and Architecture of Apache Kylin for Sub‑Second OLAP Queries
This article explains how Apache Kylin, an open‑source distributed analytics engine built on Hadoop/Spark, achieves sub‑second OLAP query performance through pre‑computed cubes, a layered cuboid generation algorithm, bitmap‑based distinct counting, dimension optimization techniques, and tight integration with HBase for storage and fast SQL querying.
What is Apache Kylin
Apache Kylin™ is an open‑source distributed analytical engine that provides a SQL interface and OLAP capabilities on top of Hadoop/Spark, enabling interactive, sub‑second queries over massive datasets.
Main Features
Accelerated queries with sub‑second response time via pre‑computed aggregations.
Standard SQL interface compatible with Hive tables.
Interactive query experience comparable to traditional BI tools.
Scalable architecture with high throughput.
Seamless integration with BI products such as Tableau, PowerBI, QlikSense, Hue, and Superset.
Core Components
REST Server: Entry point for applications, exposing query, cube build, metadata, and permission APIs.
Query Engine: Parses user SQL, interacts with other components, and returns results.
Routing: Translates the SQL execution plan into look-ups against the cuboids stored in HBase.
Metadata Manager: Manages all metadata, especially cube definitions, which are stored in HBase.
Cube Build Engine: Executes offline tasks (Shell, Java API, MapReduce) to generate cubes.
Storage Engine: Persists cuboids as key‑value pairs in HBase.
Pre‑Computation Principle
Kylin achieves sub‑second queries by pre‑computing, offline, the aggregation results for every combination of dimensions (the cuboids that together make up a cube) and storing them in HBase. Queries are answered by looking up these pre‑computed results instead of scanning raw Hive tables.
Cuboid Generation (By‑Layer Algorithm)
The algorithm builds cuboids layer by layer, aggregating from the most detailed level to higher levels while preserving distinct‑count information using BitMaps.
Example SQL for the base layer (device, subject, grade):
select device,subject,grade,count(uid) as pv,count(distinct uid) as uv
from visit_log
group by device,subject,grade;
The resulting cuboid rows include PV and UV values, where UV is stored as a bitmap (e.g., 1(001): a count of 1 backed by the bitmap 001).
Higher‑level cuboids are derived by summing PV values and performing bitmap OR operations for UV to retain accurate distinct counts.
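The roll‑up described above can be sketched in a few lines of Python. This is an illustrative sketch, not Kylin's build code: the sample rows, the names `BASE`, `roll_up`, and `build_by_layer`, and the use of plain Python integers as bitmaps are all assumptions made for the example.

```python
from itertools import combinations

DIMS = ("device", "subject", "grade")

# Hypothetical base cuboid: (device, subject, grade) -> (pv, uv_bitmap).
# Each uv_bitmap is a Python int in which bit i is set when user i visited.
BASE = {
    ("IOS", 1, 1): (2, 0b001),
    ("IOS", 2, 1): (1, 0b010),
    ("Android", 1, 1): (3, 0b011),
}

def roll_up(parent, parent_dims, child_dims):
    """Aggregate a parent cuboid into a child cuboid that keeps fewer dimensions."""
    keep = [parent_dims.index(d) for d in child_dims]
    child = {}
    for key, (pv, uv) in parent.items():
        child_key = tuple(key[i] for i in keep)
        old_pv, old_uv = child.get(child_key, (0, 0))
        # PV is summed; UV bitmaps are OR-ed so shared users are not double-counted.
        child[child_key] = (old_pv + pv, old_uv | uv)
    return child

def build_by_layer(base, dims):
    """Build every cuboid, one layer at a time, from the layer above it."""
    cuboids = {dims: base}
    for size in range(len(dims) - 1, -1, -1):
        for child_dims in combinations(dims, size):
            # Any cuboid in the layer above that contains the child's dimensions will do.
            parent_dims = next(
                p for p in cuboids
                if len(p) == size + 1 and set(child_dims) <= set(p)
            )
            cuboids[child_dims] = roll_up(cuboids[parent_dims], parent_dims, child_dims)
    return cuboids

cuboids = build_by_layer(BASE, DIMS)
pv, uv = cuboids[("device",)][("IOS",)]
print(pv, bin(uv).count("1"))
```

Each layer is derived from the one directly above it rather than from the raw data, which is what keeps the per‑layer work small.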
Bitmap‑Based Precise Distinct Counting
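Each distinct user ID is represented by a bit in a RoaringBitmap. PV values can simply be summed, but UV values cannot, because the same user may appear in several rows; instead, the bitmaps of the contributing rows are combined with OR and the set bits are counted, which avoids double‑counting.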
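A small worked example shows why the OR step matters. It is only a sketch: a plain Python integer stands in for a RoaringBitmap, and the user IDs and the helper names `bitmap_of` and `cardinality` are made up for illustration.

```python
def bitmap_of(user_ids):
    """Set one bit per user ID (a plain int stands in for a RoaringBitmap)."""
    bits = 0
    for uid in user_ids:
        bits |= 1 << uid
    return bits

def cardinality(bits):
    """Number of set bits, i.e. the number of distinct users."""
    return bin(bits).count("1")

ios = bitmap_of([1, 2, 3])
android = bitmap_of([3, 4])  # user 3 visited from both devices

naive_uv = cardinality(ios) + cardinality(android)  # counts user 3 twice
exact_uv = cardinality(ios | android)               # OR merges the shared bit

print(naive_uv, exact_uv)
```

Adding the two counts over‑reports the total because user 3 is in both bitmaps, while OR‑ing the bitmaps first yields the exact distinct count.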
Dimension Optimization
Kylin provides several optimization methods to reduce the exponential number of cuboids:
Aggregation groups – only dimensions in the group participate in cuboid generation.
Derived dimensions – lookup‑table dimensions that can be inferred from the table's primary key, so only the key takes part in cuboid generation.
Mandatory dimensions – dimensions that must appear in every cuboid (e.g., date).
Hierarchical dimensions – dimensions with a natural hierarchy (country → province → city).
Joint dimensions – multiple dimensions bundled as a single dimension.
HBase Storage and Query Example
Cuboid rows are stored in HBase with a RowKey composed of a cuboid‑ID (binary flags marking which dimensions are present) followed by the dimension values. For example, the RowKey for (IOS,1,1) with cuboid‑ID 111 is 111+IOS+1+1, and the stored values are PV=2, UV=1.
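The RowKey layout can be illustrated with a readable string version. This is a sketch under assumptions: Kylin stores the cuboid ID and the dimension values in a binary encoding, whereas the `+` separator and the helper names `cuboid_id` and `row_key` below exist only to make the structure visible.

```python
DIMS = ("device", "subject", "grade")

def cuboid_id(present):
    """One flag per dimension, in a fixed order: '1' if the cuboid keeps it."""
    return "".join("1" if d in present else "0" for d in DIMS)

def row_key(values):
    """Build a readable RowKey from a dict of {dimension: value}."""
    parts = [cuboid_id(values)]
    parts += [str(values[d]) for d in DIMS if d in values]
    return "+".join(parts)

print(row_key({"device": "IOS", "subject": 1, "grade": 1}))
print(row_key({"device": "IOS"}))
```

Because the cuboid ID comes first, all rows of one cuboid sit next to each other in HBase's sorted key space and can be read with a single prefix scan.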
SQL query:
select device, count(distinct uid)
from visit_log
where grade=1 and subject=1
group by device;
Kylin parses the query, identifies the involved dimensions (device, grade, subject), locates the corresponding rows in HBase (RowKey prefix 111), and retrieves the pre‑computed UV bitmap to produce the result instantly.
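The lookup can be mimicked with an in‑memory dictionary. The sketch below is an assumption‑laden toy, not Kylin's query path: `STORE`, its sample rows, and the functions `cuboid_for` and `uv_by_device` are hypothetical, and a loop over the dictionary stands in for an HBase prefix scan.

```python
DIMS = ("device", "subject", "grade")

# Hypothetical rows: (cuboid_id, *dimension values) -> (pv, uv_bitmap).
STORE = {
    ("111", "IOS", 1, 1): (2, 0b001),
    ("111", "IOS", 2, 1): (1, 0b010),
    ("111", "Android", 1, 1): (3, 0b011),
    ("100", "IOS"): (3, 0b011),
    ("100", "Android"): (3, 0b011),
}

def cuboid_for(columns):
    """Pick the cuboid whose flags cover the group-by and filter columns."""
    return "".join("1" if d in columns else "0" for d in DIMS)

def uv_by_device(store, subject, grade):
    """Answer: select device, count(distinct uid) ... where subject/grade ... group by device."""
    target = cuboid_for({"device", "subject", "grade"})
    merged = {}
    for key, (_pv, uv) in store.items():
        if key[0] != target:  # stands in for an HBase scan on the RowKey prefix
            continue
        _cid, device, row_subject, row_grade = key
        if row_subject == subject and row_grade == grade:
            merged[device] = merged.get(device, 0) | uv
    return {device: bin(bits).count("1") for device, bits in merged.items()}

result = uv_by_device(STORE, subject=1, grade=1)
print(result)
```

No raw visit records are touched: the answer comes entirely from rows that were aggregated at build time.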
Dimension Optimization Techniques Summary
Aggregation groups limit cuboid generation to selected dimensions.
Derived dimensions collapse many columns into a single logical dimension.
Each mandatory dimension halves the cuboid space.
Hierarchical dimensions reduce the space for the n dimensions in a hierarchy from 2ⁿ to n+1.
Joint dimensions treat n tightly coupled dimensions as one, so they appear in a cuboid together or not at all.
These techniques together control the number of cuboids, improve build time, and accelerate query performance.
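The reductions listed above can be checked by brute‑force enumeration. The sketch below is illustrative only: the dimension names and the function `count_cuboids` are assumptions, the empty combination is counted as a cuboid, and Kylin's own planner applies these rules per aggregation group rather than with a loop like this.

```python
from itertools import combinations

def count_cuboids(dims, mandatory=(), hierarchy=(), joint=()):
    """Count the dimension combinations that survive the given pruning rules."""
    count = 0
    for size in range(len(dims) + 1):
        for combo in combinations(dims, size):
            chosen = set(combo)
            # Mandatory: every cuboid must contain these dimensions.
            if not set(mandatory) <= chosen:
                continue
            # Hierarchy: a lower level is only allowed together with the level above it.
            if any(hierarchy[i] in chosen and hierarchy[i - 1] not in chosen
                   for i in range(1, len(hierarchy))):
                continue
            # Joint: the bundled dimensions appear all together or not at all.
            picked = chosen & set(joint)
            if picked and picked != set(joint):
                continue
            count += 1
    return count

DIMS = ("date", "country", "province", "city", "device")
GEO = ("country", "province", "city")

print(count_cuboids(DIMS))                      # full cube
print(count_cuboids(DIMS, mandatory=("date",)))
print(count_cuboids(DIMS, hierarchy=GEO))
print(count_cuboids(DIMS, joint=GEO))
```

With five dimensions the full cube has 32 combinations; one mandatory dimension leaves 16, the three‑level hierarchy leaves 16, and bundling the same three dimensions as a joint group leaves 8.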
Author
Yang Jun, Data Algorithm Engineer at TAL Education Group, responsible for real‑time platform construction.
Xueersi Online School Tech Team
The Xueersi Online School Tech Team, dedicated to innovating and promoting internet education technology.