Design Principles and Architecture of Apache Kylin for Sub‑Second OLAP Queries

This article explains how Apache Kylin, an open‑source distributed analytics engine built on Hadoop/Spark, achieves sub‑second OLAP query performance through pre‑computed cubes, a layered cuboid generation algorithm, bitmap‑based distinct counting, dimension optimization techniques, and tight integration with HBase for storage and fast SQL querying.

Xueersi Online School Tech Team

Sep 27, 2019

What is Apache Kylin

Apache Kylin™ is an open‑source distributed analytical engine that provides a SQL interface and OLAP capabilities on top of Hadoop/Spark, enabling interactive, sub‑second queries over massive datasets.

Main Features

Accelerated queries with sub‑second response time via pre‑computed aggregations.

Standard SQL interface compatible with Hive tables.

Interactive query experience comparable to traditional BI tools.

Scalable architecture with high throughput.

Seamless integration with BI products such as Tableau, PowerBI, QlikSense, Hue, and Superset.

Core Components

REST Server: Entry point for applications, exposing query, cube build, metadata, and permission APIs.

Query Engine: Parses user SQL, interacts with the other components, and returns results.

Routing: Converts the SQL execution plan into look-ups against the cube data stored in HBase.

Metadata Manager: Manages all metadata, most importantly cube definitions; the metadata is stored in HBase.

Cube Build Engine: Executes offline tasks (Shell, Java API, MapReduce) to generate cubes.

Storage Engine: Persists cuboids as key-value pairs in HBase.

Pre‑Computation Principle

Kylin achieves sub-second queries by pre-computing aggregation results offline for the dimension combinations a cube defines (its cuboids) and storing them in HBase. A query is then answered by looking up these pre-computed results instead of scanning the raw Hive tables.
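To make the size of the pre-computation concrete, the sketch below enumerates every dimension combination for a three-dimension cube. It is only an illustration: the dimension names are borrowed from the example later in this article, and the function name enumerate_cuboids is hypothetical, not part of Kylin.

```python
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions, most detailed first."""
    cuboids = []
    for size in range(len(dimensions), -1, -1):
        for combo in combinations(dimensions, size):
            cuboids.append(combo)
    return cuboids

# Three dimensions give 2**3 = 8 combinations, including the empty one
# (the grand total with no GROUP BY column).
cuboids = enumerate_cuboids(["device", "subject", "grade"])
for dims in cuboids:
    print(dims)
```

With N dimensions the list has 2^N entries, which is why the optimization techniques described later matter once a cube has more than a handful of dimensions.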

Cuboid Generation (By‑Layer Algorithm)

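The algorithm builds cuboids layer by layer: it first aggregates the raw data on all N dimensions (the base cuboid), then derives each (N-1)-dimension cuboid from that result, and so on up to the coarsest level. Distinct-count information is preserved along the way by storing bitmaps rather than plain counts.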

Example SQL for the base layer (device, subject, grade):

select device,subject,grade,count(uid) as pv,count(distinct uid) as uv
from visit_log
group by device,subject,grade;

Each resulting cuboid row holds a PV value and a UV value. UV is stored as a bitmap rather than a plain number; for example, 1(001) denotes a UV of 1 backed by the bitmap 001.

Higher‑level cuboids are derived by summing PV values and performing bitmap OR operations for UV to retain accurate distinct counts.
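The layer-by-layer derivation can be sketched in a few lines of Python. This is a simplified, single-process illustration, not Kylin's MapReduce implementation: the sample rows in VISIT_LOG are made up, a plain Python integer stands in for the bitmap (bit i is set when uid i was seen), and all function names are hypothetical.

```python
from collections import defaultdict

DIMS = ("device", "subject", "grade")

# Hypothetical raw rows: (device, subject, grade, uid)
VISIT_LOG = [
    ("IOS", 1, 1, 0),
    ("IOS", 1, 1, 0),
    ("IOS", 1, 2, 1),
    ("Android", 1, 1, 0),
    ("Android", 2, 1, 2),
]

def build_base_cuboid(rows):
    """Aggregate raw rows on all dimensions: key -> [pv, uv_bitmap]."""
    cuboid = defaultdict(lambda: [0, 0])
    for device, subject, grade, uid in rows:
        cell = cuboid[(device, subject, grade)]
        cell[0] += 1          # PV: one more visit
        cell[1] |= 1 << uid   # UV: set the bit for this uid
    return dict(cuboid)

def roll_up(parent, drop_index):
    """Derive a coarser cuboid by dropping one dimension from its parent."""
    child = defaultdict(lambda: [0, 0])
    for key, (pv, bitmap) in parent.items():
        child_key = key[:drop_index] + key[drop_index + 1:]
        child[child_key][0] += pv      # PV values are summed
        child[child_key][1] |= bitmap  # UV bitmaps are OR-ed
    return dict(child)

def build_by_layer(rows):
    """Build every cuboid, one layer at a time, from N dimensions down to 0."""
    layers = {DIMS: build_base_cuboid(rows)}
    frontier = [DIMS]
    while frontier:
        next_frontier = []
        for dims in frontier:
            for i in range(len(dims)):
                child_dims = dims[:i] + dims[i + 1:]
                if child_dims in layers:
                    continue  # already derived from another parent
                layers[child_dims] = roll_up(layers[dims], i)
                next_frontier.append(child_dims)
        frontier = next_frontier
    return layers

layers = build_by_layer(VISIT_LOG)
pv, bitmap = layers[("device",)][("IOS",)]
print(pv, bin(bitmap).count("1"))  # PV and UV for device=IOS
```

Only the base cuboid reads the raw rows; every other cuboid is computed from an already aggregated parent, which is the point of building by layer.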

Bitmap‑Based Precise Distinct Counting

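Each distinct user ID is represented by one bit in a RoaringBitmap. PV values can simply be summed when rows are merged, but UV values cannot: a user who appears in several rows would be counted more than once. Instead, the bitmaps of the contributing rows are OR-ed together and the number of set bits in the result gives the exact distinct count.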
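A small example shows why the OR is needed. The sketch uses a plain Python integer as a stand-in for a RoaringBitmap, and the uid values are invented for illustration.

```python
def to_bitmap(uids):
    """Set bit i for every uid i in the input."""
    bitmap = 0
    for uid in uids:
        bitmap |= 1 << uid
    return bitmap

def cardinality(bitmap):
    """Number of set bits, i.e. the distinct count."""
    return bin(bitmap).count("1")

grade1 = to_bitmap([0, 3])  # users seen in grade 1
grade2 = to_bitmap([0, 1])  # users seen in grade 2; uid 0 is in both

naive_uv = cardinality(grade1) + cardinality(grade2)  # counts uid 0 twice
exact_uv = cardinality(grade1 | grade2)               # counts uid 0 once
print(naive_uv, exact_uv)
```

Adding the two counts gives 4, while the OR-ed bitmap gives the correct 3, because uid 0 occupies the same bit in both inputs.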

Dimension Optimization

Kylin provides several optimization methods to reduce the exponential number of cuboids:

Aggregation groups – only dimensions in the group participate in cuboid generation.

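Derived dimensions – lookup-table columns that can be derived from the lookup table's primary key are left out of the cuboids; only the key participates, which curbs the combinatorial explosion.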

Mandatory dimensions – dimensions that must appear in every cuboid (e.g., date).

Hierarchical dimensions – dimensions with a natural hierarchy (country → province → city).

Joint dimensions – multiple dimensions bundled as a single dimension.

HBase Storage and Query Example

Cuboid rows are stored in HBase with a RowKey composed of a cuboid ID (one binary flag per dimension, set when the dimension is present) followed by the dimension values. For example, the row (IOS, 1, 1) in cuboid 111 has the RowKey 111+IOS+1+1 and stores the values PV=2 and UV=1.
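The RowKey layout can be sketched as follows. The sketch keeps the readable "+"-separated notation used in this article and should be read as an assumption for illustration only; Kylin's real RowKeys are binary-encoded, and the helper names cuboid_id and row_key are hypothetical.

```python
DIMS = ("device", "subject", "grade")

def cuboid_id(present_dims):
    """One flag per dimension, in DIMS order: '1' if present, else '0'."""
    return "".join("1" if d in present_dims else "0" for d in DIMS)

def row_key(values):
    """values maps dimension name -> value for the dimensions in the cuboid."""
    parts = [cuboid_id(values.keys())]
    parts += [str(values[d]) for d in DIMS if d in values]
    return "+".join(parts)

print(row_key({"device": "IOS", "subject": 1, "grade": 1}))  # 111+IOS+1+1
print(row_key({"device": "IOS", "grade": 1}))                # 101+IOS+1
```

Because the cuboid ID comes first, all rows of one cuboid are stored next to each other, so a query that targets a single cuboid can be served by a prefix scan.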

SQL query:

select device, count(distinct uid)
from visit_log
where grade=1 and subject=1
group by device;

Kylin parses the query, identifies the dimensions involved (device, grade, subject), scans the HBase rows of the matching cuboid (RowKey prefix 111), keeps those with subject=1 and grade=1, and reads the pre-computed UV bitmap for each device. No raw data is scanned at query time.
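The look-up for this query can be imitated with an in-memory dictionary standing in for HBase. Everything here is a sketch under stated assumptions: the stored rows in STORE are invented, the RowKeys use the article's readable notation instead of Kylin's binary encoding, and uv_by_device is a hypothetical helper, not a Kylin API.

```python
# Hypothetical stored rows: row key -> (pv, uv_bitmap)
STORE = {
    "111+IOS+1+1": (2, 0b001),
    "111+IOS+1+2": (1, 0b010),
    "111+Android+1+1": (1, 0b001),
    "111+Android+2+1": (1, 0b100),
    "100+IOS": (3, 0b011),
    "100+Android": (2, 0b101),
}

def uv_by_device(store, subject, grade):
    """Answer: select device, count(distinct uid) ... group by device."""
    merged = {}
    for key, (_pv, bitmap) in store.items():
        cuboid, *values = key.split("+")
        if cuboid != "111":
            continue  # only the (device, subject, grade) cuboid is relevant
        device, row_subject, row_grade = values
        if row_subject == str(subject) and row_grade == str(grade):
            # OR the bitmaps of all matching rows for the same device
            merged[device] = merged.get(device, 0) | bitmap
    return {device: bin(b).count("1") for device, b in merged.items()}

print(uv_by_device(STORE, subject=1, grade=1))
```

The loop over the whole dictionary plays the role of the prefix scan; in HBase the rows sharing the prefix 111 are contiguous, so only that range would be read.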

Dimension Optimization Techniques Summary

Aggregation groups limit cuboid generation to selected dimensions.

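Derived dimensions replace several lookup-table columns with the single key they are derived from.

Each mandatory dimension halves the cuboid space, because combinations that omit it are never built.

Hierarchical dimensions reduce the combinations of the n dimensions in a hierarchy from 2ⁿ to n+1.

Joint dimensions treat n tightly coupled dimensions as one, so they appear together or not at all, which reduces their 2ⁿ combinations to 2.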

These techniques together control the number of cuboids, improve build time, and accelerate query performance.
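The effect of each rule on the cuboid count can be checked with simple arithmetic. The functions below are a back-of-the-envelope sketch, not Kylin code: each applies one rule in isolation to a cube of n dimensions, and every count includes the combination in which the affected group is absent.

```python
def full_cube(n):
    """Every subset of n dimensions."""
    return 2 ** n

def with_mandatory(n, m):
    """m of the n dimensions must appear in every cuboid."""
    return 2 ** (n - m)

def with_hierarchy(n, h):
    """h of the n dimensions form one hierarchy: only its h + 1 prefixes
    (none, level 1, levels 1-2, ...) are valid combinations."""
    return (h + 1) * 2 ** (n - h)

def with_joint(n, j):
    """j of the n dimensions are bundled: all present or all absent."""
    return 2 * 2 ** (n - j)

n = 10
print(full_cube(n))          # 1024
print(with_mandatory(n, 1))  # 512
print(with_hierarchy(n, 3))  # 512
print(with_joint(n, 3))      # 256
```

For a 10-dimension cube, one mandatory dimension or one three-level hierarchy each cut the 1024 cuboids to 512, and bundling three dimensions as a joint dimension cuts them to 256.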

Author

Yang Jun, Data Algorithm Engineer at TAL Education Group, responsible for real‑time platform construction.
