
Kylin Cube Construction Principles and Optimization Techniques

This article explains the fundamentals of Kylin Cube construction—including dimensions, measures, Cuboid generation, layer-by-layer and in‑memory building algorithms, storage mechanisms, and various optimization strategies such as derived dimensions, aggregation groups, row‑key design, and concurrency granularity—providing a comprehensive guide for big‑data OLAP practitioners.

Big Data Technology & Architecture

Kylin Cube Construction Principles

Dimension

A dimension is the perspective from which data is observed, e.g., gender or hire date in employee data. Records that share the same dimension values are grouped together and aggregated.

Measure

A measure is the statistical value being aggregated, that is, the result of an aggregation, such as the number of employees per gender.

Cube and Cuboid

All fields in a data model are classified as either dimensions or measures. For a model with n dimensions, there are 2ⁿ possible combinations, each materialized as a Cuboid. The full set of Cuboids forms a Cube.

Example: an e‑commerce sales dataset with dimensions time, item, location, supplier and measure sales amount yields 16 Cuboids (4 one‑dimensional, 6 two‑dimensional, 4 three‑dimensional, plus 0‑D and 4‑D).
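The count in this example can be checked by enumerating every subset of the four dimensions. This is a minimal sketch in plain Python; it only lists the combinations and does not touch Kylin itself.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Every subset of the dimension list is one Cuboid, including the empty
# subset (0-D, the grand total) and the full set (4-D).
cuboids = [
    combo
    for k in range(len(dimensions) + 1)
    for combo in combinations(dimensions, k)
]

# Count Cuboids per number of dimensions.
counts_by_level = {}
for cuboid in cuboids:
    counts_by_level[len(cuboid)] = counts_by_level.get(len(cuboid), 0) + 1

print(len(cuboids))      # 16
print(counts_by_level)   # {0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
```

Because the total is 2ⁿ, every added dimension doubles the number of Cuboids, which is why the pruning techniques later in the article matter.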

Cube Construction Algorithms

The layer-by-layer algorithm builds the Cube from the highest-dimensional Cuboid (the base Cuboid, computed from the source data) down to lower dimensions, using the results of the previous layer to compute the next. Each layer corresponds to one MapReduce job, so an n-dimensional Cube requires at least n jobs.

Advantages: it leverages MapReduce’s sorting and shuffle, the code is straightforward, and it runs stably on Hadoop clusters. Disadvantages: high-dimensional Cubes need many MapReduce jobs, shuffle traffic is heavy, HDFS reads and writes are extensive, and an extra job is needed to convert the output to HBase HFiles.
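The layer-by-layer idea can be sketched in plain Python, with an in-process loop standing in for each MapReduce job. The sample rows, the SUM measure, and the function names `aggregate` and `build_by_layer` are assumptions made for illustration; this is not Kylin's implementation.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("time", "item", "location")

# Hypothetical fact rows: (time, item, location, sales_amount).
ROWS = [
    ("2024", "phone", "NY", 100),
    ("2024", "phone", "LA", 50),
    ("2024", "laptop", "NY", 200),
    ("2025", "phone", "NY", 70),
]


def aggregate(records, parent_dims, child_dims):
    """Roll records keyed by parent_dims up to child_dims by summing the measure."""
    keep = [parent_dims.index(d) for d in child_dims]
    out = defaultdict(int)
    for key, value in records.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


def build_by_layer(rows, dims):
    """Return {cuboid_dims: {key: sum}}, built one layer at a time."""
    # Top layer: the base Cuboid is aggregated straight from the source rows.
    base = defaultdict(int)
    for *key, amount in rows:
        base[tuple(key)] += amount
    cube = {dims: dict(base)}

    # Lower layers: each Cuboid is derived from a parent in the previous layer.
    for size in range(len(dims) - 1, -1, -1):
        for child in combinations(dims, size):
            parent = next(
                p for p in cube if len(p) == size + 1 and set(child) <= set(p)
            )
            cube[child] = aggregate(cube[parent], parent, child)
    return cube


cube = build_by_layer(ROWS, DIMS)
print(cube[("time",)])   # {('2024',): 350, ('2025',): 70}
print(cube[()])          # {(): 420}
```

Each pass of the outer loop reads only the previous layer's output, which mirrors why the jobs must run one after another and why every layer's result is written out and read back.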

The in-memory fast construction algorithm (inmem), also called “by segment” or “by split”, lets each Mapper pre-aggregate all Cuboid combinations for its own data split and emit a complete, small Cube segment. Reducers then merge these segments into the final Cube, so all layers are completed in a single MapReduce round.

Key differences from the layer-by-layer algorithm: (1) Mappers pre-aggregate in memory, which reduces shuffle volume and removes the need for a Combiner; (2) only one MapReduce job is required, which greatly reduces scheduling overhead.
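The map and reduce roles described above can be sketched as follows. This is a simplification under stated assumptions: the splits, the SUM measure, and the names `map_split` and `reduce_partials` are hypothetical, and each row is expanded directly into every Cuboid, which is not necessarily how Kylin schedules the work inside a Mapper.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("time", "item")

# Hypothetical input splits, one list per Mapper: (time, item, sales_amount).
SPLITS = [
    [("2024", "phone", 100), ("2024", "phone", 50)],
    [("2024", "laptop", 200), ("2025", "phone", 70)],
]


def all_cuboids(dims):
    """All 2^n Cuboids of dims, as tuples of column positions."""
    idx = range(len(dims))
    return [c for k in range(len(dims) + 1) for c in combinations(idx, k)]


def map_split(rows, dims):
    """Mapper: pre-aggregate every Cuboid for one split, entirely in memory."""
    partial = defaultdict(int)
    for *key, amount in rows:
        for cuboid in all_cuboids(dims):
            partial[(cuboid, tuple(key[i] for i in cuboid))] += amount
    return partial


def reduce_partials(partials):
    """Reducer: merge the small per-split Cubes by summing matching keys."""
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


cube = reduce_partials(map_split(split, DIMS) for split in SPLITS)
print(cube[((0,), ("2024",))])   # 350
print(cube[((), ())])            # 420
```

In the first split, the two "phone" rows collapse into one record per Cuboid before anything leaves the Mapper, which is the source of the reduced shuffle volume.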

Cube Storage Principle

Kylin stores Cuboid data in HBase using a RowKey composed of all dimensions in a defined order. Proper RowKey design improves query filtering, reduces I/O, and speeds up lookups.

Kylin Cube Optimization

Derived Dimensions

Derived dimensions replace non‑key attributes in a dimension table with the table’s primary key (the foreign key in the fact table). Kylin records the mapping so that queries can translate the key back to the original attributes for real‑time aggregation. Use them only when the translation cost is low; otherwise, keep the original dimensions.
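A small sketch of the query-time translation, assuming a hypothetical user dimension table and a Cuboid that is materialized only on the foreign key `user_id`. The table contents and the function name `query_by_derived` are illustrative.

```python
from collections import defaultdict

# Hypothetical dimension table: primary key -> non-key attributes.
USER_DIM = {
    1: {"city": "Beijing", "gender": "F"},
    2: {"city": "Beijing", "gender": "M"},
    3: {"city": "Shanghai", "gender": "F"},
}

# Cuboid materialized on the foreign key only: user_id -> sum(sales).
CUBOID_BY_USER_ID = {1: 100, 2: 40, 3: 75}


def query_by_derived(attribute):
    """Answer GROUP BY <attribute> by translating keys and aggregating at query time."""
    result = defaultdict(int)
    for user_id, sales in CUBOID_BY_USER_ID.items():
        result[USER_DIM[user_id][attribute]] += sales
    return dict(result)


print(query_by_derived("city"))    # {'Beijing': 140, 'Shanghai': 75}
print(query_by_derived("gender"))  # {'F': 175, 'M': 40}
```

One dimension in the Cube serves both attributes, but the loop runs over every key at query time. That per-query work is the translation cost the paragraph above warns about: it grows with the number of distinct keys.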

Aggregation Groups

Aggregation groups are a pruning tool that partitions dimensions into logical groups. Each group independently contributes a set of Cuboids, and the union of all groups’ Cuboids forms the final materialized set. Three dimension types can be defined within a group:

Mandatory dimension: must appear in every Cuboid generated by the group.

Hierarchy dimension: dimensions appear only in hierarchical order, e.g., ( ), (D1), (D1, D2), …, (D1, …, Dn).

Joint dimension: a set of dimensions that either all appear together or do not appear at all.

Aggregation groups are configured in the Cube Designer’s Advanced Settings.
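The pruning effect of the three rules can be estimated by enumerating the Cuboids a group allows. The sketch below is an illustrative model, not Kylin's code: the function `group_cuboids`, its parameters, and the dimension names A to G are assumptions.

```python
from itertools import product


def group_cuboids(mandatory=(), hierarchies=(), joints=(), normal=()):
    """Enumerate the Cuboids one aggregation group contributes.

    mandatory:   dimensions present in every Cuboid
    hierarchies: tuples (D1, ..., Dn); only the prefixes (), (D1), (D1, D2), ... are allowed
    joints:      tuples whose members appear all together or not at all
    normal:      unconstrained dimensions
    """
    choices = []
    for h in hierarchies:
        choices.append([h[:i] for i in range(len(h) + 1)])
    for j in joints:
        choices.append([(), tuple(j)])
    for d in normal:
        choices.append([(), (d,)])

    cuboids = set()
    for picks in product(*choices):
        dims = set(mandatory)
        for pick in picks:
            dims.update(pick)
        cuboids.add(frozenset(dims))
    return cuboids


g1 = group_cuboids(
    mandatory=("A",),
    hierarchies=[("B", "C", "D")],
    joints=[("E", "F")],
    normal=("G",),
)
g2 = group_cuboids(normal=("A", "B"))

print(2 ** 7)         # 128 Cuboids without any pruning
print(len(g1))        # 16
print(len(g1 | g2))   # 18, the union of both groups
```

In this model the first group shrinks seven dimensions from 128 combinations to 1 × 4 × 2 × 2 = 16, and the union with the second group adds only the two Cuboids that the first group does not already contain.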

RowKey Optimization

RowKey design principles:

Place dimensions used in WHERE filters at the front.

Place high‑cardinality dimensions before low‑cardinality ones.
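The first principle can be illustrated with a toy model in which rows are stored sorted by RowKey, as they are in HBase. The dimension values and the helper `contiguous_ranges` are hypothetical; the sketch only counts how many separate key ranges a filter has to touch.

```python
from itertools import product

# Hypothetical dimension values; "date" is the column used in the WHERE filter.
DATES = ["d1", "d2", "d3"]
CITIES = ["c1", "c2", "c3", "c4"]


def contiguous_ranges(sorted_keys, matches):
    """Count the separate runs of matching keys in a sorted key list."""
    runs, inside = 0, False
    for key in sorted_keys:
        hit = matches(key)
        if hit and not inside:
            runs += 1
        inside = hit
    return runs


# RowKey order (date, city): the filtered dimension comes first.
date_first = sorted(product(DATES, CITIES))
# RowKey order (city, date): the filtered dimension comes last.
date_last = sorted(product(CITIES, DATES))

print(contiguous_ranges(date_first, lambda k: k[0] == "d2"))  # 1
print(contiguous_ranges(date_last, lambda k: k[1] == "d2"))   # 4
```

With the filtered dimension at the front, all matching rows sit in one contiguous range that a single scan can cover. With it at the back, the matches are scattered across as many ranges as there are values of the leading dimension.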

Concurrency Granularity Optimization

When a Cuboid exceeds a size threshold, Kylin splits it into multiple partitions (regions) to enable parallel reads. The number of partitions is derived from the estimated Segment size and the configuration kylin.hbase.region.cut (default 5 GB). Users can further control the minimum and maximum number of regions with kylin.hbase.region.count.min and kylin.hbase.region.count.max, tailoring concurrency per Cube.
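As a rough sketch of how these three settings interact, the function below divides the estimated Segment size by the cut size and clamps the result to the configured bounds. The function name, the rounding rule, and the sample bounds are assumptions for illustration; Kylin's exact calculation may differ.

```python
import math


def estimate_region_count(segment_size_gb, region_cut_gb, count_min, count_max):
    """Estimate the number of regions for a Segment (illustrative formula).

    region_cut_gb models kylin.hbase.region.cut; count_min and count_max
    model kylin.hbase.region.count.min and kylin.hbase.region.count.max.
    """
    raw = math.ceil(segment_size_gb / region_cut_gb)
    return max(count_min, min(count_max, raw))


# A 42 GB Segment with the 5 GB cut: ceil(42 / 5) = 9 regions.
print(estimate_region_count(42, 5.0, count_min=1, count_max=500))      # 9
# A small Segment is lifted to the minimum to keep some read parallelism.
print(estimate_region_count(0.5, 5.0, count_min=2, count_max=500))     # 2
# A very large Segment is capped at the maximum.
print(estimate_region_count(10_000, 5.0, count_min=1, count_max=500))  # 500
```

Lowering the cut size or raising the minimum increases read parallelism for a Cube, at the price of more regions for HBase to manage.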

All the above techniques help reduce Cube size, improve build efficiency, and accelerate query performance in large‑scale big‑data environments.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
