Kylin Cube Construction Principles and Optimization Techniques
This article explains the fundamentals of Kylin Cube construction—including dimensions, measures, Cuboid generation, layer-by-layer and in‑memory building algorithms, storage mechanisms, and various optimization strategies such as derived dimensions, aggregation groups, row‑key design, and concurrency granularity—providing a comprehensive guide for big‑data OLAP practitioners.
Kylin Cube Construction Principles
Dimension: a perspective from which data is observed, e.g., gender or hire date in employee data. Records that share the same dimension values are aggregated together for calculation.
Measure: the statistical value being aggregated, i.e., the result of an aggregate operation, such as the count of employees per gender.
Cube and Cuboid
All fields in a data model are classified as either dimensions or measures. For a model with n dimensions, there are 2ⁿ possible combinations, each materialized as a Cuboid. The full set of Cuboids forms a Cube.
Example: an e‑commerce sales dataset with dimensions time, item, location, supplier and measure sales amount yields 16 Cuboids (4 one‑dimensional, 6 two‑dimensional, 4 three‑dimensional, plus 0‑D and 4‑D).
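The count in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name enumerate_cuboids is made up for this article and is not part of Kylin.

```python
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions, from the 0-D Cuboid up to the n-D one."""
    cuboids = []
    for k in range(len(dimensions) + 1):
        cuboids.extend(combinations(dimensions, k))
    return cuboids

dims = ["time", "item", "location", "supplier"]
cuboids = enumerate_cuboids(dims)

# Number of Cuboids per layer: 0-D, 1-D, 2-D, 3-D, 4-D
counts_by_size = [sum(1 for c in cuboids if len(c) == k) for k in range(len(dims) + 1)]
print(len(cuboids))     # 16
print(counts_by_size)   # [1, 4, 6, 4, 1]
```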
Cube Construction Algorithms
The layer-by-layer algorithm builds the Cube from the highest-dimensional Cuboid (the base Cuboid, which contains all n dimensions) down to lower dimensions, using the result of each layer as the input for the next. Each layer corresponds to one MapReduce job; an n-dimensional Cube requires at least n jobs.
Advantages: leverages MapReduce’s sorting and shuffle, clear code, stable on Hadoop clusters. Disadvantages: many MapReduce jobs for high‑dimensional Cubes, heavy shuffle traffic, extensive HDFS read/write, and an extra job to convert output to HBase HFiles.
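The layering idea can be sketched in a single Python process. This is a simplified illustration, not Kylin's code: each pass of the while loop stands in for one MapReduce job, the measure is assumed to be a simple sum, and the names aggregate and build_layer_by_layer are hypothetical.

```python
from collections import defaultdict

def aggregate(parent_rows, parent_dims, child_dims):
    """Roll a parent Cuboid up to a child Cuboid by summing the measure."""
    idx = [parent_dims.index(d) for d in child_dims]
    out = defaultdict(int)
    for key, value in parent_rows.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def build_layer_by_layer(records, dims):
    """records: list of (tuple of dimension values, measure)."""
    # Layer n: the base Cuboid, aggregated directly from the source records.
    base = defaultdict(int)
    for key, value in records:
        base[tuple(key)] += value
    cube = {tuple(dims): dict(base)}
    layer = [tuple(dims)]
    # Each iteration derives the next lower layer from the previous one.
    while layer and len(layer[0]) > 0:
        next_layer = []
        for parent in layer:
            for i in range(len(parent)):
                child = parent[:i] + parent[i + 1:]
                if child not in cube:  # a child is computed from one parent only
                    cube[child] = aggregate(cube[parent], list(parent), list(child))
                    next_layer.append(child)
        layer = next_layer
    return cube

records = [
    (("2024", "book", "NY"), 10),
    (("2024", "pen", "NY"), 5),
    (("2023", "book", "LA"), 7),
]
cube = build_layer_by_layer(records, ["time", "item", "location"])
print(len(cube))         # 8
print(cube[("time",)])   # {('2024',): 15, ('2023',): 7}
```

Every child Cuboid reads the full output of its parent, which is why the real algorithm incurs the shuffle and HDFS traffic described above.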
The in-memory fast construction algorithm (inmem, also called "by segment" or "by split") lets each Mapper pre-aggregate all Cuboid combinations for its data split and emit a complete small Cube segment. Reducers then merge these segments into the final Cube, so all layers are computed in a single MapReduce round.
Key differences from the layer-by-layer algorithm: (1) Mappers perform in-memory pre-aggregation, which reduces shuffle volume and removes the need for a Combiner; (2) only one MapReduce job is required, which greatly reduces scheduling overhead.
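A minimal sketch of the map and reduce sides follows, assuming a sum measure. The functions map_split and reduce_partials are hypothetical names, and the sketch computes every Cuboid directly from the raw rows, which is a simplification of how an actual implementation would derive Cuboids from one another in memory.

```python
from collections import defaultdict
from itertools import combinations

def map_split(split, dims):
    """Mapper side: pre-aggregate every Cuboid for one data split in memory."""
    partial = defaultdict(int)
    n = len(dims)
    for key, value in split:
        for k in range(n + 1):
            for idx in combinations(range(n), k):
                cuboid = tuple(dims[i] for i in idx)
                partial[(cuboid, tuple(key[i] for i in idx))] += value
    return partial

def reduce_partials(partials):
    """Reducer side: merge the small per-split Cubes into the final Cube."""
    merged = defaultdict(int)
    for partial in partials:
        for cell, value in partial.items():
            merged[cell] += value
    return dict(merged)

dims = ["time", "item"]
splits = [
    [(("2024", "book"), 10), (("2024", "pen"), 5)],
    [(("2023", "book"), 7), (("2024", "book"), 3)],
]
merged = reduce_partials(map_split(s, dims) for s in splits)
print(merged[(("time",), ("2024",))])   # 18
print(merged[((), ())])                 # 25
```

Only already-aggregated cells leave each mapper, which is where the reduction in shuffle volume comes from.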
Cube Storage Principle
Kylin stores Cuboid data in HBase using a RowKey composed of all dimensions in a defined order. Proper RowKey design improves query filtering, reduces I/O, and speeds up lookups.
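To see why dimension order matters, consider the toy key layout below. It is an assumption made for illustration only: a numeric Cuboid ID followed by fixed-width dimension values. Kylin's real encoding (dictionaries, shard prefixes, and so on) is not reproduced here, and make_rowkey and prefix_scan are hypothetical helpers. Because HBase keeps rows sorted by key, a filter on the leading dimension turns into a contiguous range scan.

```python
from bisect import bisect_left

def make_rowkey(cuboid_id, values, width=6):
    """Toy RowKey: zero-padded Cuboid ID, then fixed-width dimension values in order."""
    fields = [str(v)[:width].ljust(width, "_") for v in values]
    return f"{cuboid_id:03d}" + "".join(fields)

def prefix_scan(sorted_keys, prefix):
    """Return the contiguous run of keys starting with prefix (keys must be sorted)."""
    start = bisect_left(sorted_keys, prefix)
    hits = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        hits.append(key)
    return hits

rows = [("2024", "book", "NY"), ("2024", "pen", "LA"), ("2023", "book", "NY")]
keys = sorted(make_rowkey(7, row) for row in rows)

# A filter on the first dimension (time = 2024) only touches the matching range.
hits = prefix_scan(keys, make_rowkey(7, ("2024",)))
print(len(hits))   # 2
```

A filter on a dimension that sits later in the key has no usable prefix, so the scan must cover a much wider range of rows.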
Kylin Cube Optimization
Derived Dimensions
Derived dimensions replace non‑key attributes in a dimension table with the table’s primary key (the foreign key in the fact table). Kylin records the mapping so that queries can translate the key back to the original attributes for real‑time aggregation. Use them only when the translation cost is low; otherwise, keep the original dimensions.
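The query-time translation can be sketched as follows. All data and names here (lookup, cuboid_by_fk, query_by_derived) are invented for illustration: the Cuboid is stored per foreign key only, and a query on a derived attribute such as city is answered by mapping each key back to its attribute and aggregating again.

```python
from collections import defaultdict

# Hypothetical dimension table snapshot: primary key -> non-key attributes
lookup = {
    1: {"city": "Beijing", "gender": "F"},
    2: {"city": "Beijing", "gender": "M"},
    3: {"city": "Shanghai", "gender": "F"},
}

# Precomputed Cuboid keyed by the foreign key only (measure: sales amount)
cuboid_by_fk = {1: 100, 2: 50, 3: 25}

def query_by_derived(cuboid, lookup, attribute):
    """Translate each key to the derived attribute, then re-aggregate at query time."""
    result = defaultdict(int)
    for fk, measure in cuboid.items():
        result[lookup[fk][attribute]] += measure
    return dict(result)

by_city = query_by_derived(cuboid_by_fk, lookup, "city")
print(by_city)   # {'Beijing': 150, 'Shanghai': 25}
```

The second aggregation runs over one row per key value, so its cost grows with the cardinality of the key. That is the translation cost to weigh before choosing a derived dimension.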
Aggregation Groups
Aggregation groups are a pruning tool that partitions dimensions into logical groups. Each group independently contributes a set of Cuboids, and the union of all groups’ Cuboids forms the final materialized set. Three dimension types can be defined within a group:
Mandatory dimension: must appear in every Cuboid generated by the group.
Hierarchy dimension: dimensions D1, D2, ..., Dn form a hierarchy, so only the prefix combinations (), (D1), (D1,D2), ..., (D1,...,Dn) are kept; a lower level never appears without its parent levels.
Joint dimension: a set of dimensions that either all appear together in a Cuboid or do not appear at all.
Aggregation groups are configured in the Cube Designer’s Advanced Settings.
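The pruning effect of the three rules can be estimated with a small enumeration. This is a sketch of the rules as stated above, not Kylin's Cuboid scheduler. The function group_cuboids and its parameters are hypothetical, and "normal" denotes dimensions in the group that carry no special rule.

```python
from itertools import product

def group_cuboids(mandatory=(), hierarchies=(), joints=(), normal=()):
    """Enumerate the Cuboids (as frozensets of dimension names) one group contributes."""
    options = []
    for h in hierarchies:
        # A hierarchy contributes only its prefixes: (), (D1), (D1, D2), ...
        options.append([tuple(h[:i]) for i in range(len(h) + 1)])
    for j in joints:
        # A joint set is either fully present or fully absent.
        options.append([(), tuple(j)])
    for d in normal:
        options.append([(), (d,)])
    result = set()
    for choice in product(*options):
        dims = set(mandatory)  # mandatory dimensions are always included
        for part in choice:
            dims.update(part)
        result.add(frozenset(dims))
    return result

group = group_cuboids(
    mandatory=("A",),
    hierarchies=[("B", "C")],
    joints=[("D", "E")],
    normal=("F",),
)
print(len(group))   # 12, versus 2**6 = 64 without pruning
```

With several groups, the final set is the union of each group's result, for example set().union(group_1, group_2).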
RowKey Optimization
RowKey design principles:
Place dimensions used in WHERE filters at the front.
Place high‑cardinality dimensions before low‑cardinality ones.
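The two principles can be combined into a simple sort key. This is a hypothetical heuristic for illustration, not a Kylin feature: the dimension metadata (used_in_filter, cardinality) is assumed to come from your own query statistics, and the resulting order still has to be set by hand in the Cube design.

```python
def order_rowkey(dims):
    """Filter dimensions first; within each class, higher cardinality first."""
    ranked = sorted(dims, key=lambda d: (not d["used_in_filter"], -d["cardinality"]))
    return [d["name"] for d in ranked]

dims = [
    {"name": "gender", "used_in_filter": False, "cardinality": 2},
    {"name": "user_id", "used_in_filter": True, "cardinality": 1_000_000},
    {"name": "city", "used_in_filter": True, "cardinality": 300},
    {"name": "item", "used_in_filter": False, "cardinality": 50_000},
]
rowkey_order = order_rowkey(dims)
print(rowkey_order)   # ['user_id', 'city', 'item', 'gender']
```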
Concurrency Granularity Optimization
When a Cuboid exceeds a size threshold, Kylin splits it into multiple partitions (regions) to enable parallel reads. The number of partitions is derived from the estimated Segment size and the configuration kylin.hbase.region.cut (default 5 GB). Users can further control the minimum and maximum number of regions with kylin.hbase.region.count.min and kylin.hbase.region.count.max, tailoring concurrency per Cube.
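The arithmetic behind the partition count can be sketched as below. This is an approximation under stated assumptions: the estimated size is divided by the cut size and rounded up (Kylin's exact rounding is not specified here), and the count_min and count_max defaults in the code are placeholders for whatever kylin.hbase.region.count.min and kylin.hbase.region.count.max are set to in your environment.

```python
import math

def region_count(estimated_size_gb, region_cut_gb=5.0, count_min=1, count_max=500):
    """Estimate the number of regions for a Segment, clamped to [count_min, count_max]."""
    raw = math.ceil(estimated_size_gb / region_cut_gb)  # assumption: round up
    return max(count_min, min(count_max, raw))

print(region_count(23))      # 5: a 23 GB Segment with a 5 GB cut
print(region_count(0.5))     # 1: small Segments still get the minimum
print(region_count(10_000))  # 500: capped by the maximum
```

Lowering the cut size for a frequently queried Cube yields more regions and therefore more parallel readers, at the cost of more regions for HBase to manage.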
All the above techniques help reduce Cube size, improve build efficiency, and accelerate query performance in large‑scale big‑data environments.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
