Comprehensive Guide to Apache Kylin: Architecture, Concepts, Cube Design and Optimization
This article provides an in‑depth overview of Apache Kylin’s pre‑computation architecture, data‑warehouse concepts, step‑by‑step cube creation from Hive tables, and advanced optimization techniques such as derived dimensions, aggregation groups, and HBase row‑key encoding to achieve sub‑second OLAP queries on massive datasets.
Apache Kylin adopts a "pre‑computation" model: users define query dimensions in advance, Kylin computes the results and stores them in HBase, delivering sub‑second query responses for massive data sets by trading space for time.
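The space-for-time trade can be illustrated with a minimal sketch. It is not Kylin's implementation: the sample rows, the dimension names and the helpers build_cube and query are all hypothetical. The sketch aggregates a measure for every combination of dimensions up front, so a later GROUP BY lookup becomes a dictionary read rather than a scan.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows: (year, city, category, sales)
ROWS = [
    (2023, "Beijing", "book", 10),
    (2023, "Beijing", "food", 20),
    (2023, "Shanghai", "book", 5),
    (2024, "Beijing", "book", 7),
]
DIMENSIONS = ("year", "city", "category")


def build_cube(rows):
    """Pre-aggregate SUM(sales) for every combination of dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dims)
                cuboid[key] += row[-1]
            cube[tuple(DIMENSIONS[i] for i in dims)] = dict(cuboid)
    return cube


def query(cube, group_by, key):
    """Look up a pre-computed result; group_by must follow DIMENSIONS order."""
    return cube[tuple(group_by)][tuple(key)]


cube = build_cube(ROWS)
print(len(cube))                                               # 8
print(query(cube, ["year"], [2023]))                           # 35
print(query(cube, ["city", "category"], ["Beijing", "book"]))  # 17
```

Three dimensions already produce eight pre-aggregated result sets, which is the storage cost paid for fast lookups.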
Kylin Architecture
Hadoop/Hive: Kylin is a MOLAP system that pre‑computes Hive data using MapReduce or Spark.
HBase: Stores the OLAP cube data for interactive multi‑dimensional queries.
REST Server: Provides RESTful APIs.
Query Engine: Uses the open‑source Calcite framework for SQL parsing.
Routing: Translates the execution plan produced from the SQL into queries against the pre‑computed cube data.
Metadata: Stores and manages Kylin’s metadata, including cube definitions.
Cube Build Engine: Responsible for creating cubes during pre‑computation.
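As a rough sketch of how a client might reach the REST server, the snippet below builds an HTTP request for a SQL query without sending it. The endpoint path /kylin/api/query and the JSON fields sql and project follow the Kylin REST documentation as the editor recalls it, so check them against your version. The host, port, credentials, project and table name are placeholders, and build_query_request is a hypothetical helper.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, project, sql):
    """Build (but do not send) an HTTP request for Kylin's SQL query endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps({"sql": sql, "project": project}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",  # assumed endpoint path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values: adjust host, port, credentials, project and table.
request = build_query_request(
    "http://localhost:7070", "ADMIN", "KYLIN", "learn_kylin",
    "SELECT COUNT(*) FROM kylin_sales",
)
print(request.get_method(), request.full_url)

# To run the query against a live server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```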
1. Concepts
Data Warehouse : The core of Business Intelligence (BI) that integrates historical data from multiple sources to support enterprise decision‑making, often containing redundant data for multi‑dimensional analysis.
OLAP (Online Analytical Processing) enables multi‑dimensional analysis of data for decision support, contrasting with OLTP (Online Transaction Processing) which focuses on CRUD operations.
Typical OLAP operations include:
Drill‑down: Move from higher‑level aggregates to finer details.
Roll‑up: Aggregate finer details into higher‑level summaries.
Slice: Select a specific value of a dimension.
Dice: Select a range or a set of values of a dimension.
Pivot: Swap rows and columns.
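The operations listed above can be sketched on a handful of in-memory records. The sample data and the helper names are hypothetical, and pivot is omitted because it only changes how a result is laid out, not what is computed.

```python
from collections import defaultdict

# Hypothetical sales records: (year, quarter, city, amount)
SALES = [
    (2023, "Q1", "Beijing", 100),
    (2023, "Q2", "Beijing", 150),
    (2023, "Q1", "Shanghai", 80),
    (2024, "Q1", "Beijing", 120),
    (2024, "Q1", "Shenzhen", 60),
]
COLUMNS = {"year": 0, "quarter": 1, "city": 2}


def aggregate(rows, dims):
    """SUM(amount) grouped by the given dimension names."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[COLUMNS[d]] for d in dims)] += row[3]
    return dict(totals)


def slice_rows(rows, dim, value):
    """Slice: keep rows where one dimension equals a single value."""
    return [r for r in rows if r[COLUMNS[dim]] == value]


def dice_rows(rows, dim, values):
    """Dice: keep rows where a dimension falls in a set of values."""
    return [r for r in rows if r[COLUMNS[dim]] in values]


by_year = aggregate(SALES, ["year"])                      # roll-up to years
by_year_quarter = aggregate(SALES, ["year", "quarter"])   # drill-down to quarters
beijing_only = aggregate(slice_rows(SALES, "city", "Beijing"), ["year"])
two_cities = aggregate(dice_rows(SALES, "city", {"Beijing", "Shanghai"}), ["city"])

print(by_year)        # {(2023,): 330, (2024,): 180}
print(by_year_quarter)
print(beijing_only)   # {(2023,): 250, (2024,): 120}
print(two_cities)     # {('Beijing',): 370, ('Shanghai',): 80}
```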
Dimensions and Measures : Dimensions are attributes such as time or location; measures are numeric calculations like total sales or user count.
Fact Table and Dimension Table : Fact tables store event records; dimension tables store attribute values and are linked to fact tables, reducing redundancy.
Model Types :
Star schema: One fact table linked to multiple dimension tables.
Snowflake schema: Normalized dimension tables forming deeper hierarchies.
Constellation schema: Multiple fact tables sharing dimension tables.
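A star schema can be pictured with plain dictionaries. In this hypothetical sketch the fact rows carry only foreign keys and a measure, and descriptive attributes are looked up in the dimension tables at aggregation time. All table and column names are illustrative.

```python
# Hypothetical star schema: one fact table plus two dimension tables.
DIM_PRODUCT = {
    1: {"name": "novel", "category": "book"},
    2: {"name": "rice", "category": "food"},
}
DIM_CITY = {
    10: {"city": "Beijing", "country": "China"},
    20: {"city": "Shanghai", "country": "China"},
}
# Fact rows hold foreign keys and measures only: (product_id, city_id, amount)
FACT_SALES = [(1, 10, 30), (2, 10, 12), (1, 20, 18), (2, 20, 5)]


def sales_by_category(fact_rows):
    """Join each fact row to DIM_PRODUCT and sum the amount per category."""
    totals = {}
    for product_id, _city_id, amount in fact_rows:
        category = DIM_PRODUCT[product_id]["category"]
        totals[category] = totals.get(category, 0) + amount
    return totals


print(sales_by_category(FACT_SALES))  # {'book': 48, 'food': 17}
```

Storing the category once per product, rather than once per sale, is the redundancy reduction that dimension tables provide.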
2. Preparation
1. Prepare data in Hive: Data to be analyzed must be stored as Hive tables (or views) so Kylin can import them and build cubes. Views allow preprocessing such as adding dimensions.
2. Design dimension tables: keep cardinality moderate, ensure primary keys are unique, and avoid using Hive views as dimension tables.
Ultra‑high‑cardinality (UHC) dimensions can inflate cube size considerably; Kylin supports them, but they require careful handling.
A cube should not exceed 30 dimensions: every additional dimension doubles the number of possible dimension combinations, a problem known as the "curse of dimensionality".
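To see why the dimension count matters, note that a cube with N dimensions and no pruning has 2^N possible dimension combinations (cuboids). The short calculation below uses a hypothetical helper name to show the growth.

```python
def cuboid_count(num_dimensions):
    """Number of cuboids in a full cube without any pruning."""
    return 2 ** num_dimensions


for n in (10, 20, 30):
    print(n, cuboid_count(n))
# 10 1024
# 20 1048576
# 30 1073741824
```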
3. Cube Design Process
3.1 Import Hive Table
In the Kylin UI, select Model → Data Source → Load Hive Table to import the Hive table definition.
3.2 Create Data Model
Click New → New Model to define a star or snowflake model. Choose a fact table, add dimension tables, specify join type (inner or left) and primary/foreign keys.
Select columns that will serve as dimensions or measures; measures must come from the fact table.
3.3 Design Cube
1) Provide a unique cube name and description.
2) Add dimensions (including derived dimensions) one by one.
3) Add measures – Kylin creates a default COUNT(1) measure; additional measures such as SUM, MIN, MAX, COUNT DISTINCT, TOP_N, RAW can be added.
4) Configure cube refresh settings: auto‑merge thresholds, volatile range, retention period.
5) Advanced settings – define aggregation groups and row‑key encoding.
Kylin stores cubes in HBase as key‑value pairs; the row‑key is a concatenation of encoded dimension values. Dictionary encoding is the default, and integer and fixed‑length encodings are also available. Dictionary encoding is compact for low‑cardinality dimensions but may cause memory pressure for high‑cardinality ones.
Supported encodings include Date (3 bytes), Time (4 bytes), Integer (custom length), Fixed‑length.
Row‑key order matters: place mandatory dimensions first, then high‑cardinality filter dimensions, followed by low‑cardinality ones to improve scan efficiency.
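The idea of dictionary-encoded row keys can be sketched as follows. This is a simplified illustration, not Kylin's actual byte layout: it omits details such as the cuboid identifier prefix, and every function name here is hypothetical. Each distinct value is mapped to a small integer ID, and the fixed-width IDs are concatenated in row-key order.

```python
def build_dictionary(values):
    """Map each distinct value to a small integer ID, in sorted order."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


def id_width(dictionary):
    """Bytes needed to store the largest ID in the dictionary."""
    max_id = max(len(dictionary) - 1, 0)
    return max(1, (max_id.bit_length() + 7) // 8)


def encode_row_key(row, dim_order, dictionaries):
    """Concatenate fixed-width dictionary IDs in row-key order."""
    parts = []
    for dim in dim_order:
        dictionary = dictionaries[dim]
        parts.append(dictionary[row[dim]].to_bytes(id_width(dictionary), "big"))
    return b"".join(parts)


rows = [
    {"country": "CN", "city": "Beijing"},
    {"country": "CN", "city": "Shanghai"},
    {"country": "US", "city": "Seattle"},
]
dim_order = ["country", "city"]  # the first dimension becomes the key prefix
dictionaries = {d: build_dictionary(r[d] for r in rows) for d in dim_order}

for row in rows:
    print(row, encode_row_key(row, dim_order, dictionaries).hex())
```

Because keys are compared byte by byte, a filter on the first dimension in the order narrows a scan to one contiguous key range, which is why dimension order affects scan efficiency.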
Other important settings:
Mandatory Cuboids: whitelist of dimension combinations to guarantee construction.
Cube Engine: choose MapReduce for complex measures (COUNT DISTINCT, TOP_N) or Spark for simple SUM/MIN/MAX.
Advanced ColumnFamily: separate heavy measures into additional column families to reduce I/O.
Example of adding a custom HBase property: kylin.hbase.region.cut=2
Other region settings: kylin.hbase.region.count.min and kylin.hbase.region.count.max
4. Cube Building
The build process (full or incremental) follows the same steps:
Create temporary Hive flat tables.
Re‑balance data to avoid skew.
Generate distinct column files for fact tables.
Build dimension dictionaries.
Save cuboid statistics.
Create HBase tables.
Build the basic cuboid.
Build N‑dimensional cuboids.
In‑memory cuboid construction.
Convert results to HFile.
Load HFile into HBase.
Update cube metadata.
Garbage‑collect temporary files.
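Building N‑dimensional cuboids is the most time‑consuming phase because the number of cuboids grows combinatorially. With N dimensions, the layer that keeps k dimensions contains C(N, k) cuboids, a count that peaks near k = N/2; once roughly half the layers are done, each remaining layer holds fewer cuboids and the pace accelerates.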
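The per-layer workload can be estimated with binomial coefficients. The sketch below assumes a layer-by-layer build that starts from the base cuboid with all N dimensions and drops one dimension per layer; cuboids_per_layer is a hypothetical helper.

```python
from math import comb


def cuboids_per_layer(num_dimensions):
    """Map each layer (number of dimensions kept) to its cuboid count."""
    return {k: comb(num_dimensions, k) for k in range(num_dimensions, -1, -1)}


layers = cuboids_per_layer(10)
for k, count in layers.items():
    print(f"{k:2d}-dimensional cuboids: {count}")
print("total:", sum(layers.values()))  # total: 1024
```

For 10 dimensions the largest layer is the one keeping 5 dimensions, with 252 cuboids; the layers after it shrink symmetrically.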
5. Cube Pruning and Optimization
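Cube bloat is measured by the expansion rate, the ratio of cube size to source data size, which normally stays between 0% and 1000%. A higher rate usually points to excessive dimensions, high‑cardinality dimensions, or heavy measures like COUNT DISTINCT.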
Optimization strategies:
Derived Dimensions : Replace non‑key dimensions with the primary key of the dimension table, reducing the number of cuboids.
Aggregation Groups : Split dimensions into independent groups; each group materializes cuboids only once, dramatically reducing total cuboid count (e.g., 2^(m+n) → 2^m + 2^n).
Mandatory Dimensions : Force certain dimensions to appear in every cuboid, eliminating unnecessary combinations.
Hierarchy : Define parent‑child dimension relationships (e.g., country → province → city) so that cuboids containing a child level without its parent levels are not built.
Joint Dimensions : Group dimensions that are always queried together, ensuring they appear together or not at all.
Isolate high‑cardinality dimensions in their own aggregation group to prevent explosion of cuboids involving them.
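The effect of the pruning rules above can be sketched by enumerating dimension combinations and discarding those a rule forbids. This is an illustrative model, not Kylin's cuboid scheduler: the dimension names and helper functions are hypothetical, and the empty combination is counted for simplicity.

```python
from itertools import combinations


def valid_cuboids(dims, mandatory=(), hierarchy=(), joint=()):
    """Enumerate dimension combinations that survive the pruning rules."""
    result = []
    for size in range(len(dims) + 1):
        for combo in combinations(dims, size):
            chosen = set(combo)
            # Mandatory: every cuboid must contain these dimensions.
            if not set(mandatory) <= chosen:
                continue
            # Hierarchy: a child level may appear only with all its ancestors.
            if any(level in chosen and not set(hierarchy[:i]) <= chosen
                   for i, level in enumerate(hierarchy)):
                continue
            # Joint: the grouped dimensions appear all together or not at all.
            if 0 < len(chosen & set(joint)) < len(set(joint)):
                continue
            result.append(combo)
    return result


def aggregation_group_count(group_sizes):
    """Cuboids when dimensions are split into independent groups."""
    return sum(2 ** size for size in group_sizes)


DIMS = ["date", "country", "province", "city", "os", "browser"]

full = valid_cuboids(DIMS)
pruned = valid_cuboids(
    DIMS,
    mandatory=["date"],
    hierarchy=["country", "province", "city"],
    joint=["os", "browser"],
)
print(len(full), len(pruned))                       # 64 8
print(2 ** (3 + 3), aggregation_group_count([3, 3]))  # 64 16
```

In this example the mandatory dimension leaves one choice, the three-level hierarchy leaves four, and the joint pair leaves two, giving 1 × 4 × 2 = 8 combinations instead of 64.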
These techniques together keep cube size manageable while preserving query performance.
Author: 叫我不矜持
Source: https://www.jianshu.com/p/7906f428aaec
