Comprehensive Guide to Apache Kylin: Background, Architecture, Installation, Optimization, and Real‑World Use Cases

This article provides an in‑depth overview of Apache Kylin, covering its history, mission, core MOLAP principles, technical architecture, step‑by‑step installation (Docker and Hadoop), performance tuning, advanced cube settings, and detailed case studies from major companies such as Baidu, Lianjia, and Didi.

Big Data Technology & Architecture

Jun 21, 2021

Background and Mission

Apache Kylin originated in eBay's BI-on-Hadoop project in 2013, was open-sourced in 2014, and became an Apache top-level project in 2015. Its mission is to deliver very fast OLAP queries on massive datasets, so that analysts can run standard SQL against them and get answers in under a second.

Working Principle

Kylin implements a MOLAP cube model. Users define dimensions and measures; Kylin then pre-computes the aggregates for every combination of dimensions (each combination is a cuboid, comparable to a materialized view) and stores them. At query time, Kylin reads these pre-aggregated results instead of scanning the raw data.

Dimension and Measure Basics

Dimensions represent the angles of analysis (e.g., time, location). Measures are the numeric values to be aggregated (e.g., sales amount, transaction count).

Cube and Cuboid

For N dimensions there are 2^N possible cuboids, one for each subset of the dimensions (10 dimensions already give 1,024). Each cuboid stores the aggregated results for one specific combination of dimensions, and the full set of cuboids constitutes the cube.

For example, the following query can be answered directly from the (Time, Location) cuboid:

SELECT Time, Location, SUM(GMV) AS GMV
FROM Sales
GROUP BY Time, Location
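The pre-computation idea can be sketched in a few lines of Python. This is a toy illustration rather than Kylin's build engine: the sample rows, the `build_cube` and `query` helpers, and the in-memory dictionaries are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Made-up fact rows for illustration: (Time, Location, GMV).
SALES = [
    ("2021-01", "Beijing", 100),
    ("2021-01", "Shanghai", 50),
    ("2021-02", "Beijing", 70),
]
DIMENSIONS = ("Time", "Location")


def build_cube(rows, dimensions):
    """Pre-aggregate SUM(GMV) once for every subset of dimensions (one dict per cuboid)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for positions in combinations(range(len(dimensions)), size):
            cuboid = defaultdict(int)
            for row in rows:
                # The last field of each row is the measure (GMV).
                cuboid[tuple(row[i] for i in positions)] += row[-1]
            cube[tuple(dimensions[i] for i in positions)] = dict(cuboid)
    return cube


def query(cube, group_by):
    """Answer a GROUP BY from the matching cuboid instead of scanning the rows."""
    wanted = set(group_by)
    key = tuple(d for d in DIMENSIONS if d in wanted)
    return cube[key]


cube = build_cube(SALES, DIMENSIONS)
print(len(cube))                            # 2 dimensions -> 2 ** 2 = 4 cuboids
print(query(cube, ["Time", "Location"]))    # same grouping as the SQL above
print(query(cube, ["Location"]))            # a coarser cuboid, also pre-computed
```

The build cost is paid once up front; each query afterwards is a dictionary lookup, which is the trade-off MOLAP makes between storage and query latency.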

Technical Architecture

Kylin consists of an online query layer and an offline build layer. Data sources (HDFS, Hive, Kafka, RDBMS) feed the build engine, which creates cubes stored primarily in HBase. The query layer exposes REST, JDBC, and ODBC interfaces that translate user SQL into cube‑based execution plans.

Core Concepts

Key concepts include data warehouses, OLAP vs. OLTP, BI, dimensional modeling (star and snowflake schemas), fact tables, dimension tables, and the relationship between dimensions and measures.

Quick Start

Docker‑Based Installation (No Hadoop Prerequisite)

Pull the official image and run a container with the required ports:

docker pull apachekylin/apache-kylin-standalone:3.1.0

docker run -d \
  -m 8G \
  -p 7070:7070 -p 8088:8088 -p 50070:50070 \
  -p 8032:8032 -p 8042:8042 -p 16010:16010 \
  apachekylin/apache-kylin-standalone:3.1.0

Once the services inside the container have finished starting, open the Kylin web UI at http://127.0.0.1:7070/kylin/ (default login ADMIN / KYLIN) and use the bundled sample cube to explore functionality.
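Besides the web UI, the same instance can be queried over REST. The sketch below builds a request for Kylin's `/kylin/api/query` endpoint without sending it. It assumes the default ADMIN / KYLIN credentials, the sample project `learn_kylin`, and the sample table `kylin_sales` with its `part_dt` and `price` columns; `build_query_request` is a hypothetical helper name, and the query only returns data once the sample cube has been built.

```python
import base64
import json
import urllib.request


def build_query_request(sql, project, base_url="http://127.0.0.1:7070/kylin",
                        user="ADMIN", password="KYLIN", limit=100):
    """Build (but do not send) a POST request for Kylin's query endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/query",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_query_request(
    "select part_dt, sum(price) as gmv from kylin_sales group by part_dt",
    project="learn_kylin",
)
print(request.get_method(), request.full_url)

# To run it against a live instance, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["results"])
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a real cluster.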

Hadoop‑Based Installation

Download the binary package, set the environment variables (JAVA_HOME, HADOOP_HOME, etc.), run bin/check-env.sh to verify the environment, and then start Kylin with bin/kylin.sh start. From the web UI, create a project, load the Hive tables, define a model, and build the cube.
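Before running the scripts, it can help to confirm that the expected variables are actually set. The following is a minimal pre-flight sketch, not a replacement for bin/check-env.sh: the `missing_variables` helper and the exact list of names are assumptions, so adjust the list to what your Kylin version and Hadoop distribution require.

```python
import os


def missing_variables(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Hypothetical list for illustration; extend it for Hive, HBase, or Spark as needed.
REQUIRED = ["JAVA_HOME", "HADOOP_HOME", "KYLIN_HOME"]

missing = missing_variables(REQUIRED)
if missing:
    print("Set these before starting Kylin:", ", ".join(missing))
else:
    print("Variables are set; next run bin/check-env.sh")
```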

Optimization and Advanced Settings

Resource Tuning

Adjust MapReduce and HBase parameters (e.g., mapreduce.map.java.opts, HBase region size, coprocessor memory) to improve build speed and query latency.
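One way to apply such MapReduce settings is through override lines in kylin.properties. The sketch below renders those lines, assuming the `kylin.engine.mr.config-override.` prefix described in the Kylin 2.x/3.x configuration documentation; the `render_mr_overrides` helper is hypothetical and the memory values are placeholders, not recommendations.

```python
def render_mr_overrides(overrides, prefix="kylin.engine.mr.config-override."):
    """Render Hadoop job settings as kylin.properties override lines."""
    return [f"{prefix}{key}={value}" for key, value in sorted(overrides.items())]


# Placeholder values only; size them to your cluster's container limits.
overrides = {
    "mapreduce.map.memory.mb": 4096,
    "mapreduce.map.java.opts": "-Xmx3500m",
}

lines = render_mr_overrides(overrides)
for line in lines:
    print(line)
```

The JVM heap in `mapreduce.map.java.opts` should stay below the container size in `mapreduce.map.memory.mb`, which is why the two are usually tuned together.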

Cube Advanced Settings

Use aggregation groups, joint dimensions, hierarchy dimensions, mandatory dimensions, and derived dimensions to prune unnecessary cuboids and control cube size. Example: marking one dimension as mandatory halves the number of cuboids, because every cuboid that omits it is dropped (2^N becomes 2^(N-1)).
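The effect of these rules on cuboid count can be checked by brute force. The sketch below is a simplified model of a single aggregation group, not Kylin's actual cuboid scheduler; the dimension names and the `all_cuboids`, `is_valid`, and `count` helpers are assumptions made for illustration.

```python
from itertools import combinations


def all_cuboids(dimensions):
    """Every subset of the dimensions, as frozensets."""
    return [frozenset(c)
            for size in range(len(dimensions) + 1)
            for c in combinations(dimensions, size)]


def is_valid(cuboid, mandatory=(), hierarchy=(), joint=()):
    """Apply the three pruning rules to a single cuboid."""
    # Mandatory: every listed dimension must be present.
    if not set(mandatory) <= cuboid:
        return False
    # Hierarchy (ordered parent -> child): a level needs all of its ancestors.
    ancestor_missing = False
    for level in hierarchy:
        if level in cuboid:
            if ancestor_missing:
                return False
        else:
            ancestor_missing = True
    # Joint: the listed dimensions appear all together or not at all.
    present = cuboid & set(joint)
    if present and present != set(joint):
        return False
    return True


def count(dimensions, **rules):
    return sum(is_valid(c, **rules) for c in all_cuboids(dimensions))


DIMS = ["country", "province", "city", "channel", "device"]
GEO = ["country", "province", "city"]

print(count(DIMS))                                         # 32 = 2 ** 5
print(count(DIMS, mandatory=["channel"]))                  # 16
print(count(DIMS, hierarchy=GEO))                          # 16
print(count(DIMS, joint=["channel", "device"]))            # 16
print(count(DIMS, hierarchy=GEO, joint=["channel", "device"]))  # 8
```

A hierarchy of three levels keeps 4 of its 8 combinations and a joint pair keeps 2 of its 4, so combining them takes this example from 32 cuboids down to 8.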

Real‑World Use Cases

Baidu Maps

Baidu Maps deployed roughly 80 cubes covering 50 billion rows, achieving sub-second latency on complex multi-dimensional queries.

Lianjia

Lianjia operates a 6-node Kylin cluster (3 build nodes, 3 query nodes) with 500+ cubes and 200 TB of storage; 70% of queries return in under 500 ms.

Didi

Didi maintains more than 700 cubes (30 TB) and runs 2,000+ build jobs per day; 80% of queries complete within 500 ms, supporting a wide range of business lines.

Conclusion

Apache Kylin provides a scalable, high‑performance OLAP solution for big‑data environments, offering flexible deployment options, extensive tuning knobs, and proven success in large‑scale production systems.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
