An Introduction to Apache Kylin: Architecture, Core Concepts, Installation, and Enterprise Use Cases
This article provides a comprehensive overview of Apache Kylin, covering its background, core OLAP concepts, technical architecture, installation steps, cube-building methods, real‑world enterprise deployments, and resources for further learning, illustrating how it enables sub‑second query performance on massive datasets.
Preface
With the rapid development of mobile Internet, IoT and other technologies, the amount of data generated by humanity has exploded, ushering in the era of big data. While Hadoop solved the storage problem, performing OLAP queries on massive datasets remains a major challenge.
Enterprise queries fall into two broad categories: ad-hoc queries and custom queries, whose dimensions and measures are known in advance. OLAP engines such as Hive, Presto and SparkSQL simplify analysis, but they suit ad-hoc scenarios best: their response times grow with data volume, so they cannot guarantee sub-second latency for custom queries that need real-time responses. The traditional approach is to pre-compute results and store them in a relational database, which becomes costly to maintain as data volume and business complexity grow.
Apache Kylin was created to address this pain point. Unlike MPP engines, Kylin adopts a pre‑computation model: users define query dimensions in advance, Kylin computes the results and stores them in HBase, delivering sub‑second query responses through a classic space‑for‑time trade‑off.
Kylin originated at eBay and was later contributed to the Apache Software Foundation; members of the core team went on to found Kyligence. It is notable as one of the first Apache top-level projects led by Chinese developers, reflecting the growing influence of Chinese open-source contributions in the data domain.
1. Core Concepts
Data Warehouse
A Data Warehouse (DW) integrates data from multiple sources for BI analysis, differing from traditional OLTP databases which focus on transactional workloads and strict ACID compliance. In big‑data environments, Hive often serves as the warehouse.
OLAP
Online Analytical Processing (OLAP) enables multidimensional analysis of data, contrasting with OLTP which handles transactional operations.
Dimensions and Measures
Dimensions represent discrete attributes (e.g., time, device) used for grouping, while measures are continuous values (e.g., temperature) that are aggregated.
Cube and Cuboid
For a given data model, each possible combination of dimensions is a Cuboid, so N dimensions yield 2^N Cuboids. Each Cuboid stores the aggregated measures for its combination, much like a materialized view. The collection of all Cuboids constitutes a Cube. Example SQL for a Cuboid:
select Time, Location, Sum(GMV) as GMV from Sales group by Time, Location
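The relationship between dimensions, Cuboids and the Cube can be illustrated with a minimal in-memory sketch. It is not how Kylin builds Cubes; the sample rows, the extra Category dimension, and the function names are hypothetical and exist only to show the 2^N enumeration and the per-Cuboid aggregation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows of a Sales table: (Time, Location, Category, GMV).
SALES = [
    ("2017-01", "Beijing", "Books", 100.0),
    ("2017-01", "Shanghai", "Books", 50.0),
    ("2017-02", "Beijing", "Toys", 70.0),
    ("2017-02", "Beijing", "Books", 30.0),
]
DIMENSIONS = ("Time", "Location", "Category")


def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions: 2^N cuboids for N dimensions."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def build_cuboid(rows, dimensions, cuboid):
    """Compute SUM(GMV) grouped by the dimensions of a single cuboid."""
    positions = [dimensions.index(name) for name in cuboid]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[p] for p in positions)
        totals[key] += row[-1]  # the last column is the GMV measure
    return dict(totals)


cuboids = enumerate_cuboids(DIMENSIONS)
cube = {cuboid: build_cuboid(SALES, DIMENSIONS, cuboid) for cuboid in cuboids}

print(len(cuboids))  # 8, i.e. 2^3
print(cube[("Time", "Location")])  # same result as the SQL above
```

Once the cube dictionary exists, answering a group-by query is a lookup rather than a scan of the raw rows, which is the space-for-time trade-off in miniature.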
Fact Table and Dimension Table
Fact tables store large volumes of transactional records, while dimension tables store the attribute values that describe those facts, which reduces redundancy. Separating dimensions from facts has several advantages:
Smaller fact tables.
Easier dimension management.
Dimension tables can be reused across multiple fact tables.
Star Schema
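The star schema consists of a single fact table surrounded by multiple dimension tables; the dimension tables join only to the fact table, never to one another. At the time of writing, Kylin supports only the star schema (snowflake schema support was added later, in Kylin 2.0).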
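A small star schema can be sketched with Python's built-in sqlite3 module. The table and column names (fact_sales, dim_time, dim_location) are hypothetical, and SQLite merely stands in for Hive here; the point is only that each dimension table joins directly to the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one row per attribute value.
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: one row per transaction, referencing the dimensions by key.
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time(time_id),
        location_id INTEGER REFERENCES dim_location(location_id),
        gmv REAL
    );

    INSERT INTO dim_time VALUES (1, '2017-01'), (2, '2017-02');
    INSERT INTO dim_location VALUES (1, 'Beijing'), (2, 'Shanghai');
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 70.0), (2, 1, 30.0);
    """
)

# Every join goes from the fact table to a dimension table (the "star").
rows = conn.execute(
    """
    SELECT t.month, l.city, SUM(f.gmv) AS gmv
    FROM fact_sales AS f
    JOIN dim_time AS t ON f.time_id = t.time_id
    JOIN dim_location AS l ON f.location_id = l.location_id
    GROUP BY t.month, l.city
    ORDER BY t.month, l.city
    """
).fetchall()

for row in rows:
    print(row)
```

Because the fact table holds only keys and measures, a renamed city is changed in one row of dim_location rather than in every fact row.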
2. Apache Kylin Technical Architecture
Kylin consists of an online query layer and an offline build layer.
Offline Build: Data is sourced from Hive (or other supported sources) and must conform to a star schema. The build engine (MapReduce by default, Spark in beta) creates Cubes, which are stored in HBase.
Online Query: Users submit SQL via RESTful API, JDBC/ODBC, or the web UI. The query engine translates the logical plan into a physical plan that reads pre‑computed Cubes, avoiding access to the original data source. If a query is not pre‑defined, Kylin returns an error.
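As a rough illustration of the RESTful path, the sketch below builds (but does not send) a query request using only the standard library. The endpoint path /kylin/api/query, the JSON field names, port 7070, the ADMIN/KYLIN credentials and the learn_kylin project follow Kylin's documented REST API and sample setup as best recalled; treat all of them as assumptions and check them against the documentation for your version. The function name is hypothetical.

```python
import base64
import json
import urllib.request


def build_kylin_query_request(base_url, user, password, project, sql, limit=100):
    """Build an HTTP POST request for Kylin's query endpoint (assumed path)."""
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_kylin_query_request(
    base_url="http://localhost:7070",  # assumed default host and port
    user="ADMIN",                      # assumed sample credentials
    password="KYLIN",
    project="learn_kylin",             # assumed sample project name
    sql="select Time, Location, Sum(GMV) as GMV from Sales group by Time, Location",
)
print(request.get_method(), request.full_url)

# To send it against a running instance:
#     with urllib.request.urlopen(request) as response:
#         result = json.load(response)
```

Separating request construction from sending keeps the sketch runnable without a Kylin server and makes the payload easy to inspect.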
Kylin abstracts the data source, execution engine, and Cube storage, so that each can be swapped out (e.g., Spark for MapReduce, Cassandra for HBase).
Its main characteristics are:
SQL interface: simple and familiar.
Supports massive datasets: performance depends on dimension cardinality, not data size.
Sub‑second response thanks to pre‑computation.
Horizontal scalability via cluster deployment.
Integration with visualization tools through JDBC/ODBC and RESTful APIs.
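The point that performance depends on dimension cardinality rather than data size can be shown with a toy experiment. The sketch below uses synthetic rows and hypothetical names; it counts how many groups a (day, device) Cuboid holds as the number of fact rows grows.

```python
def cuboid_size(num_rows, num_days=30, num_devices=5):
    """Count distinct (day, device) groups among num_rows synthetic fact rows."""
    groups = set()
    for i in range(num_rows):
        day = i % num_days                       # dimension with cardinality 30
        device = (i // num_days) % num_devices   # dimension with cardinality 5
        groups.add((day, device))
    return len(groups)


small = cuboid_size(1_000)
large = cuboid_size(100_000)

print(small, large)  # 150 150
```

The Cuboid never holds more than 30 x 5 = 150 aggregated rows, however many fact rows feed it, so a query served from it does a bounded amount of work. High-cardinality dimensions weaken this guarantee, which is why they deserve attention when designing a Cube.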
3. Installation and Usage
Kylin installation instructions are available on the official website. Two practical notes not covered there:
Kylin depends on Hadoop; using Hadoop in Standalone mode may cause Cube‑build failures. A virtual‑machine cluster is recommended.
In addition to HDFS and YARN, the MapReduce JobHistory Server must be running. Start it with the command sbin/mr-jobhistory-daemon.sh start historyserver.
After deployment, the quick‑start tutorial can verify the installation.
Kylin supports three Cube‑building modes:
Full build – rebuilds the Cube from the entire Hive table on every run.
Incremental build – builds only new data, using Segments identified by start and end timestamps.
Streaming build – consumes data from Kafka in micro‑batches, enabling near‑real‑time updates (available from v1.6).
For incremental builds, the Partition Date Column must be specified when creating the model, and the Partition Start Date set when creating the Cube defines where the first Segment begins.
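The way Segments line up during incremental builds can be sketched as follows. This models only the bookkeeping, not Kylin's implementation; the Segment class, the next_segment function, and the choice of half-open [start, end) ranges are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Segment:
    start: date  # inclusive (assumed convention)
    end: date    # exclusive (assumed convention)


def next_segment(segments, partition_start, build_end):
    """Return the Segment covering data not yet built, up to build_end."""
    # The first Segment begins at the Partition Start Date; each later one
    # begins where the previous Segment ended, so there are no gaps or overlaps.
    start = segments[-1].end if segments else partition_start
    if build_end <= start:
        raise ValueError("build_end must be later than the end of the last segment")
    return Segment(start, build_end)


segments = []
partition_start = date(2017, 1, 1)
for build_end in (date(2017, 1, 2), date(2017, 1, 3)):
    segments.append(next_segment(segments, partition_start, build_end))

for segment in segments:
    print(segment.start, segment.end)
```

Each build touches only the rows whose partition date falls inside the new Segment, which is what keeps an incremental build cheaper than a full build.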
4. Enterprise Application Cases
Baidu Maps
Baidu Maps adopted Kylin to provide millisecond‑level, multi‑dimensional analysis on billions of rows. Kylin solved three major pain points: massive data aggregation latency, complex filter conditions, and large time‑range queries, achieving sub‑second response times.
JD Cloud Sea
JD Cloud Sea uses Kylin to analyze API access logs (≈7 billion calls per day) with second‑level latency requirements. Kylin’s ability to handle huge data volumes, provide standard JDBC/ODBC interfaces, and integrate with BI tools made it the preferred solution.
5. Further Learning
Official documentation: http://kylin.apache.org/
"Apache Kylin Authority Guide" – a comprehensive book by the core development team (covers v1.5; streaming rebuilt in v1.6).
6. Conclusion
Apache Kylin fills the gap for OLAP on Hadoop by offering an easy‑to‑use, pre‑computation engine with sub‑second query performance. It has been adopted by eBay, Baidu, JD, Meituan and many others, proving its value in production environments. Practitioners needing fast, scalable analytics should consider trying Apache Kylin.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
