How Apache Kylin Enables Sub‑Second OLAP on Massive Data Sets
Apache Kylin uses pre‑computed OLAP cubes built on Hadoop, Spark, or Flink to deliver sub‑second query responses on massive datasets. This article covers its architecture, integration with a BI platform, user security, cube building, monitoring, and HBase‑based storage, and shows how these pieces address big‑data analytical challenges.
Research Background
With the rapid growth of mobile Internet, IoT, big data, and AI, data has become the most valuable asset and the foundation for business decisions. As data volumes explode, enterprises face data silos, inconsistent data, scattered data assets, slow report queries, and rising costs, so extracting valuable insights quickly has become a critical challenge.
Pre‑Computation Concept
Most big‑data queries ask for statistical results; raw records are rarely needed. By pre‑aggregating results during data ingestion, a system can answer queries from these pre‑computed values. This trades some flexibility for dramatic performance gains and makes sub‑second response times achievable on massive datasets.
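The idea can be illustrated with a minimal sketch in Python. It is not Kylin's implementation: the rows, the dimension names, and the helper functions are all hypothetical, and a single SUM measure is assumed. It pre-aggregates the measure for every combination of dimensions, so a GROUP BY query becomes a dictionary lookup.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical raw fact rows: (region, product, amount).
ROWS = [
    ("east", "phone", 100),
    ("east", "laptop", 250),
    ("west", "phone", 80),
    ("west", "phone", 20),
]
DIMENSIONS = ("region", "product")


def build_cube(rows, dimensions):
    """Pre-aggregate SUM(amount) for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), size):
            aggregated = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dims)
                aggregated[key] += row[-1]  # last column is the measure
            cube[frozenset(dimensions[i] for i in dims)] = dict(aggregated)
    return cube


def query_sum(cube, group_by):
    """Answer a GROUP BY query by lookup, without scanning the raw rows."""
    return cube[frozenset(group_by)]


CUBE = build_cube(ROWS, DIMENSIONS)
```

With N dimensions the full cube holds 2^N such aggregations, which is the flexibility-for-speed trade-off in concrete form: only combinations that were built ahead of time can be answered this way.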
Apache Kylin Overview
Apache Kylin is an open‑source, distributed analytical data warehouse that provides SQL query interfaces and multi‑dimensional OLAP capabilities on top of Hadoop, Spark, or Flink. Through extensive pre‑computation, Kylin breaks the linear relationship between query time and data size, enabling sub‑second queries on billion‑row tables.
BI Platform Integration Goals
The BI platform integrates Kylin to provide unified user and permission management, a consistent UI, and extended features that adapt Kylin to the platform’s needs. It combines SparkSQL, FlinkSQL, Presto, and other engines via intelligent routing, delivering a one‑stop big‑data OLAP solution.
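The source does not describe the routing rules, so the following Python sketch is purely illustrative. The `QueryProfile` fields and the rule order are assumptions made for this example, not the platform's actual logic; the sketch only shows the general shape of choosing an engine per query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryProfile:
    """Hypothetical summary of a parsed query, produced by some earlier step."""
    is_aggregate: bool     # uses GROUP BY or aggregate functions
    covered_by_cube: bool  # a READY cube covers its dimensions and measures
    is_streaming: bool     # reads from an unbounded source


def route(profile: QueryProfile) -> str:
    """Pick an engine name. The rules below are assumptions for illustration."""
    if profile.is_streaming:
        return "FlinkSQL"
    if profile.is_aggregate and profile.covered_by_cube:
        return "Kylin"      # pre-computed answer is available
    if profile.is_aggregate:
        return "SparkSQL"   # aggregate without a matching cube
    return "Presto"         # ad hoc detail query
```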
System Architecture
The architecture consists of four main components:
Cube Build Engine: Supports MapReduce, Spark, and Flink for building data cubes.
REST Server: Exposes REST, JDBC, and ODBC interfaces for query submission.
Query Engine: Parses SQL, generates execution plans, forwards queries to HBase, and returns results.
Storage Engine: Uses HBase, a distributed column‑oriented database, as the underlying store.
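As a rough illustration of the REST path, the Python sketch below builds a request for Kylin's query endpoint (`POST /kylin/api/query` with basic authentication) without sending it. The host, port, credentials, project, and table names are assumptions taken from a default sample installation and will differ in a real deployment.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, project, sql, limit=100):
    """Build (but do not send) a POST request for the Kylin query endpoint."""
    payload = {"sql": sql, "project": project, "offset": 0, "limit": limit}
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Assumed values for a local sample installation; replace with your own.
REQUEST = build_query_request(
    "http://localhost:7070",
    "ADMIN",
    "KYLIN",
    "learn_kylin",
    "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
)
# To run it against a live server: urllib.request.urlopen(REQUEST)
```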
User and Permission Management
Kylin’s web module is built on the Spring framework and secured with Spring Security. It supports three authentication modes (custom testing, LDAP, and SAML), giving enterprise environments flexible options for identity verification.
Data Model and Cube Construction
BI data subjects are modeled from source metadata, allowing drag‑and‑drop visual modeling. Each cube links to a data model and supports incremental builds by specifying a partition column, avoiding re‑processing of historical data. Dimension tables smaller than 300 MB can be cached as in‑memory snapshots to improve efficiency.
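The incremental build idea can be sketched in Python as follows. This is a simplified model, not Kylin's scheduler: segments are assumed to be contiguous half-open date ranges on the partition column, and `next_segment` is a hypothetical helper that works out which range still needs building.

```python
from datetime import date


def next_segment(existing, build_until):
    """Return the [start, end) range the next incremental build should cover.

    `existing` lists the (start, end) ranges already built, assumed contiguous.
    Returns None when there is nothing new to build.
    """
    if not existing:
        raise ValueError("the first build needs an explicit start date")
    last_end = max(end for _, end in existing)
    if build_until <= last_end:
        return None
    return (last_end, build_until)


# Two daily segments already built (hypothetical dates).
SEGMENTS = [
    (date(2024, 1, 1), date(2024, 1, 2)),
    (date(2024, 1, 2), date(2024, 1, 3)),
]
```

Because each build only covers the range after the last segment, historical partitions are never read again unless a segment is explicitly refreshed.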
Cube Configuration and Feature Enhancements
Unified page layout and Chinese language support.
Centralized security and permission control.
Enhanced cube management and query interfaces.
Default build engine switched to Flink for faster processing.
Cube Monitoring
Kylin provides task logs, alerts, progress bars, and detailed step‑by‑step status. Operators can view overall cube counts, storage usage, and individual states such as DISABLED, ERROR, or READY. Control actions include Resume, Discard, Build, Refresh, and Merge.
Query Execution
Once a cube reaches the READY state, users can query it with standard SQL SELECT statements. Queries must match the cube’s defined dimensions and measures; otherwise, Kylin cannot use the pre‑computed data.
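The matching rule can be pictured as a subset check. The Python sketch below is a simplification under stated assumptions: the cube definition and the `cube_can_answer` helper are hypothetical, and measures are compared as plain strings rather than parsed expressions.

```python
# Hypothetical cube definition; the names are illustrative only.
CUBE_DIMENSIONS = {"part_dt", "region", "category"}
CUBE_MEASURES = {"SUM(price)", "COUNT(*)"}


def cube_can_answer(query_dimensions, query_measures):
    """Return True when every requested dimension and measure is pre-computed."""
    return (
        set(query_dimensions) <= CUBE_DIMENSIONS
        and set(query_measures) <= CUBE_MEASURES
    )
```

A query grouping by `region` with `SUM(price)` passes the check, while one asking for `AVG(discount)` or grouping by an unmodeled column does not and cannot be served from the pre-computed data.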
Storage Engine
Kylin’s plugin architecture enables seamless integration with HBase, providing strong scalability for petabyte‑scale datasets. Since version 1, Kylin has been tightly coupled with Hadoop MapReduce as the build engine, Hive as the data source, and HBase as the storage layer.
IT Architects Alliance