Design and Architecture of an Integrated BI Platform Using Apache Kylin for Large‑Scale OLAP
The article explains the challenges of big‑data analytics, introduces pre‑computation OLAP concepts, and details how Apache Kylin together with Spark, Flink, Presto and other components can be integrated into a BI platform to achieve near‑real‑time query performance on massive datasets.
With the rapid growth of mobile Internet, IoT, big data and AI, data has become a critical asset and the foundation for business decisions, leading many enterprises to pursue digital transformation. However, data silos, consistency issues, and the high cost of processing petabyte‑scale datasets make fast analytics a major challenge.
Traditional Hadoop solved storage and batch processing, but interactive query speed remained limited. SQL‑on‑Hadoop solutions such as Hive, Impala, Presto, Phoenix, Drill, SparkSQL and FlinkSQL introduced Massively Parallel Processing (MPP) and columnar storage, reducing query times from hours to minutes, yet they still fall short of true interactive analysis.
Because most analytical queries only need aggregated results, the article proposes a pre‑computation approach: compute and store aggregates in advance so that queries can be answered from these materialized results, sacrificing some flexibility for massive performance gains.
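The pre‑computation idea can be illustrated with a minimal sketch. It is not Kylin's build algorithm; it simply materializes a SUM for every combination of dimensions (every "cuboid") over a small in‑memory table, then answers a GROUP BY by lookup. The rows, the column names and the helper functions `build_cube` and `query` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table, made up for this sketch.
ROWS = [
    {"region": "EU", "product": "A", "year": 2023, "sales": 10},
    {"region": "EU", "product": "B", "year": 2023, "sales": 5},
    {"region": "US", "product": "A", "year": 2024, "sales": 7},
    {"region": "US", "product": "A", "year": 2023, "sales": 3},
]
DIMENSIONS = ("region", "product", "year")


def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        # combinations() keeps the order of `dimensions`, so each cuboid
        # is identified by a tuple of dimension names in a stable order.
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


def query(cube, dimensions, group_by):
    """Answer a GROUP BY from the stored cuboid instead of scanning rows."""
    dims = tuple(d for d in dimensions if d in group_by)
    return cube[dims]


CUBE = build_cube(ROWS, DIMENSIONS, "sales")
SALES_BY_REGION = query(CUBE, DIMENSIONS, ["region"])
```

With three dimensions the sketch stores 2^3 = 8 cuboids. The cost of answering a query then depends on the size of the chosen cuboid rather than on the number of raw rows, which is the trade of flexibility and storage for speed described above.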
Apache Kylin, an open‑source distributed analytical data warehouse, implements this idea on top of Hadoop/Spark/Flink, delivering sub‑second query latency on billions of rows by breaking the linear relationship between data volume and query time.
The BI platform integrates with Kylin so that both share unified user authentication (via Spring Security), permission management, and a seamless UI. It also combines multiple query engines (SparkSQL, FlinkSQL, Presto) and routes each query intelligently to leverage each engine’s strengths.
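The article does not spell out its routing rules, so the following is only a sketch of what such a router might look like. The `QueryProfile` fields, the 100 GB threshold, the priority order, and the choice to treat Kylin as one of the routing targets are all assumptions made for illustration, not the platform's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryProfile:
    """Hypothetical summary of a query, produced by some earlier analysis step."""
    covered_by_ready_cube: bool   # a READY cube can answer the query
    is_streaming: bool            # the query reads a streaming source
    estimated_scan_gb: float      # rough size of the data to scan


def route(profile: QueryProfile, interactive_limit_gb: float = 100.0) -> str:
    """Pick an engine name; the rules and threshold are illustrative assumptions."""
    if profile.covered_by_ready_cube:
        return "kylin"        # precomputed aggregates are the cheapest path
    if profile.is_streaming:
        return "flinksql"     # assumed choice for streaming sources
    if profile.estimated_scan_gb <= interactive_limit_gb:
        return "presto"       # assumed choice for smaller ad hoc scans
    return "sparksql"         # assumed fallback for large batch-style scans


ENGINE = route(QueryProfile(covered_by_ready_cube=False,
                            is_streaming=False,
                            estimated_scan_gb=12.5))
```

Keeping the decision in one pure function makes the rules easy to test and to replace once real cost statistics are available.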
Key architectural components include a Cube Build Engine (supporting MapReduce, Spark, Flink), a REST Server exposing API/JDBC/ODBC interfaces, and HBase as the column‑oriented storage layer.
User and permission management uses Spring Security with three authentication modes (testing, LDAP, SAML), allowing the BI system to share Kylin’s access controls.
Data modeling follows a drag‑and‑drop approach, mapping BI data models to Kylin cubes, supporting incremental builds and optional in‑memory snapshots for dimension tables under 300 MB.
Cube configuration and monitoring features include unified UI styling, permission integration, build engine selection (default Flink), status tracking (Disabled, Error, Ready), and operations such as Resume, Discard, Build, Refresh, and Merge.
After a cube reaches the READY state, standard SQL SELECT statements can query it, provided the columns in the query’s GROUP BY and WHERE clauses are among the cube’s defined dimensions and its aggregate expressions correspond to the cube’s defined measures.
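The matching condition can be expressed as a simple set check. This is a sketch of the rule as stated here, not Kylin's query planner; the function `cube_can_answer`, the example cube definition, and the representation of a measure as a `(function, column)` pair are assumptions made for illustration.

```python
def cube_can_answer(cube_dimensions, cube_measures, group_by, filters, aggregates):
    """Return True if the cube covers the query.

    group_by / filters: column names used in GROUP BY and WHERE.
    aggregates / cube_measures: (function, column) pairs such as ("SUM", "price").
    """
    dimensions = set(cube_dimensions)
    columns_needed = set(group_by) | set(filters)
    return columns_needed <= dimensions and set(aggregates) <= set(cube_measures)


# Hypothetical cube definition used only for this example.
CUBE_DIMENSIONS = ["part_dt", "region", "category"]
CUBE_MEASURES = [("SUM", "price"), ("COUNT", "*")]

# SELECT region, SUM(price) ... WHERE part_dt >= ... GROUP BY region
COVERED = cube_can_answer(CUBE_DIMENSIONS, CUBE_MEASURES,
                          group_by=["region"], filters=["part_dt"],
                          aggregates=[("SUM", "price")])

# Filtering on a column that is not a cube dimension is not covered.
NOT_COVERED = cube_can_answer(CUBE_DIMENSIONS, CUBE_MEASURES,
                              group_by=["region"], filters=["seller_id"],
                              aggregates=[("SUM", "price")])
```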
Kylin offers flexible connectivity via REST API, JDBC, and ODBC. Its plugin architecture keeps the storage layer extensible, while HBase provides the scalability needed for datasets of billions of rows.
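As an illustration of the REST path, the sketch below only builds (and does not send) a query request, using the Python standard library. It assumes Kylin's documented query endpoint `POST /kylin/api/query` with HTTP Basic authentication. The host, port, credentials, project name and SQL are placeholder values, and the helper `build_query_request` is hypothetical.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, sql, project, limit=50000):
    """Build a POST request for Kylin's query endpoint without sending it."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace with a real server, account and project.
REQUEST = build_query_request(
    base_url="http://localhost:7070",
    user="ADMIN",
    password="KYLIN",
    sql="SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
    project="learn_kylin",
)
```

Sending it would be a matter of passing `REQUEST` to `urllib.request.urlopen` and decoding the JSON response. That step is left out so the sketch runs without a live server.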
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.