Apache Kylin Principles, Architecture, and Real-World Applications in Baidu Maps, Lianjia, and Didi

This article explains Apache Kylin’s core principles and technical architecture, then details how major Chinese companies such as Baidu Maps, Lianjia, and Didi have deployed Kylin for large‑scale OLAP, describing their system designs, performance results, and the challenges they encountered.

Big Data Technology & Architecture

Dec 19, 2020

I recently started using Apache Kylin at work, so I looked into how it works and how it is used in industry. This article draws on the official documentation and several companies' published case studies, with sources provided at the end for readers.

Apache Kylin Principles and Technical Architecture

Apache Kylin reads source data from Hive, uses MapReduce as the cube‑building engine, stores pre‑computed results in HBase, and exposes query interfaces via REST API, JDBC, and ODBC.
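As a rough illustration of the REST path, the sketch below builds (but does not send) a request for Kylin's `/kylin/api/query` endpoint. The host, port, credentials, project name, and SQL are placeholders assumed for the example, not values from any deployment described in this article.

```python
import base64
import json
import urllib.request


def build_kylin_query(base_url, user, password, project, sql, limit=1000):
    """Build a POST request for Kylin's REST query endpoint without sending it."""
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    # Kylin's REST API uses HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace with a real server, account, and project.
request = build_kylin_query(
    "http://localhost:7070", "ADMIN", "KYLIN", "learn_kylin",
    "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
)
# To execute against a running server: urllib.request.urlopen(request).read()
```

JDBC and ODBC clients reach the same query engine, so the REST call is simply the easiest path to show in a few lines.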

The system consists of two main parts: online query and offline cube construction, as shown in the architecture diagram below.

Apache Kylin in Baidu Maps

Baidu Maps’ data intelligence team was one of the earliest adopters of Kylin in China. Their OLAP platform runs about 80 cubes covering roughly 50 billion rows, about half a year of historical data, with single tables of up to 2 billion rows. Complex multi‑dimensional queries return in milliseconds, which makes interactive analysis over billions of rows practical.

Kylin solves three main pain points:

Dynamic multi‑dimensional metric calculation over tens of billions of rows is accelerated by pre‑computing cubes and storing them in HBase.

Complex filter conditions are handled by Kylin’s router algorithm and optimized HBase coprocessors.

Large time‑range queries (monthly, quarterly, yearly) are addressed by cube data segment partitioning.
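To make the first and third points concrete, here is a toy in-memory sketch of pre-computation and segment partitioning. It is not Kylin's implementation: the rows, column names, and helper functions are invented for illustration, and plain dictionaries stand in for HBase.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("city", "device")  # hypothetical dimension columns

# Hypothetical fact rows: (day, city, device, page_views)
ROWS = [
    ("2020-12-01", "Beijing", "ios", 10),
    ("2020-12-01", "Beijing", "android", 7),
    ("2020-12-01", "Shanghai", "ios", 4),
    ("2020-12-02", "Beijing", "ios", 5),
    ("2020-12-02", "Shanghai", "android", 9),
]


def build_segment(rows):
    """Pre-aggregate one day's rows for every subset of dimensions (every cuboid)."""
    segment = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            cuboid = defaultdict(int)
            for _, city, device, views in rows:
                record = {"city": city, "device": device}
                cuboid[tuple(record[d] for d in dims)] += views
            segment[dims] = dict(cuboid)
    return segment


def build_cube(rows):
    """Offline step: one segment per day, so new days never rebuild old ones."""
    by_day = defaultdict(list)
    for row in rows:
        by_day[row[0]].append(row)
    return {day: build_segment(day_rows) for day, day_rows in by_day.items()}


def query(cube, dims, start_day, end_day):
    """Online step: merge pre-computed cuboids from the segments in range."""
    result = defaultdict(int)
    for day, segment in cube.items():
        if start_day <= day <= end_day:
            for key, views in segment[dims].items():
                result[key] += views
    return dict(result)


cube = build_cube(ROWS)
print(query(cube, ("city",), "2020-12-01", "2020-12-02"))
# {('Beijing',): 22, ('Shanghai',): 13}
```

The query never touches the raw rows: it only reads and merges small pre-aggregated results, which is why a wide time range costs one lookup per segment rather than a full scan.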

Baidu Maps OLAP Platform Architecture

Main modules include:

Data Ingestion – pulls fine‑grained fact tables from the data warehouse.

Task Management – executes and manages cube‑related jobs.

Task Monitoring – tracks the status of cube jobs.

Cluster Monitoring – monitors Hadoop ecosystem processes and Kylin processes.

The team follows the usual cube‑design best practices: keep each cube to at most 15 dimensions, place high‑cardinality dimensions first, and avoid overly large dimension values.
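One way to see why a cap on dimensions matters: a full cube over n dimensions has 2^n cuboids, one per subset of dimensions. The short calculation below shows the growth, assuming no pruning at all (real cubes usually build fewer cuboids than this upper bound).

```python
def full_cuboid_count(num_dimensions: int) -> int:
    """Number of cuboids in an unpruned cube: one per subset of dimensions."""
    return 2 ** num_dimensions


for n in (10, 15, 20):
    print(n, full_cuboid_count(n))
# 10 1024
# 15 32768
# 20 1048576
```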

In practice, Baidu Maps builds cubes on fact tables while keeping dimension names in MySQL, storing only dimension IDs in the fact table to reduce size, lower join pressure on a small Hadoop cluster, and simplify back‑tracking.
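The ID-only fact table pattern can be sketched as follows. The rows and the `CITY_NAMES` mapping are made up; in the setup described above the mapping would live in MySQL, and a plain dictionary stands in for it here.

```python
from collections import Counter

# Hypothetical fact rows holding only dimension IDs: (city_id, order_count)
FACT_ROWS = [(1, 3), (2, 5), (1, 4)]

# Stand-in for the MySQL dimension table that maps IDs to display names.
CITY_NAMES = {1: "Beijing", 2: "Shanghai"}


def orders_by_city(fact_rows, names):
    """Aggregate on IDs first, then attach names only to the small result."""
    totals = Counter()
    for city_id, orders in fact_rows:
        totals[city_id] += orders
    return {names.get(city_id, f"unknown:{city_id}"): n for city_id, n in totals.items()}


print(orders_by_city(FACT_ROWS, CITY_NAMES))
# {'Beijing': 7, 'Shanghai': 5}
```

In this sketch, renaming a dimension value only touches the mapping; the fact rows and anything aggregated from them stay unchanged.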

Aggregation cubes help avoid data explosion when rolling up high‑dimensional metrics, and a dedicated “agg” partition stores pre‑aggregated results for fast single‑dimension queries.

For retention analysis, a rotating matrix approach stores daily retention data in a way that allows O(1) updates for new retention windows, reducing storage and compute overhead.

Apache Kylin in Lianjia

Lianjia’s Kylin platform (built in late 2016) runs on six machines: three for distributed cube building and three for load‑balanced query serving, each query node limited to 80 GB RAM. An independent HBase cluster is used to avoid interference with the compute cluster.

Kylin focuses on pre‑computation; ad‑hoc and detail queries are routed to Spark, Presto, or Hive via a custom QE engine.
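A routing layer of this kind might look roughly like the sketch below. Everything in it is hypothetical: the `Query` fields, the cube registry, and the rule for choosing a fallback engine are assumptions for illustration, not Lianjia's actual QE logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    table: str
    is_aggregate: bool                    # aggregate query vs. detail query
    dimensions: frozenset = frozenset()   # dimensions the query groups or filters on


# Hypothetical registry: table -> dimensions covered by a pre-built cube.
CUBES = {"house_deals": frozenset({"city", "district", "month"})}


def route(query: Query, fallback: str = "presto") -> str:
    """Send cube-covered aggregate queries to Kylin, everything else to the fallback."""
    cube_dims = CUBES.get(query.table)
    if query.is_aggregate and cube_dims is not None and query.dimensions <= cube_dims:
        return "kylin"
    return fallback


print(route(Query("house_deals", True, frozenset({"city", "month"}))))  # kylin
print(route(Query("house_deals", False)))                               # presto
print(route(Query("house_deals", True, frozenset({"agent_id"}))))       # presto
```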

Key statistics: more than 500 cubes covering 12 business lines, total storage above 200 TB, billions of rows per cube, and more than 270,000 queries per day, 70% of which complete in under 500 ms.

Version 1.6 is in production, with selective adoption of 2.0+ features such as distributed building and distributed locks.

Custom enhancements include:

Distributed cube building across multiple machines.

Optimized dictionary download to fetch only needed dictionaries.

Global dictionary lock fix to prevent task blockage.

Forced dimension‑table joins to filter dirty data.

Switch to G1 garbage collector for more predictable GC pauses.

Kylin in Didi’s OLAP Engine

Didi’s deployment consists of 2 build nodes and 8 query nodes (2 serving REST requests, the rest as standby). The design avoids a single point of failure and mitigates metadata synchronization issues.

Kylin serves as a “solidified analysis” engine, providing aggregation cache for tables identified by the OLAP engine as high‑frequency aggregation targets, such as reporting tables and large‑scale tables selected by the data‑transfer decision module.

Current metrics: more than 700 cubes, more than 2,000 build jobs per day (37 minutes on average), total cube storage above 30 TB, 80% of uncached queries under 500 ms, and 90% under 2.8 s. Peak cube count can exceed 1,100 during heavy periods.

Business impact includes supporting internal reporting platforms and open data services across more than ten business lines, serving over 3000 indirect users with stable, high‑performance analysis.

Challenges encountered:

High concurrency from the OLAP engine caused occasional metadata update failures; mitigated with internal queuing and retry mechanisms.

Cube deletion did not automatically clean HBase segments, leading to storage pressure; custom scripts were added to perform cleanup.

These issues have been addressed in newer Kylin releases.
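The queuing-and-retry mitigation for metadata updates could be approximated as below. This is a generic pattern, not Didi's code: `MetadataConflict`, `update_with_retry`, and the deliberately flaky update function are hypothetical names, and a lock stands in for the internal queue.

```python
import threading
import time


class MetadataConflict(Exception):
    """Hypothetical error raised when a concurrent metadata update fails."""


_update_lock = threading.Lock()  # serializes updates, like a one-slot queue


def update_with_retry(update, attempts=3, base_delay=0.01):
    """Run update() one caller at a time, retrying with exponential backoff."""
    for attempt in range(attempts):
        try:
            with _update_lock:
                return update()
        except MetadataConflict:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_update():
    """Fails twice, then succeeds, to exercise the retry path."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise MetadataConflict("concurrent write")
    return "ok"


print(update_with_retry(flaky_update))  # ok
```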

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
