How Baidu Maps Scales Billion‑Row OLAP Queries with Apache Kylin
Baidu Maps’ Data Intelligence team built a large‑scale OLAP platform on Apache Kylin. This article covers the challenges of multi‑dimensional analysis on billions of rows, the platform architecture, the custom extensions for task, resource, and monitoring management, and the performance optimizations that achieve millisecond‑level SQL responses.
Introduction
The Baidu Maps Open Platform Business Unit’s Data Intelligence team is responsible for massive daily data processing (hundreds of billions of rows) and provides millisecond‑level OLAP query services for various business scenarios.
Why Apache Kylin?
In 2014 the team needed a complete OLAP analysis platform. After evaluating Apache Drill, Presto, Impala, Spark SQL and Apache Kylin, they chose Kylin because it pre‑computes cubes via MapReduce, offering low‑latency queries on petabyte‑scale data. The first production deployment occurred around February 2015.
Apache Kylin Overview
Apache Kylin is an open‑source distributed analytical engine that provides a SQL interface and multi‑dimensional (OLAP) capabilities on top of Hadoop. Originally developed by eBay, it became an Apache top‑level project in November 2015.
Key Challenges Solved by Kylin
Pain point 1: Dynamic calculation of multi‑dimensional metrics on hundred‑billion‑row data; Kylin solves this by pre‑computing cubes stored in HBase.
Pain point 2: Complex conditional filtering; Kylin uses a router algorithm and optimized HBase coprocessors.
Pain point 3: Queries across large time ranges (months, quarters, years); Kylin manages this with cube data segment partitioning.
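Together these solutions give millisecond‑level responses per query. That matters because a single page may issue multiple SQL queries: with slow queries the page could take around 10 seconds to load, while with millisecond‑level queries the load time stays acceptable.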
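The pre‑computation idea behind pain point 1 can be illustrated with a toy example. This is only a sketch of the general cube concept, not Kylin's build engine (which runs on MapReduce and stores results in HBase); the rows, dimension names, and the `build_cube` and `query` helpers are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of dimensions (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, group_by):
    """Answer a GROUP BY by reading the matching pre-computed cuboid.

    `group_by` must list dimensions in the same order used at build time.
    """
    return cube[tuple(group_by)]


# Hypothetical fact rows: page views by city and operating system.
rows = [
    {"city": "Beijing", "os": "android", "pv": 3},
    {"city": "Beijing", "os": "ios", "pv": 2},
    {"city": "Shanghai", "os": "android", "pv": 5},
]
cube = build_cube(rows, ["city", "os"], "pv")
by_city = query(cube, ["city"])
```

With two dimensions the cube holds 2^2 = 4 cuboids. A query then becomes a lookup rather than a scan of the fact rows, which is where the low latency comes from; the cost is paid once at build time.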
Platform Architecture
The OLAP platform consists of the following core modules:
Data Ingestion: Pulls fine‑grained fact tables from the data warehouse.
Task Management: Executes and manages cube‑related jobs.
Task Monitoring: Tracks job status and sends alerts on failure or completion.
Cluster Monitoring: Monitors Hadoop, Hive, HBase, and Kylin processes and handles temporary file cleanup.
Secondary Development
Data Ingestion Enhancements
The platform supports MySQL and HDFS sources. It detects when previous‑day data is ready by checking Hive partitions or monitoring row‑count changes in MySQL, then triggers data pull, optional preprocessing, and cube build.
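A minimal sketch of the two readiness checks described above. The `dt=YYYY-MM-DD` partition naming, the "row count has stopped changing" rule, and the `stable_polls` threshold are assumptions made for illustration; the article does not spell out the team's actual rules. The functions take already-fetched partition names and row counts so the sketch stays independent of any particular Hive or MySQL client.

```python
from datetime import date


def hive_partition_ready(partitions, day):
    """True when the partition for `day` appears in the table's partition list."""
    return f"dt={day.isoformat()}" in partitions


def mysql_rows_stable(row_counts, stable_polls=3):
    """True when the last `stable_polls` polled row counts are equal and non-zero.

    `row_counts` is the history of COUNT(*) results, oldest first.
    """
    if len(row_counts) < stable_polls:
        return False
    tail = row_counts[-stable_polls:]
    return tail[0] > 0 and len(set(tail)) == 1


# Hypothetical polling results for the previous day's data.
day = date(2016, 1, 2)
hive_ok = hive_partition_ready(["dt=2016-01-01", "dt=2016-01-02"], day)
mysql_ok = mysql_rows_stable([100, 250, 300, 300, 300])
```

Once either check passes, a scheduler could trigger the data pull, optional preprocessing, and cube build in sequence.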
Task Management Extensions
Cubes are built in units of data segments (similar to Hive partitions). Three operations exist on segments: build, refresh, and merge. The team introduced strategies to refresh or merge segments efficiently, especially when a month is covered by many small segments.
For example, refreshing an entire month’s data can be reduced from 23 separate refreshes (≈537 minutes) to a single merge followed by one refresh (≈84 minutes).
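The merge‑then‑refresh strategy can be sketched as a small planning function. This assumes the 23 refreshes correspond to 23 contiguous segments; the `(start, end)` tuple representation and the `plan_month_refresh` helper are hypothetical, and only the 537‑minute and 84‑minute figures come from the article.

```python
def plan_month_refresh(segments):
    """Return the job list for refreshing every segment in `segments`.

    `segments` is a list of (start, end) day ranges, sorted and contiguous.
    More than one segment is first merged into a single segment, which is
    then refreshed once.
    """
    if len(segments) <= 1:
        return [("refresh", s) for s in segments]
    merged = (segments[0][0], segments[-1][1])
    return [("merge", merged), ("refresh", merged)]


# 23 hypothetical one-day segments, each covering [d, d + 1).
segments = [(d, d + 1) for d in range(1, 24)]
plan = plan_month_refresh(segments)

# Speedup implied by the timings reported in the article.
speedup = 537 / 84
```

On the reported timings the merged plan is roughly 6.4 times faster, which suggests the per‑job overhead of many small refreshes dominates the actual computation.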
Resource Isolation
To avoid having all projects share a single Hadoop queue, the team modified the Kylin 1.1.1 source code to support per‑project queue isolation and submitted the change as KYLIN-1241.
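The idea of per‑project queue isolation can be sketched as a lookup that selects the YARN queue for each project's jobs. This is not the KYLIN-1241 patch; the project names, queue names, and the `job_conf_for` helper are hypothetical. Only `mapreduce.job.queuename` is a standard Hadoop job property.

```python
DEFAULT_QUEUE = "default"


def job_conf_for(project, queue_by_project):
    """Build the Hadoop job property override for a project's cube jobs.

    Projects without an explicit mapping fall back to DEFAULT_QUEUE.
    """
    queue = queue_by_project.get(project, DEFAULT_QUEUE)
    return {"mapreduce.job.queuename": queue}


# Hypothetical mapping from project to a dedicated queue.
queue_by_project = {"traffic_reports": "olap_traffic", "poi_stats": "olap_poi"}
conf = job_conf_for("traffic_reports", queue_by_project)
```

With such a mapping, a heavy build in one project competes only for its own queue's capacity instead of starving the jobs of every other project.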
Hadoop & HBase Optimizations
Because hardware was limited, the team tuned YARN and MapReduce memory settings and performed extensive tuning of the HBase JVM, ZooKeeper, and region servers (GC parameters including CMS GC, MSLAB, session timeouts, etc.).
Retention Analysis Solutions
The team compared a traditional storage scheme (requiring daily back‑fills of historical data) with a “rotated diagonal” approach that stores only the most recent day’s data and derives older retention metrics by shifting indices. The latter reduces daily MapReduce workload and simplifies scaling to longer retention windows.
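One possible reading of the “rotated diagonal” approach is sketched below. The article gives no schema, so this layout is an assumption: each day writes a single row keyed by the activity date, where column k counts users first seen k days earlier who were active that day. A cohort's day‑k retention is then read by shifting the row index forward by k days. The function names and sample users are invented for illustration.

```python
from datetime import date, timedelta


def write_today(store, today, active_users, new_users_by_day, window):
    """Append one row for `today`: column k holds how many users first seen
    on (today - k days) were active today."""
    row = {}
    for k in range(1, window + 1):
        cohort = new_users_by_day.get(today - timedelta(days=k), set())
        row[k] = len(cohort & active_users)
    store[today] = row


def retention(store, cohort_day, k):
    """Day-k retention count for a cohort: shift the row index by k days."""
    row = store.get(cohort_day + timedelta(days=k))
    return None if row is None else row.get(k)


d1, d2, d3 = date(2016, 1, 1), date(2016, 1, 2), date(2016, 1, 3)
new_users_by_day = {d1: {"a", "b", "c"}, d2: {"d"}}

store = {}
write_today(store, d2, {"a", "d"}, new_users_by_day, window=2)
write_today(store, d3, {"a", "b", "d"}, new_users_by_day, window=2)

day1_of_d1 = retention(store, d1, 1)  # read from the row written on d2
day2_of_d1 = retention(store, d1, 2)  # read from the row written on d3
```

Under this layout no historical row is ever rewritten, so the daily job touches only the newest day, and widening the retention window just adds columns to future rows.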
Summary
Today the platform supports around 80 cubes, half a trillion rows of source data, and delivers millisecond‑level responses for complex, multi‑dimensional queries across large time ranges. The team credits the open‑source contributions of Apache Kylin and the community for enabling this scalable OLAP solution.
Baidu Maps Tech Team
