How Baidu Maps Scales Billion‑Row OLAP Queries with Apache Kylin

Baidu Maps’ Data Intelligence team built a large-scale OLAP platform on Apache Kylin. This article covers the challenges of multi-dimensional analysis on billions of rows, the platform architecture, the team's custom extensions for task, resource, and monitoring management, and the performance optimizations that achieve millisecond-level SQL responses.

Baidu Maps Tech Team

Jan 6, 2016

Introduction

The Baidu Maps Open Platform Business Unit’s Data Intelligence team is responsible for massive daily data processing (hundreds of billions of rows) and provides millisecond‑level OLAP query services for various business scenarios.

Why Apache Kylin?

In 2014 the team needed a complete OLAP analysis platform. After evaluating Apache Drill, Presto, Impala, Spark SQL and Apache Kylin, they chose Kylin because it pre‑computes cubes via MapReduce, offering low‑latency queries on petabyte‑scale data. The first production deployment occurred around February 2015.

Apache Kylin Overview

Apache Kylin is an open‑source distributed analytical engine that provides a SQL interface and multi‑dimensional (OLAP) capabilities on top of Hadoop. Originally developed by eBay, it became an Apache top‑level project in November 2015.

Key Challenges Solved by Kylin

Pain point 1: Computing multi-dimensional metrics on the fly over hundred-billion-row data. Kylin addresses this by pre-computing cubes and storing them in HBase.

Pain point 2: Complex conditional filtering. Kylin handles this with a router algorithm and optimized HBase coprocessors.

Pain point 3: Queries across large time ranges (months, quarters, years). Kylin manages this by partitioning cube data into segments.

Together, these solutions return each SQL query in milliseconds, so a page that issues several queries stays responsive instead of taking around 10 seconds to load.
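To make the pre-computation idea behind pain point 1 concrete, here is a minimal Python sketch that aggregates a measure for every combination of dimensions ahead of time, so that a filtered query becomes a dictionary lookup. It is an illustration only, not Kylin's implementation: the names `build_cube` and `query`, the in-memory dictionaries, and the sample columns are all assumptions, whereas Kylin builds its cuboids with MapReduce and stores them in HBase.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of `dimensions` (one cuboid each)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, dimensions, filters):
    """Answer SUM(measure) WHERE dim = value AND ... with a single lookup."""
    # Pick the cuboid whose dimensions are exactly the filtered ones.
    dims = tuple(d for d in dimensions if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0)


# Hypothetical fact rows: page views by city and OS.
DIMENSIONS = ["city", "os"]
rows = [
    {"city": "Beijing", "os": "ios", "pv": 3},
    {"city": "Beijing", "os": "android", "pv": 5},
    {"city": "Shanghai", "os": "ios", "pv": 2},
]
cube = build_cube(rows, DIMENSIONS, "pv")
total_pv = query(cube, DIMENSIONS, {})
beijing_pv = query(cube, DIMENSIONS, {"city": "Beijing"})
```

With N dimensions a full cube holds 2^N cuboids, which is the trade being made: storage and build time are spent up front so that queries do not scan the raw rows.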

Platform Architecture

The OLAP platform consists of the following core modules:

Data Ingestion: Pulls fine‑grained fact tables from the data warehouse.

Task Management: Executes and manages cube‑related jobs.

Task Monitoring: Tracks job status and sends alerts on failure or completion.

Cluster Monitoring: Monitors Hadoop, Hive, HBase, and Kylin processes and handles temporary file cleanup.

Secondary Development

Data Ingestion Enhancements

The platform supports MySQL and HDFS sources. It detects when previous‑day data is ready by checking Hive partitions or monitoring row‑count changes in MySQL, then triggers data pull, optional preprocessing, and cube build.
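The readiness check and the steps it triggers could look roughly like the Python sketch below. The article does not give the actual rules, so everything here is an assumption: the `dt=YYYY-MM-DD` partition naming, the idea that a MySQL table counts as ready once its row count stops changing for a few consecutive samples, and all function names.

```python
from datetime import date


def hive_partition_ready(existing_partitions, day):
    """True if the partition for `day` exists (assumed naming: dt=YYYY-MM-DD)."""
    return f"dt={day.isoformat()}" in existing_partitions


def mysql_rows_stable(row_count_samples, required_stable=3):
    """True if the last `required_stable` samples are equal and non-zero.

    Assumption: a row count that has stopped changing means loading finished.
    """
    if len(row_count_samples) < required_stable:
        return False
    tail = row_count_samples[-required_stable:]
    return tail[0] > 0 and all(count == tail[0] for count in tail)


def next_steps(source_ready, needs_preprocessing):
    """Return the ordered pipeline steps to run for the previous day's data."""
    if not source_ready:
        return ["wait"]
    steps = ["pull"]
    if needs_preprocessing:
        steps.append("preprocess")
    steps.append("build_cube")
    return steps


ready = hive_partition_ready({"dt=2016-01-04", "dt=2016-01-05"}, date(2016, 1, 5))
steps = next_steps(ready, needs_preprocessing=True)
```

Keeping the readiness checks separate from the step planning means a new source type only needs its own check function.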

Task Management Extensions

Cubes are built in units of data segments (similar to Hive partitions). Three operations apply to segments: build, refresh, and merge. The team introduced strategies to refresh or merge segments efficiently, especially when a month is spread across many small segments.

For example, refreshing an entire month’s data can be reduced from 23 separate refreshes (≈537 minutes) to a single merge followed by one refresh (≈84 minutes).
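The merge-then-refresh strategy and the resulting saving can be sketched as follows. This is a planning sketch under stated assumptions, not the team's scheduler: segments are represented as sorted, half-open `(start, end)` date-string pairs, the sketch only merges contiguous segments, and `plan_refresh` and `speedup` are hypothetical names. The timings are the two figures quoted above.

```python
def plan_refresh(segments, merge_first=True):
    """Return the jobs needed to refresh every segment in `segments`.

    `segments` is a sorted list of (start, end) half-open ranges.
    """
    if not segments:
        return []
    if not merge_first or len(segments) == 1:
        # Naive plan: one refresh job per segment.
        return [("refresh", start, end) for start, end in segments]
    # Assumption of this sketch: only contiguous segments are merged.
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        if prev_end != next_start:
            raise ValueError("segments must be contiguous to merge")
    start, end = segments[0][0], segments[-1][1]
    return [("merge", start, end), ("refresh", start, end)]


def speedup(before_minutes, after_minutes):
    """Ratio of the old duration to the new one, rounded to one decimal."""
    return round(before_minutes / after_minutes, 1)


segments = [
    ("2015-11-01", "2015-11-02"),
    ("2015-11-02", "2015-11-03"),
    ("2015-11-03", "2015-11-04"),
]
plan = plan_refresh(segments)
ratio = speedup(537, 84)   # 537 / 84 is about 6.4
saved = 537 - 84           # 453 minutes saved
```

On the quoted figures, the merged plan is roughly 6.4 times faster and saves about 453 minutes for the month.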

Resource Isolation

To avoid sharing a single Hadoop queue across all projects, the team modified the Kylin 1.1.1 source code to support per-project queue isolation and submitted the change as KYLIN-1241.
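Conceptually, per-project isolation means resolving a queue from the project before a job is submitted. The sketch below illustrates only that idea and is not the KYLIN-1241 patch: the mapping, the project and queue names, and the `job_overrides` function are hypothetical. The one real identifier is the Hadoop property `mapreduce.job.queuename`, which selects the queue a MapReduce job is submitted to.

```python
DEFAULT_QUEUE = "default"

# Hypothetical project-to-queue mapping; the names are illustrative only.
PROJECT_QUEUES = {
    "map_traffic": "queue_traffic",
    "map_search": "queue_search",
}


def job_overrides(project, project_queues=PROJECT_QUEUES, default_queue=DEFAULT_QUEUE):
    """Return the MapReduce settings to apply to jobs of `project`."""
    queue = project_queues.get(project, default_queue)
    return {"mapreduce.job.queuename": queue}


traffic_conf = job_overrides("map_traffic")
unknown_conf = job_overrides("adhoc_project")
```

Falling back to a default queue keeps unmapped projects working, while mapped projects cannot starve each other of cluster resources.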

Hadoop & HBase Optimizations

Because hardware was limited, the team tuned YARN and MapReduce memory settings and extensively tuned the HBase JVM, ZooKeeper, and region servers (CMS GC parameters, MSLAB, session timeouts, and so on).

Retention Analysis Solutions

The team compared a traditional storage scheme (requiring daily back‑fills of historical data) with a “rotated diagonal” approach that stores only the most recent day’s data and derives older retention metrics by shifting indices. The latter reduces daily MapReduce workload and simplifies scaling to longer retention windows.
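One way to read the "rotated diagonal" idea is sketched below in Python. The article does not spell out the storage layout, so this is an interpretation under stated assumptions: each day writes a single row indexed by observation date, where column k counts the users active that day who first appeared k days earlier. A cohort's day-k retention is then read from the row k days after the cohort date, so no historical row is ever rewritten. The names `record_day` and `retention` are hypothetical.

```python
from datetime import date, timedelta


def record_day(store, day, active_users, first_seen, max_offset):
    """Write the row for `day` only; older rows are never touched.

    row[k] = number of users active on `day` whose first day was `day` - k.
    """
    row = [0] * (max_offset + 1)
    for user in active_users:
        k = (day - first_seen[user]).days
        if 0 <= k <= max_offset:
            row[k] += 1
    store[day] = row


def retention(store, cohort_day, k):
    """Day-k retention of the cohort that first appeared on `cohort_day`.

    Shifted lookup: row (cohort_day + k), column k, over the cohort size.
    Returns None if the data is not available yet.
    """
    cohort_row = store.get(cohort_day)
    later_row = store.get(cohort_day + timedelta(days=k))
    if cohort_row is None or later_row is None or k >= len(later_row):
        return None
    cohort_size = cohort_row[0]
    if cohort_size == 0:
        return None
    return later_row[k] / cohort_size


d1, d2, d3 = date(2016, 1, 1), date(2016, 1, 2), date(2016, 1, 3)
first_seen = {"a": d1, "b": d1, "c": d2}
store = {}
record_day(store, d1, {"a", "b"}, first_seen, max_offset=2)
record_day(store, d2, {"a", "c"}, first_seen, max_offset=2)
record_day(store, d3, {"a", "b", "c"}, first_seen, max_offset=2)
day1_retention = retention(store, d1, 1)
```

Under this layout, supporting a longer retention window only widens each new row; it does not add daily back-fill jobs over historical dates.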

Summary

Today the platform supports around 80 cubes, half a trillion rows of source data, and delivers millisecond‑level responses for complex, multi‑dimensional queries across large time ranges. The team credits the open‑source contributions of Apache Kylin and the community for enabling this scalable OLAP solution.
