Optimizing Apache Kylin for Meituan's Sales OLAP: From MapReduce to Spark and Resource Tuning
This article presents a detailed case study of how Meituan's in‑store dining sales team identified severe efficiency issues in their Apache Kylin‑based OLAP system, dissected the construction process, and applied a step‑by‑step optimization roadmap—including engine migration, dimension pruning, resource configuration, and Spark‑based layered building—to boost query performance and achieve near‑perfect SLA.
Background
Since 2016 Meituan's in‑store dining sales platform ("Qingtian") has used Apache Kylin as its OLAP engine. Rapid business growth by 2020 caused severe construction and query inefficiencies, threatening data‑driven decision making.
Problem & Goals
The sales system required multi‑level organization views, precise (exact‑count) deduplication for over one‑third of its metrics, and peak loads of tens of thousands of queries. Kylin's 2^N dimension‑combination explosion and its reliance on MapReduce led to long build times, high resource consumption, and missed SLAs.
Optimization Principles – Understanding the Fundamentals
Kylin’s pre‑computation creates Cuboids for every dimension combination; queries read the appropriate Cuboid. The By‑layer algorithm computes Cuboids layer‑by‑layer, reusing results from lower layers to avoid redundant work.
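To make the layer structure concrete, here is a minimal Python sketch of the cuboid lattice the By‑layer algorithm walks. The dimension names are hypothetical, and real Kylin prunes combinations via aggregation groups; this only illustrates why N dimensions yield 2^N cuboids arranged in N+1 layers.

```python
from itertools import combinations

def by_layer_cuboids(dimensions):
    """Enumerate cuboids layer by layer, starting from the base cuboid.

    Layer 0 holds the single base cuboid (all dimensions); each subsequent
    layer drops one more dimension, so a layer-(k+1) cuboid can always be
    aggregated from some layer-k parent instead of from the raw data.
    """
    n = len(dimensions)
    layers = []
    for size in range(n, -1, -1):
        layers.append([frozenset(c) for c in combinations(dimensions, size)])
    return layers

# Hypothetical dimensions for illustration only
layers = by_layer_cuboids(["city", "org", "product"])
total = sum(len(layer) for layer in layers)  # 2^3 = 8 cuboids over 4 layers
```

With 3 dimensions this yields 8 cuboids; at sales-platform scale the exponent is why dimension pruning matters so much.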
Process Analysis – Layered Decomposition
The team broke the build pipeline into five key stages: engine selection, data reading, dictionary building, layered construction, and file conversion. Detailed analysis of each stage revealed specific bottlenecks.
Engine Selection
Switching the build engine from MapReduce to Spark (supported by Kylin since 2017) sped up builds by roughly 1–3×. The migration was rolled out gradually: existing MapReduce jobs were preserved while the Spark jobs' parameters were tuned.
Data Reading
Kylin reads source data from Hive external tables stored in HDFS. Small‑file issues were mitigated by adjusting MapReduce split size and merging Hive partitions where appropriate.
Dictionary Building
Dimension dictionaries map raw values to encoded IDs, reducing HBase storage. Global dictionary dependencies were configured to avoid redundant computation for deduplication‑heavy metrics.
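The encoding idea can be sketched in a few lines of Python. This is a deliberate simplification: Kylin's actual dictionaries use trie-based and bitmap structures, and global dictionaries coordinate IDs across segments; the sketch only shows the value‑to‑ID mapping that shrinks storage.

```python
def build_dictionary(values):
    """Map each distinct dimension value to a compact integer ID.

    Sorting the distinct values makes the encoding deterministic,
    so rebuilding over the same data yields the same IDs.
    """
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Hypothetical city values for illustration
raw = ["Beijing", "Shanghai", "Beijing", "Shenzhen"]
dictionary = build_dictionary(raw)
encoded = [dictionary[v] for v in raw]  # small ints stored instead of strings
```

Sharing one such dictionary across deduplication‑heavy metrics is exactly what avoids recomputing the mapping per metric.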
Layered Build
With Spark, the By‑layer algorithm is used exclusively. Each Cuboid layer becomes a Spark job, and intermediate results are cached in memory so each layer is computed from its cached parent rather than re‑read from disk. The number of jobs equals the number of layers, and each job runs two stages: reading the cached parent layer and writing the new layer's cache.
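The per-layer roll-up can be sketched in plain Python (in the real pipeline this is a distributed Spark aggregation over cached RDDs, not an in-memory dict; dimension names and measures here are hypothetical):

```python
from collections import defaultdict

def roll_up(parent, dims, drop_dim):
    """Build a child cuboid from a cached parent by summing out one dimension.

    parent: dict mapping tuples of dimension values -> additive measure.
    Mirrors how each by-layer Spark job reads the parent layer from cache
    and writes the next layer, instead of rescanning the source table.
    """
    keep = [i for i, d in enumerate(dims) if d != drop_dim]
    child = defaultdict(int)
    for key, measure in parent.items():
        child[tuple(key[i] for i in keep)] += measure
    return dict(child), [dims[i] for i in keep]

# Base cuboid over (city, category), measure = sales count
base = {("bj", "food"): 10, ("bj", "drink"): 5, ("sh", "food"): 7}
by_city, city_dims = roll_up(base, ["city", "category"], "category")
# by_city aggregates out "category": {("bj",): 15, ("sh",): 7}
```

Only additive (or otherwise decomposable) measures can be rolled up this way; precise deduplication is why the global dictionaries above are needed.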
Resource Configuration
Dynamic resource allocation was tuned so that each executor provides 1 CPU core, 6 GB of heap, and 1 GB of off‑heap (overhead) memory. Total parallelism and memory were then sized as:
CPU = kylin.engine.spark-conf.spark.executor.cores × number_of_executors
Memory = (executor_memory + memory_overhead) × number_of_executors
For example, with 1,000 executors: CPU = 1 × 1,000 = 1,000 cores; Memory = (6 GB + 1 GB) × 1,000 = 7,000 GB.
File Conversion
After build, Cuboid files are bulk‑loaded into HBase as HFiles via a MapReduce job. The number of map tasks equals the number of output files from the layered build stage, so resource requests were aligned accordingly.
Implementation Roadmap – From Point to Plane
A pilot on the core sales transaction task demonstrated that the combined optimizations reduced daily build time from over two hours to under ten minutes and raised SLA achievement from 90 % to 99.99 %.
Results
Resource consumption per task decreased dramatically, and overall cluster CU usage fell while maintaining throughput. By June 2020 the SLA hit 100 %.
Outlook
Kylin graduated to an Apache top‑level project in 2015 and continues to evolve; Meituan now runs a stable V2.0 deployment and has begun testing the V3.1 release, which introduces Flink as an alternative build engine and promises further performance gains.
Author Bio
Yue Qing, Engineer at Meituan’s in‑store dining R&D center since 2019.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.