Performance Optimization of Apache Kylin at Beike: HBase Tuning, Region Management, and Slow‑Query Mitigation
This article details how Beike's engineering team scaled Apache Kylin to handle tens of millions of daily queries by optimizing HBase configurations, reducing region count, improving data locality, addressing IO and JVM GC bottlenecks, and implementing comprehensive slow‑query detection and active‑defense mechanisms.
Since 2017, Beike has deployed Apache Kylin as a company‑wide OLAP engine, operating over 100 Kylin instances, 800+ cubes, and more than 300 TB of single‑replica storage, with daily query volumes exceeding 20 million.
Problem 1 – Table/Region Inaccessibility: Critical tables sometimes had regions that became unreachable, causing cube builds to fail and queries to time out. The cluster also accumulated over 160 k regions, making table creation and deletion extremely slow.
Solution: Deleted unused tables to reduce region count, ran Kylin's storage‑cleanup job daily instead of weekly, merged cube segments weekly, upgraded HBase from 1.2.6 to 1.4.9 to use RSGroup for isolation, disabled automatic region balancing during low‑traffic periods, enabled Canary checks for region health, and isolated meta‑related tables with RSGroup.
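Several of these steps are plain HBase operator actions. A sketch in the HBase shell, valid on HBase 1.4+ with the rsgroup coprocessor enabled in hbase-site.xml (group, server, and region names below are placeholders):

```shell
# Periodic region health probe (typically run from cron)
hbase org.apache.hadoop.hbase.tool.Canary

hbase shell <<'EOF'
# Disable the automatic balancer during low-traffic windows
balance_switch false

# Create a dedicated RSGroup and isolate meta-related tables on it
add_rsgroup 'meta_group'
move_servers_rsgroup 'meta_group', ['rs-meta-01:16020']
move_tables_rsgroup 'meta_group', ['hbase:meta']

# Merge adjacent regions of an over-split table to cut region count
merge_region 'ENCODED_REGION_A', 'ENCODED_REGION_B'
EOF
```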
Problem 2 – Low RegionServer Data Locality: Only ~20 % of region data could leverage HDFS short‑circuit reads, degrading query latency.
Solution: Modified HFileOutputFormat3 to write a replica of each HFile to the DataNode hosting the RegionServer (based on HBASE‑12596), raising locality to over 80 % and improving stability.
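The effect of the favored‑nodes write path can be verified from RegionServer metrics; `percentFilesLocal` is exported through the JMX servlet on the RegionServer info port (host name below is a placeholder):

```shell
# Query the RegionServer JMX servlet and extract the data-locality
# metric; percentFilesLocal lives in the Server metrics bean.
curl -s 'http://rs-host:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server' \
  | grep -o '"percentFilesLocal" *: *[0-9.]*'
```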
Problem 3 – RegionServer IO Bottleneck: During peak build periods, high IO wait on RegionServers caused P99 response times to spike, especially when large time‑range builds saturated network bandwidth.
Solution: Redirected build output to the larger Hadoop cluster, then used a throttled DistCp to copy HFiles into the HBase HDFS cluster before bulk‑loading, applying this path only to high‑volume cubes.
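The copy‑then‑bulkload path can be sketched with stock Hadoop and HBase tooling; cluster names, paths, and the bandwidth cap are placeholders:

```shell
# The cube build writes HFiles to the compute cluster's HDFS.
# Copy them with per-mapper bandwidth capping (-bandwidth is MB/s)
# so RegionServer IO is not saturated during the transfer.
hadoop distcp -bandwidth 10 -m 20 \
  hdfs://compute-cluster/kylin/hfiles/CUBE_NAME \
  hdfs://hbase-cluster/kylin/staging/CUBE_NAME

# Bulk-load the copied HFiles into the target table (HBase 1.x class name).
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  hdfs://hbase-cluster/kylin/staging/CUBE_NAME KYLIN_TABLE_NAME
```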
Problem 4 – Slow‑Query Chain Length: Diagnosing timeouts required traversing logs across Kylin, HBase, Elasticsearch, and MySQL, leading to long resolution cycles.
Solution: Added cube and region identifiers to HBase logs via a Protobuf field and interceptor, enabling immediate alerts with cube and region details; integrated alerts into enterprise WeChat for rapid response.
Problem 5 – RegionServer Queue Backlog: Queue buildup caused P99 response times of over ten minutes, with some queries running for half an hour.
Solution: Identified the top‑10 longest‑running queries, traced them to cube RowKey orders that did not match the queries' filter columns and therefore forced large scans, and adjusted the cube designs; also explored pre‑execution SQL scoring to reject high‑risk queries before they reach HBase.
Problem 6 – Active Defense for Slow Queries: High‑latency queries occupied request queues, affecting other workloads.
Solution: Collected Kylin logs via Kafka, cleaned them in real‑time, stored in Druid, and automatically reduced timeout thresholds for offending cubes when query latency exceeded thresholds.
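A minimal sketch of the detection leg of this pipeline, assuming the Kylin query logs already land in a Kafka topic as JSON; the topic name, broker address, and field names (`cube`, `sql`, `duration_ms`) are assumptions:

```shell
# Tail the query-log topic and surface queries slower than 30 s;
# in the production pipeline the filtered stream fed Druid and
# the alerting/timeout-reduction path.
kafka-console-consumer.sh --bootstrap-server broker:9092 \
  --topic kylin-query-logs \
  | jq -c 'select(.duration_ms > 30000) | {cube, sql, duration_ms}'
```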
Problem 7 – Critical Metric Query Performance: Storing all data on HDD caused contention for high‑priority metrics.
Solution: Deployed SSDs for selected DataNodes and configured specific cubes to use SSD storage paths, achieving 40 % latency reduction for >100 k row scans and 20 % for >1 M row scans.
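Pinning a cube's HBase table onto the SSD‑backed DataNodes can be done with HDFS heterogeneous storage policies; a sketch, assuming the SSD volumes are tagged `[SSD]` in `dfs.datanode.data.dir` and the table path is illustrative:

```shell
# Tag the Kylin table's directory so new blocks land on SSD volumes
hdfs storagepolicies -setStoragePolicy \
  -path /hbase/data/default/KYLIN_CUBE_TABLE -policy ALL_SSD

# Verify the policy took effect
hdfs storagepolicies -getStoragePolicy \
  -path /hbase/data/default/KYLIN_CUBE_TABLE

# Existing blocks move only when rewritten; a major compaction
# rewrites the HFiles under the new policy.
echo "major_compact 'KYLIN_CUBE_TABLE'" | hbase shell
```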
Problem 8 – JVM GC Pauses: Frequent GC pauses on RegionServers threatened the 99.7 % sub‑3‑second query SLA.
Solution: Upgraded JDK from 1.8 (G1 GC) to JDK 13 and switched to ZGC, which reduced pause counts to near zero and dramatically lowered GC times.
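On JDK 13, ZGC still sits behind an experimental flag. An illustrative hbase-env.sh fragment for the RegionServer (heap size and log path are placeholders):

```shell
# hbase-env.sh — RegionServer JVM options for JDK 13 + ZGC.
# ZGC must be unlocked explicitly on JDK versions before 15.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:+UnlockExperimentalVMOptions -XX:+UseZGC \
  -Xms31g -Xmx31g \
  -Xlog:gc*:file=/var/log/hbase/gc-rs.log:time,uptime:filecount=10,filesize=100m"
```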
Overall, the combination of region management, data locality improvements, IO throttling, enhanced logging, active defense, storage tiering, and JVM tuning enabled Beike to sustain high query throughput while meeting strict latency SLAs.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies