How Apache Kylin Supercharged OLAP at ZTO Express: A Deep Dive
This article describes how ZTO Express adopted Apache Kylin for OLAP. It compares Kylin with Presto and covers the platform architecture, performance gains, integration with scheduling and monitoring systems, and the practical optimizations and future plans behind sub‑second query responses on massive daily data volumes.
1. OLAP Evolution at ZTO Express
On October 17, senior data engineer Wang Chenglong presented ZTO Express's experience with Apache Kylin, explaining how the company built OLAP capabilities to handle over 2 billion orders and 20 TB of daily data.
1.1 Platform Architecture
The big‑data platform consists of multiple components, with Kylin and Presto as the primary OLAP engines. The rightmost part of the diagram shows monitoring systems for each component, while the top layer includes scheduling and ad‑hoc query tools.
1.2 OLAP Development Timeline
Before 2017, Impala was the main engine. It offered fast queries and Hive compatibility, but it used a lot of memory, was unstable, and carried the cost of maintaining a C++ stack. In 2017, Presto was introduced and later accelerated with Alluxio. It provided stability, high performance, and rich data source support, but it required large clusters and scanned the same data repeatedly.
1.3 Why Choose Apache Kylin
Kylin was adopted in 2018 to address Presto’s trade‑offs. Its advantages include standard SQL support, sub‑second query speed, query performance independent of fact table size, low cluster requirements, and stable operation.
2. Apache Kylin Overview
Kylin is an open‑source, distributed analytical data warehouse that provides SQL interfaces and multi‑dimensional analysis on Hadoop/Spark, delivering sub‑second query latency on massive tables.
Pre‑computation: trades storage for query speed.
High performance: in ZTO's deployment, more than 97% of queries return within 1 s.
Scalable: horizontal scaling improves throughput.
Easy integration: JDBC/REST APIs.
Its drawbacks are a steeper learning curve for cube optimization and the fact that it suits only analyses with a fixed, predefined schema.
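The pre‑computation trade‑off can be illustrated with a toy example. The sketch below is not Kylin's implementation; it only shows the idea of aggregating every dimension combination (cuboid) ahead of time so that a GROUP BY becomes a lookup. The fact rows, dimension names, and function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (origin_province, weight_range, parcels)
FACTS = [
    ("Zhejiang", "0-1kg", 3),
    ("Zhejiang", "1-3kg", 2),
    ("Jiangsu", "0-1kg", 4),
    ("Jiangsu", "0-1kg", 1),
]
DIMENSIONS = ("origin_province", "weight_range")


def build_cuboids(rows, dimensions):
    """Pre-aggregate SUM(parcels) for every subset of the dimensions."""
    cuboids = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), size):
            table = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dims)
                table[key] += row[-1]  # last column is the measure
            cuboids[tuple(dimensions[i] for i in dims)] = dict(table)
    return cuboids


def query(cuboids, group_by):
    """Answer a GROUP BY from the matching cuboid, never touching fact rows.

    group_by must list dimensions in the same order as DIMENSIONS.
    """
    return cuboids[tuple(group_by)]


CUBOIDS = build_cuboids(FACTS, DIMENSIONS)
print(query(CUBOIDS, ["origin_province"]))
```

With N dimensions this naive approach stores 2^N cuboids, which is why storage grows as query time shrinks and why cube design (pruning unneeded combinations) carries a learning curve.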
3. Practical Implementation
3.1 Business Case: Routing Volume Analysis
The report required aggregating dozens of dimensions (origin province, hubs, weight ranges, etc.) over billions of daily records, with a 5‑second SLA. Presto queries took 20‑60 s, far exceeding the requirement.
3.2 Accelerating the Report with Kylin
Served through Kylin's JDBC interface, the routing analysis query took 2.9 s on its first run and returned in under a second on subsequent runs thanks to caching, well within the 5‑second SLA.
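The talk used JDBC. As a rough illustration in Python, the sketch below instead builds a request for Kylin's REST query endpoint (POST /kylin/api/query), which accepts the same SQL. The host, credentials, project, table, and column names are hypothetical, and the request is only constructed, not sent.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, project, sql, limit=1000):
    """Build (but do not send) a POST request for Kylin's /kylin/api/query."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps({
        "sql": sql,
        "project": project,
        "limit": limit,
        "offset": 0,
        "acceptPartial": False,
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/kylin/api/query",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical routing-volume query; table and columns are illustrative only.
SQL = (
    "SELECT origin_province, weight_range, SUM(parcels) AS total "
    "FROM route_volume GROUP BY origin_province, weight_range"
)
request = build_query_request(
    "http://kylin.example:7070", "analyst", "secret", "logistics", SQL
)
# To execute against a real server: urllib.request.urlopen(request)
```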
3.3 Scale and Performance
Kylin runs on 5 nodes (1 job node and 4 query nodes) alongside a 40‑node HBase cluster. It manages 63 cubes totaling over 33 TB, built from more than 800 billion source rows, and serves more than 10,000 queries a day, of which more than 97% finish within 1 s.
3.4 Integration with Scheduling System
Kylin’s REST APIs allow cube build, rebuild, and kill operations to be managed by the internal scheduler, enabling dependency handling, automatic retries, and alerting via phone or DingTalk.
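A minimal sketch of how a scheduler task might wrap these calls, assuming the documented PUT /kylin/api/cubes/{cubeName}/build endpoint (verify the path against your Kylin version). The cube name, host, and the retry and alert helpers are hypothetical; ZTO's internal scheduler is not public, and a real alert hook would call a phone or DingTalk service.

```python
import json
import time
import urllib.request
from datetime import datetime, timezone


def epoch_millis(day):
    """Convert a YYYY-MM-DD string to epoch milliseconds at 00:00 UTC."""
    dt = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def build_cube_request(base_url, cube, start_day, end_day, auth_header):
    """Build (but do not send) a request that triggers a segment build."""
    body = json.dumps({
        "startTime": epoch_millis(start_day),
        "endTime": epoch_millis(end_day),
        "buildType": "BUILD",
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/kylin/api/cubes/{cube}/build",
        data=body,
        method="PUT",
        headers={"Authorization": auth_header, "Content-Type": "application/json"},
    )


def run_with_retries(action, attempts=3, delay_seconds=0.0, on_failure=None):
    """Call action() until it succeeds; alert and re-raise after the last try."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:  # a real scheduler would narrow this
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)
    if on_failure is not None:
        on_failure(last_error)  # hypothetical alert hook
    raise last_error


build_request = build_cube_request(
    "http://kylin.example:7070", "route_volume_cube",
    "2020-10-16", "2020-10-17", "Basic <token>",
)
# A scheduler task could then run:
# run_with_retries(lambda: urllib.request.urlopen(build_request), on_failure=alert)
```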
3.5 Monitoring System
A custom monitoring solution tracks query volume and failures at minute granularity and automatically kills abnormal SQL to protect the cluster. Day‑level metrics include total queries, the top N slowest SQL statements, and each application's share of queries. Anomaly alerts cover cube bloat, segment gaps, job failures, missing TTL, and process health.
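Some of these checks are simple to express in code. The sketch below is an illustration only, not ZTO's monitoring system: the QueryRecord fields and function names are assumptions, and segments are modeled as half‑open (start, end) ranges.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryRecord:
    epoch_seconds: int       # when the query was submitted
    app: str                 # calling application
    duration_seconds: float
    succeeded: bool


def per_minute_counts(records):
    """Return (total, failed) query counters keyed by minute bucket."""
    totals, failures = Counter(), Counter()
    for record in records:
        minute = record.epoch_seconds // 60
        totals[minute] += 1
        if not record.succeeded:
            failures[minute] += 1
    return totals, failures


def slowest(records, n):
    """Return the n slowest queries, slowest first."""
    return sorted(records, key=lambda r: r.duration_seconds, reverse=True)[:n]


def app_share(records):
    """Return each application's fraction of all queries."""
    counts = Counter(record.app for record in records)
    total = sum(counts.values())
    return {app: count / total for app, count in counts.items()}


def find_segment_gaps(segments):
    """Return uncovered ranges between half-open (start, end) segments."""
    ordered = sorted(segments)
    if not ordered:
        return []
    gaps = []
    covered_until = ordered[0][1]
    for start, end in ordered[1:]:
        if start > covered_until:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    return gaps
```

For example, find_segment_gaps([(0, 10), (10, 20), (30, 40)]) reports the missing range (20, 30), the kind of hole that would otherwise silently drop a day of data from query results.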
3.6 Optimization Practices
Enabling Snappy compression in HBase reduced storage by ~70%, and tuning timeouts and retries improved stability. Raising MapReduce parameters (e.g., the maximum number of reducers and the number of input rows per mapper) cut build times by up to one‑third.
3.7 Data Management
Regular cleanup of metadata, temporary cube data, and expired HBase tables, plus periodic backups, keep the system healthy.
3.8 Source Code and Upgrades
Minor source modifications added Kafka publishing of query info and a distributed lock for dictionary updates. An upgrade from Kylin 2.5.1 to 3.0.2 was completed in July.
4. Future Plans
Intelligent diagnosis to supplement monitoring.
Query push‑down to Presto for non‑cube queries, creating a unified query layer.
Self‑service analytics platform leveraging Kylin.
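The planned push‑down could take roughly the following shape. This is a speculative sketch of the routing idea only, since the article does not describe the design; the exception type and the two engine callables are hypothetical stand‑ins for real Kylin and Presto clients.

```python
class NoMatchingCubeError(Exception):
    """Raised by the cube engine when no cube can answer the query."""


def unified_query(sql, cube_engine, fallback_engine):
    """Try the cube engine first; push down to the fallback if no cube matches.

    Returns (engine_name, result) so callers can see which engine answered.
    """
    try:
        return "kylin", cube_engine(sql)
    except NoMatchingCubeError:
        return "presto", fallback_engine(sql)


# Hypothetical stand-ins for real clients.
def fake_kylin(sql):
    if "route_volume" not in sql:
        raise NoMatchingCubeError(sql)
    return [("Zhejiang", 5)]


def fake_presto(sql):
    return [("ad-hoc", 1)]


print(unified_query("SELECT * FROM route_volume", fake_kylin, fake_presto))
print(unified_query("SELECT * FROM raw_events", fake_kylin, fake_presto))
```

Only the "no cube matches" case falls through; other errors, such as a malformed query, still surface to the caller instead of being retried on the slower engine.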
