Beike's OLAP Platform: Druid Adoption, Architecture, Performance Comparison, and Operational Optimizations
This article details Beike's large‑scale OLAP platform, explaining why Druid was chosen over Kylin, describing the platform's four‑layer architecture, presenting performance and storage benchmarks, and outlining practical improvements to data ingestion, real‑time distinct counting, and cluster stability for high‑concurrency business scenarios.
Beike, a leading online real‑estate service, built an OLAP platform to serve agents, operations staff, analysts, and customers, consisting of four layers: an application layer (dashboards and reports), a metric layer (modeling, jobs, metric definitions, API output), a routing layer (query translation, caching, degradation, engine switching), and an OLAP engine layer (Kylin, Druid, ClickHouse).
Initially relying on Kylin, Beike introduced Druid in May 2020 because Kylin suffered from long build times, high storage overhead, limited query flexibility, massive data inflation, and steep tuning requirements. Druid now handles about 60% of query traffic.
Selection criteria for an OLAP engine included petabyte‑scale data, sub‑second latency, high QPS (average 500‑600, peak 2000), flexible query interfaces, and fast data import. Compared with Kylin, Druid offers comparable concurrency, native SQL distinct support, and lower storage inflation (1‑3× vs. 18‑100× for Kylin).
Performance tests showed Druid's data import time is roughly one‑third of Kylin's, query latency is similar at 200 QPS, and Druid’s HDFS storage usage is far lower. Druid also achieves >99.9% three‑second success rate versus ~99.3% for Kylin.
The Druid architecture comprises four layers: a broker query service, a storage layer (hot SSD for recent six months, cold HDD for older data), a cluster management layer (overlord and coordinator), and a data ingestion layer handling both batch and real‑time jobs.
Metric construction is performed through a one‑stop platform that creates models and cubes (similar to Kylin), triggers Hive‑to‑Druid jobs, and supports various time granularities. The platform enables both API‑driven and dashboard‑driven queries.
Operational improvements include:
Optimized batch ingestion using Hadoop index jobs, reducing import time to one‑third by adjusting partitioning, shard numbers, and memory settings.
Added multi‑column real‑time distinct counting (CommonUnique) via Snowflake‑style IDs, Redis‑backed dictionaries, and 64‑bit bitmap aggregation.
Cluster stability measures: query result caching at the API/SQL layer, dynamic throttling based on historical node CPU load and business‑line priority, and deep‑storage optimizations such as segment merging, compact tasks, and limiting small‑file proliferation.
Future plans focus on expanding real‑time metrics to Druid, exploring Spark‑based and parallel ingestion jobs, and using Hive for global dictionaries to improve high‑cardinality distinct counting.
The presentation concludes with thanks and an invitation to join the DataFunTalk community.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
