Druid OLAP Engine Architecture, Practices, and Optimizations at Beike
This article describes Beike's adoption of the Druid OLAP engine, detailing its technical architecture, key data structures, offline and real‑time optimization techniques, cluster load‑protection strategies, and the platformization roadmap for scalable analytics across multiple business lines.
1. Background – Beike, a leading online real‑estate platform, generates massive volumes of real‑time and offline data. No single traditional technology could satisfy its diverse query needs, so Druid was selected for its flexibility, high concurrency, and low operational cost.
2. Druid Technical Architecture and Key Technologies
The architecture consists of four core node types: Master Servers (Coordinator & Overlord), Query Servers (Broker & optional Router), Data Servers (Historical & MiddleManager), and External Dependencies (Deep Storage, Metadata Storage, Zookeeper). Deep Storage provides shared object storage (e.g., S3, HDFS) for segment backups; Metadata Storage holds system metadata in an RDBMS; Zookeeper handles service discovery and leader election.
Druid leverages pre‑aggregation, bitmap encoding, and inverted indexes to achieve high‑performance queries. A sample dataset with fields house_type, city_name, look_cnt, and trade_cnt illustrates the dimension index structure.
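To make the dimension-index idea concrete, the sketch below builds a toy inverted index over the sample fields, using plain Python integers as bitmaps. This is an illustration of the concept only: real Druid uses dictionary-encoded columns and Roaring bitmaps, and the row values here are invented.

```python
# Toy sketch (not Druid's implementation): encode each dimension value
# as a bitmap of row ids, so filters become bitwise AND/OR operations.
rows = [
    {"house_type": "2br", "city_name": "Beijing",  "look_cnt": 12, "trade_cnt": 1},
    {"house_type": "3br", "city_name": "Beijing",  "look_cnt": 7,  "trade_cnt": 0},
    {"house_type": "2br", "city_name": "Shanghai", "look_cnt": 9,  "trade_cnt": 2},
]

def build_inverted_index(rows, dim):
    """Map each distinct dimension value to a bitmap (int) of row ids."""
    index = {}
    for row_id, row in enumerate(rows):
        index.setdefault(row[dim], 0)
        index[row[dim]] |= 1 << row_id   # set this row's bit
    return index

house_idx = build_inverted_index(rows, "house_type")
city_idx = build_inverted_index(rows, "city_name")

# Filter: house_type = '2br' AND city_name = 'Beijing'  ->  bitwise AND
match = house_idx["2br"] & city_idx["Beijing"]
matching_rows = [i for i in range(len(rows)) if (match >> i) & 1]
```

Because each dimension value owns its own bitmap, a multi-dimension filter never scans rows; it intersects bitmaps, which Roaring bitmaps make both compact and fast.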
3. Practice and Optimization
3.1 Offline Scenario – The current cluster runs on 20+ nodes, handling >30 million daily queries with peak QPS >1,200. Data ingestion uses Hive tables (full and incremental) and Hadoop index jobs. Optimizations include partition‑based Parquet file generation, increasing numShards, and repartitioning large tables to reduce import time (e.g., 1 billion rows split into 20 partitions with 5 shards each).
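A hashed `partitionsSpec` in the Hadoop index task's `tuningConfig` is one way to express the shard count described above; the fragment below is illustrative (the datasource and the value 5 mirror the example in the text, not Beike's actual job spec).

```json
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "hashed",
    "numShards": 5
  }
}
```

With the source table pre-split into 20 partitions upstream, each partition's index job then produces 5 shards in parallel, shortening total import time.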
Shard merging (compaction) is applied when shard sizes deviate from the recommended 300–700 MB range, and each segment is kept under 5 million rows to balance CPU usage across Historical nodes.
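A compaction task of the kind described might look like the following; the datasource name, interval, and row limit are illustrative stand-ins chosen to match the 5-million-row guideline above, not a spec taken from the article.

```json
{
  "type": "compact",
  "dataSource": "sample_datasource",
  "interval": "2020-01-01/2020-02-01",
  "tuningConfig": {
    "type": "index_parallel",
    "maxRowsPerSegment": 5000000
  }
}
```

Submitting such a task to the Overlord rewrites the interval's small shards into fewer, evenly sized segments.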
Engineering improvements automate these steps, as shown in the “Accelerated Offline Data Ingestion” diagram.
3.1.2 Cluster Load‑Protection – Three mechanisms are employed:
Query cache (API cache, queryEngine cache, Druid segment cache) to reduce QPS.
Dynamic throttling: Broker monitors Historical CPU load and applies tiered rate‑limiting based on business importance.
Timeout control: the default 5‑minute timeout is reduced to roughly 15 s; a custom QueryResource implementation further adjusts the timeout according to CPU load, using asynchronous threads.
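Two of the protections above, tiered admission by business importance and a load-scaled timeout, can be sketched as follows. All thresholds, tier numbers, and constants are illustrative assumptions, not Beike's actual policy.

```python
# Hypothetical sketch of load protection on the Broker.
# Tier 1 = most important business line; tiers are shed as CPU load rises.

def admit(query_tier: int, cpu_load: float) -> bool:
    """Decide whether to accept a query given Historical CPU load."""
    if cpu_load < 0.6:
        return True                # light load: admit everything
    if cpu_load < 0.8:
        return query_tier <= 2     # shed tier-3 (least important) traffic
    return query_tier == 1         # overload: critical queries only

def effective_timeout_ms(cpu_load: float,
                         base_ms: int = 15_000,
                         floor_ms: int = 3_000) -> int:
    """Shrink the ~15 s budget linearly as load grows, never below the floor."""
    return max(floor_ms, int(base_ms * max(0.0, 1.0 - cpu_load)))
```

The design intent is the same in both functions: under pressure, degrade low-priority and long-running queries first so that critical traffic keeps its latency budget.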
Monitoring graphs demonstrate CPU load spikes during peak hours and the effect of the above protections.
3.2 Real‑time Scenario – Precise deduplication is required for metrics such as GMV and GSV. Druid’s native API lacks this, so a Redis‑backed dictionary is introduced. Two approaches were evaluated: a global auto‑increment ID (rejected due to Redis QPS pressure) and Snowflake‑generated IDs (used locally by each Kafka index job).
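The dictionary idea can be sketched as below: each Kafka index task runs its own Snowflake-style generator locally, and Redis only stores the key-to-ID mapping. The bit layout, the `id_for` helper, and the plain dict standing in for Redis are all illustrative assumptions; in production the lookup would use an atomic SETNX-then-GET against Redis.

```python
import threading
import time

class Snowflake:
    """Minimal Snowflake-style 64-bit ID generator.
    Illustrative layout: 41-bit ms timestamp | 10-bit worker id | 12-bit sequence."""
    def __init__(self, worker_id: int):
        self.worker_id = worker_id & 0x3FF
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:             # sequence exhausted this ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.seq

# Stand-in for Redis: a plain dict. Production code would use
# SETNX-style first-writer-wins semantics instead.
dictionary = {}

def id_for(user_key: str, gen: Snowflake) -> int:
    """Assign a stable 64-bit ID per deduplication key."""
    if user_key not in dictionary:
        dictionary[user_key] = gen.next_id()
    return dictionary[user_key]
```

Generating IDs locally avoids the Redis QPS pressure of a global auto-increment counter: Redis is consulted only once per distinct key, not once per generated ID.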
The commonUnique metric type replaces the 32‑bit RoaringBitmap with a 64‑bit Roaring64NavigableMap to accommodate Snowflake IDs. Three key classes were added: CommonUniqueComplexMetricSerde, which registers the new metric and encodes values via Redis; CommonUniqueAggregator, which implements the aggregation logic; and CommonUniqueAggregatorFactory, which creates the aggregator.
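The aggregation semantics can be sketched as follows, with a Python set standing in for Roaring64NavigableMap (the actual implementation is Java inside a Druid extension; the class and method names in the comments refer to the real bitmap API, everything else is illustrative).

```python
# Sketch of commonUnique aggregation semantics: each segment keeps a
# 64-bit bitmap of Snowflake IDs, partial results are merged by union,
# and the final metric value is the union's cardinality.
class CommonUniqueSketch:
    def __init__(self):
        self.bitmap = set()            # stand-in for Roaring64NavigableMap

    def aggregate(self, snowflake_id: int) -> None:
        self.bitmap.add(snowflake_id)  # addLong() on the real bitmap

    def combine(self, other: "CommonUniqueSketch") -> None:
        self.bitmap |= other.bitmap    # or() on the real bitmap

    def cardinality(self) -> int:
        return len(self.bitmap)        # getLongCardinality()
```

Because the merge is a set union, deduplication stays exact across segments and across Kafka index tasks, unlike approximate sketches such as HyperLogLog.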
Configuration examples for metricsSpec and a sample query using commonUnique are provided.
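Such configuration might look like the fragments below; the datasource, field names, and interval are illustrative stand-ins, since the article's original examples are not reproduced here.

```json
"metricsSpec": [
  { "type": "commonUnique", "name": "uv", "fieldName": "user_id" }
]
```

A query would then reference the ingested metric with the same aggregator type:

```json
{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "aggregations": [
    { "type": "commonUnique", "name": "uv", "fieldName": "uv" }
  ],
  "intervals": ["2020-01-01/2020-01-02"]
}
```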
3.3 Platformization – A unified OLAP platform (vili) abstracts engine specifics, offering six capabilities: data‑warehouse modeling, cube modeling, metric definition, metric processing, unified query API, and tooling for metadata refresh, data back‑fill, and cross‑engine comparison. The overall system architecture is illustrated in the “OLAP System Overall Architecture” diagram.
4. Future Plans
Deepen real‑time analytics on Druid, potentially adding update capabilities.
Accelerate offline ingestion by exploring Spark‑based pipelines and parallel index imports with global dictionary support.
Optimize query performance for high‑cardinality dimensions via pre‑aggregated sub‑indexes.
Further separate engine‑specific functions into the platform layer, centralize metadata, task, and log management, and enforce API rate‑limiting at the platform level.
Beike Product & Technology
