Kuaishou Druid Platform Overview and Precise Deduplication Design
This article presents Kuaishou’s adoption of Apache Druid for massive real‑time analytics, explains why precise deduplication is required, details the platform’s architecture, the hashset and dictionary‑plus‑Bitmap deduplication designs, concurrency handling, performance optimizations, and outlines the future roadmap, providing practical insights for big‑data engineers.
Kuaishou selected Apache Druid as the core engine for its massive real‑time analytics platform because its business demands ultra‑large data volumes, millisecond‑level query latency, high concurrency, and flexible schemas. Native Druid lacks precise deduplication, which is essential for scenarios such as billing, prompting the need for custom solutions.
The platform ingests both real-time Kafka streams and offline Hadoop data, indexes them with Kafka indexing and Hadoop indexing tasks, and stores them as Druid segments with per-business isolation and hot-cold tiering. External interfaces include custom APIs, Kwai BI visualisation, and limited Tableau support. Supporting systems provide metric monitoring, query probing, and a customised management UI built on top of Druid's JSON APIs.
Out of the box, Druid offers only approximate distinct counts (the cardinality and hyperUnique aggregators) or exact but resource-heavy group-by queries. The community DistinctCount plugin is limited to a single dimension and cannot deduplicate across intervals.
Two precise deduplication schemes were evaluated. The first uses a hashset to store raw values, which guarantees exactness but can consume up to ten times the memory of the raw data (e.g., 500 M strings can require ~5 GB). Working around this with a distributed hashset or MapReduce-style shuffling conflicted with Druid's architecture, so the scheme was impractical.
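To make the memory cost of the hashset scheme concrete, here is a minimal Python sketch (not Kuaishou's code). The sample IDs and the helper name `exact_distinct_hashset` are made up for illustration, and the overhead measured is that of CPython objects, not of a JVM hashset.

```python
import sys


def exact_distinct_hashset(values):
    """Exact distinct count that keeps every raw value in a set."""
    seen = set()
    for v in values:
        seen.add(v)
    return len(seen), seen


# Hypothetical input: 5,000 events produced by 1,000 distinct users.
user_ids = [f"user-{i % 1000:06d}" for i in range(5000)]
count, seen = exact_distinct_hashset(user_ids)

# Bytes of actual character data versus bytes held by the set and its strings.
payload_bytes = sum(len(s) for s in seen)
held_bytes = sys.getsizeof(seen) + sum(sys.getsizeof(s) for s in seen)

print(count, payload_bytes, held_bytes)
```

The count is exact, but the structure holds several times more memory than the characters themselves, and that gap is what pushed the design away from raw-value sets.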
The second scheme combines dictionary encoding with bitmap storage. Input strings are encoded to integers, the integers are stored as bits in a bitmap, and bitmaps are combined with set operations such as union and intersection. A bitmap covering the full 32-bit ID space (about 4.2 billion distinct values) needs only ~500 MB uncompressed and can be compressed further. Kylin's experience inspired the final design.
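The dictionary-plus-bitmap idea can be sketched in a few lines of Python. This is an illustrative toy, not the platform's implementation: a plain `dict` stands in for the global dictionary, a Python integer stands in for the bitmap, and the names `build_bitmap` and `cardinality` are hypothetical. A production system would typically use a compressed bitmap format instead.

```python
def build_bitmap(values, dictionary):
    """Encode each value to a dense integer ID and set that bit."""
    bitmap = 0
    for v in values:
        # New values get the next free ID; known values keep their ID.
        idx = dictionary.setdefault(v, len(dictionary))
        bitmap |= 1 << idx
    return bitmap


def cardinality(bitmap):
    """Number of set bits, i.e. the exact distinct count."""
    return bin(bitmap).count("1")


dictionary = {}
day1 = build_bitmap(["alice", "bob", "alice"], dictionary)
day2 = build_bitmap(["bob", "carol"], dictionary)

users_either_day = cardinality(day1 | day2)  # union: exact distinct users
users_both_days = cardinality(day1 & day2)   # intersection: returning users

# One bit per possible 32-bit ID, before any compression.
FULL_INT32_BITMAP_BYTES = 2**32 // 8

print(users_either_day, users_both_days, FULL_INT32_BITMAP_BYTES)
```

The last constant is the arithmetic behind the ~500 MB figure: 2^32 bits is 512 MiB regardless of how long the original strings were.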
Dictionary encoding is implemented with an AppendTrie tree model similar to Kylin’s. Global integer IDs are assigned per node, allowing incremental appends without ID changes. To control memory growth, large sub‑trees are split when a threshold is exceeded, and a Guava LoadingCache with LRU eviction lazily loads sub‑trees.
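The key property of the append-style trie is that an ID, once assigned, never changes when more values arrive. A minimal Python sketch of that property follows. The class and method names are hypothetical, and sub-tree splitting and the LRU-cached lazy loading described above are deliberately left out.

```python
class _Node:
    __slots__ = ("children", "value_id")

    def __init__(self):
        self.children = {}    # next character -> child node
        self.value_id = None  # set only on nodes that end a stored value


class AppendOnlyTrieDictionary:
    """Toy trie dictionary: IDs are assigned once and never reassigned."""

    def __init__(self):
        self._root = _Node()
        self._next_id = 0

    def get_or_add(self, value):
        node = self._root
        for ch in value:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = _Node()
            node = child
        if node.value_id is None:
            node.value_id = self._next_id
            self._next_id += 1
        return node.value_id

    def lookup(self, value):
        node = self._root
        for ch in value:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value_id


d = AppendOnlyTrieDictionary()
first_batch = {v: d.get_or_add(v) for v in ["app", "apple", "bat"]}
second_batch = {v: d.get_or_add(v) for v in ["ap", "apple", "cat"]}
print(first_batch, second_batch)
```

Because "apple" keeps the ID it received in the first batch, bitmaps built before the second append remain valid afterwards.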
Concurrent dictionary construction relies on multi-version concurrency control (MVCC), with dictionary versions persisted on HDFS, and on ZooKeeper distributed locks keyed by datasource and column, which ensure that only one process builds a given dictionary at a time.
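The interplay of versioned snapshots and per-dictionary locks can be illustrated with an in-process analogy. In the sketch below, a `threading.Lock` per (datasource, column) key stands in for the ZooKeeper lock and an in-memory list of snapshots stands in for versions on HDFS. All names, including the column `user_id`, are assumptions for illustration.

```python
import threading


class VersionedDictionaryStore:
    """Toy MVCC store: builders publish new snapshots, old ones stay untouched."""

    def __init__(self):
        self._versions = {}  # key -> list of published snapshots
        self._locks = {}     # key -> lock (stand-in for a distributed lock)
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def latest(self, key):
        versions = self._versions.get(key)
        return versions[-1] if versions else {}

    def append(self, key, values):
        # Only one builder per (datasource, column) at a time.
        with self._lock_for(key):
            working = dict(self.latest(key))  # copy, never mutate a snapshot
            for v in values:
                working.setdefault(v, len(working))
            self._versions.setdefault(key, []).append(working)
            return len(self._versions[key])   # new version number


store = VersionedDictionaryStore()
key = ("AD_active_user", "user_id")
store.append(key, ["a", "b"])
snapshot_v1 = store.latest(key)

threads = [
    threading.Thread(target=store.append, args=(key, [f"u{i}", "a"]))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

final = store.latest(key)
print(len(final), snapshot_v1)
```

Readers holding `snapshot_v1` are unaffected by later builds, while the lock guarantees that concurrent builders never hand out the same ID twice.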
Precise deduplication is exposed as a new unique metric. Implementation involves defining a ComplexMetricSerde for serialization, an Aggregator (and its BufferAggregator variant) for aggregation, and an AggregatorFactory for metric registration.
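The division of labour among the three pieces can be sketched as a Python analogy. This is not Druid's Java API: the class and method names below only mirror the roles described above (serialization, per-row aggregation, and combining or finalizing results), and a Python integer again stands in for the bitmap.

```python
class UniqueSerde:
    """Role of the serde: turn the intermediate bitmap into bytes and back."""

    @staticmethod
    def to_bytes(bitmap):
        length = max(1, (bitmap.bit_length() + 7) // 8)
        return bitmap.to_bytes(length, "little")

    @staticmethod
    def from_bytes(data):
        return int.from_bytes(data, "little")


class UniqueAggregator:
    """Role of the aggregator: fold dictionary-encoded IDs into one bitmap."""

    def __init__(self):
        self.bitmap = 0

    def aggregate(self, encoded_id):
        self.bitmap |= 1 << encoded_id

    def get(self):
        return self.bitmap


class UniqueAggregatorFactory:
    """Role of the factory: create aggregators, merge partials, finalize."""

    def factorize(self):
        return UniqueAggregator()

    @staticmethod
    def combine(lhs, rhs):
        return lhs | rhs

    @staticmethod
    def finalize(bitmap):
        return bin(bitmap).count("1")


factory = UniqueAggregatorFactory()
seg_a, seg_b = factory.factorize(), factory.factorize()
for encoded_id in (0, 1, 1, 5):
    seg_a.aggregate(encoded_id)
for encoded_id in (1, 7):
    seg_b.aggregate(encoded_id)

# Partial results travel as bytes between processes, then get merged.
wire_a = UniqueSerde.to_bytes(seg_a.get())
wire_b = UniqueSerde.to_bytes(seg_b.get())
merged = factory.combine(UniqueSerde.from_bytes(wire_a), UniqueSerde.from_bytes(wire_b))
distinct = factory.finalize(merged)
print(distinct)
```

Keeping the intermediate result as a bitmap until the final step is what allows partial results from different segments to be merged without double counting.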
Performance optimisations include:
Increasing the number of segments to 10 reduced query time from 50 s to 7 s.
Switching to BatchOr bitmap union lowered time to 4 s.
Disabling gzip compression on broker‑historical communication cut time to 2 s.
Resource‑isolated deployment separates hot and cold workloads across proxies.
Materialised views (dimension and time‑series) dramatically improve query speed, with low storage inflation and high hit rates.
Historical fast‑restart using lazy segment loading reduced restart time from ~40 min to ~2 min.
Kafka indexing task auto‑scaling and memory‑based slot allocation saved >65 % memory for real‑time tasks and >87 % for offline tasks.
Metadata indexing (adding indexes on druid_segments) cut segment discovery from minutes to milliseconds.
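The BatchOr step in the list above can be pictured as merging many bitmaps into one reusable accumulator instead of OR-ing them pairwise and allocating an intermediate result each time. The Python sketch below is one plausible reading of that idea, assuming fixed-size uncompressed bitmaps; it is not the platform's code and says nothing about the actual timings.

```python
def pairwise_or(bitmaps):
    """Naive union: every step allocates a brand-new intermediate bitmap."""
    result = bytes(len(bitmaps[0]))
    for bm in bitmaps:
        result = bytes(a | b for a, b in zip(result, bm))
    return result


def batch_or(bitmaps):
    """Batch union: all inputs are OR-ed in place into one accumulator."""
    acc = bytearray(len(bitmaps[0]))
    for bm in bitmaps:
        for i, byte in enumerate(bm):
            if byte:  # skip empty bytes entirely
                acc[i] |= byte
    return bytes(acc)


def cardinality(bitmap):
    return sum(bin(byte).count("1") for byte in bitmap)


def make_bitmap(ids, size_bytes=16):
    bm = bytearray(size_bytes)
    for i in ids:
        bm[i // 8] |= 1 << (i % 8)
    return bytes(bm)


# Hypothetical per-segment bitmaps of dictionary-encoded user IDs.
segment_bitmaps = [
    make_bitmap([1, 2, 3]),
    make_bitmap([3, 40, 41]),
    make_bitmap([2, 100]),
]
merged = batch_or(segment_bitmaps)
print(cardinality(merged))
```

Both functions return the same bitmap; the batch version simply avoids creating one temporary object per input, which matters more as the number of segments grows.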
The roadmap focuses on online‑high‑availability multi‑cluster deployment, enhanced usability through full SQL support, and kernel‑level performance work such as adaptive materialised views, numeric indexes, and a vectorised execution engine.
Community contributions include PR #7594 (precise deduplication) and PR #6988 (historical fast‑restart). The presenter, Deng Fangyuan, is a senior data‑platform engineer at Kuaishou with extensive experience in Hadoop, Kylin, and Druid.
Sample Druid segment descriptor (JSON) used in the platform:
{
  "dataSource": "AD_active_user",
  "interval": "2018-04-01T00:00:00.000+08:00/2018-04-02T00:00:00.000+08:00",
  "version": "2018-04-01T00:04:07.022+08:00",
  "loadSpec": {
    "type": "hdfs",
    "path": "/druid/segments/AD_active_user/20180401T000000.000+0800_20180402T000000.000+0800/2018-04-01T00_04_07.022+08_00/1/index.zip"
  },
  "dimensions": "appkey,spreadid,pkgid",
  "metrics": "myMetrics,count,offsetHyperLogLog",
  "shardSpec": {
    "type": "numbered",
    "partitionNum": 1,
    "partitions": 0
  },
  "binaryVersion": 9,
  "size": 168627,
  "identifier": "AD_active_user_2018-04-01T00:00:00.000+08:00_2018-04-02T00:00:00.000+08:00_2018-04-01T00:04:07.022+08:00_1"
}
DataFunTalk
