How Druid Uses Bitmap Indexes for Fast Queries and Precise Deduplication
This article explains how Apache Druid builds and queries bitmap indexes for efficient OLAP analysis, and describes a dictionary‑encoding plus bitmap solution—adapted from Kuaishou—to achieve exact deduplication even on high‑cardinality dimensions.
Druid Bitmap Index
Bitmap indexes are widely used in many big‑data OLAP engines such as Druid, Kylin, and Doris. In Huolala’s data platform, Druid powers analysis scenarios like Compass and A/B testing. This article shows how to construct bitmap indexes in Druid for fast queries and how to leverage bitmap’s deduplication capability.
Dimension Dictionary
When building a bitmap index, Druid first creates a dimension dictionary for each dimension column. All distinct values are collected, sorted, and stored in an array; each value receives an integer code equal to its array index.
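A minimal sketch of this step (plain Python, not Druid's actual internals):

```python
def build_dictionary(column_values):
    """Collect the distinct values of a column, sort them, and assign
    each value an integer code equal to its position in the sorted array."""
    sorted_values = sorted(set(column_values))
    return {value: code for code, value in enumerate(sorted_values)}

# city_id column from the example table below
city_ids = [1001, 1001, 1002, 1003]
dictionary = build_dictionary(city_ids)
# dictionary == {1001: 0, 1002: 1, 1003: 2}
```

Because codes are array positions in sorted order, range predicates on the original values map to contiguous code ranges, which is one reason Druid sorts the dictionary.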
Building the Bitmap Index
For every distinct value in the dictionary, Druid generates a bitmap where a bit set to 1 indicates that the corresponding row contains that value. The following table illustrates the source data and the resulting bitmap construction process.
rowNum  city_id  route_type  pay_type  count
0       1001     1           2         5
1       1001     2           3         15
2       1002     1           1         22
3       1003     2           1         20
The construction steps are:
Assign an auto‑incrementing rowNum to each row.
Traverse each dimension column and build the corresponding bitmap array.
Taking city_id as an example, its dictionary is [1001, 1002, 1003], so its bitmap array has three entries, one per dictionary code. bitmaps[0] has bits 0 and 1 set because city_id = 1001 appears in rows 0 and 1; likewise bitmaps[1] has bit 2 set (city_id = 1002 appears only in row 2) and bitmaps[2] has bit 3 set.
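The two construction steps can be sketched as follows; Python ints serve as uncompressed bitsets here, whereas Druid uses compressed bitmaps (Roaring or CONCISE):

```python
def build_bitmaps(column_values):
    """For each distinct value, build a bitmap in which bit r is set
    when row r contains that value."""
    dictionary = {v: code for code, v in enumerate(sorted(set(column_values)))}
    bitmaps = [0] * len(dictionary)
    for row_num, value in enumerate(column_values):   # rowNum is the bit position
        bitmaps[dictionary[value]] |= 1 << row_num
    return dictionary, bitmaps

dictionary, bitmaps = build_bitmaps([1001, 1001, 1002, 1003])
# bitmaps[0] == 0b0011: city_id = 1001 appears in rows 0 and 1
```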
How Queries Use the Index
A simplified query example demonstrates how Druid uses the bitmap structures to answer a request:

select sum(count) from table where city_id = 1001 and route_type = 2

Druid looks up each filter value in the corresponding dimension dictionary, fetches that value's bitmap, and intersects the bitmaps (bitwise AND); the set bits of the result identify the matching rows, whose count values are then summed. The actual Druid implementation involves additional complexities such as index storage and fast dictionary lookup, which are beyond the scope of this article.
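A toy end-to-end version of this query path (bitmap construction plus AND-based filtering; a sketch, not Druid's code):

```python
def query_sum(rows, filters):
    """Answer sum(count) over rows matching all filters, using bitmaps.
    `rows` is a list of dicts; `filters` maps column name -> required value."""
    # Build a value -> bitmap map per filtered column (ints as bitsets).
    bitmaps = {}
    for col in filters:
        col_bitmaps = {}
        for row_num, row in enumerate(rows):
            col_bitmaps[row[col]] = col_bitmaps.get(row[col], 0) | (1 << row_num)
        bitmaps[col] = col_bitmaps
    # Intersect the bitmaps of the filter values (bitwise AND).
    matched = (1 << len(rows)) - 1
    for col, value in filters.items():
        matched &= bitmaps[col].get(value, 0)
    # Sum `count` over the surviving row numbers.
    return sum(rows[r]["count"] for r in range(len(rows)) if matched & (1 << r))

rows = [
    {"city_id": 1001, "route_type": 1, "count": 5},
    {"city_id": 1001, "route_type": 2, "count": 15},
    {"city_id": 1002, "route_type": 1, "count": 22},
    {"city_id": 1003, "route_type": 2, "count": 20},
]
query_sum(rows, {"city_id": 1001, "route_type": 2})  # matches row 1 only -> 15
```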
Druid Precise Deduplication
Exact deduplication on massive datasets is challenging because precision and low latency often conflict. In Huolala’s OLAP queries—e.g., A/B experiments that need driver‑level deduplication—Druid’s native support is insufficient, so a solution from Kuaishou was integrated.
Dictionary‑Encoding + Bitmap Solution
The approach adds a unique metric that stores a global integer dictionary encoding of the target column. During ingestion the encoded values are stored as bitmaps; at query time the bitmaps are combined with set operations (a union whose cardinality gives the exact distinct count, or an intersection for overlap between groups) to obtain the exact deduplication result. A bitmap covering the full 32-bit code space (about 4.2 billion bits) occupies roughly 500 MB uncompressed, and compression dramatically reduces the footprint.
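The idea in miniature (Python ints stand in for the compressed bitmaps; the encoding and segment layout are simplified assumptions, not the production format):

```python
global_dict = {}                       # value -> stable integer code

def encode(value):
    # Assign the next code on first sight; existing codes never change.
    return global_dict.setdefault(value, len(global_dict))

def ingest_segment(values):
    """Ingestion: encode each value and set its bit in the segment bitmap."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << encode(v)
    return bitmap

seg_a = ingest_segment(["driver_1", "driver_2"])
seg_b = ingest_segment(["driver_2", "driver_3"])
# Query time: merge segment bitmaps, cardinality = exact distinct count.
exact_distinct = bin(seg_a | seg_b).count("1")   # -> 3
```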
Dictionary Encoding
The encoding uses Apache Kylin’s AppendTrie, a high‑performance trie‑based dictionary. Nodes store a stable mapping ID (instead of position) to allow incremental additions without changing existing IDs. The trie is split into multiple sub‑trees to limit memory usage, and an LRU policy controls loading and eviction.
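The key property of AppendTrie, stable IDs under appends, can be illustrated with a deliberately simplified structure (this is not the real trie; AppendTrieDictionary additionally shards the trie into sub-trees and LRU-loads them):

```python
class AppendOnlyDict:
    """IDs are assigned once and never change when new values arrive,
    unlike a sorted dictionary, where inserting a value would shift the
    positions (and hence the codes) of every later entry."""
    def __init__(self):
        self._ids = {}

    def get_or_assign(self, value):
        # Existing values keep their ID; new values append at the end.
        return self._ids.setdefault(value, len(self._ids))

d = AppendOnlyDict()
first = d.get_or_assign("b")            # -> 0
d.get_or_assign("a")                    # added later, gets a new ID
assert d.get_or_assign("b") == first    # "b" keeps its original ID
```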
Concurrent Dictionary Construction
MVCC is used to version each persisted dictionary on HDFS; updates create a new working copy and later become a new version, with automatic version eviction.
Zookeeper distributed locks (keyed by data source and field name) prevent simultaneous writes to temporary directories during dictionary building.
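The MVCC versioning scheme can be sketched as follows; the in-memory list is a stand-in for versioned dictionary directories on HDFS, and the names are illustrative, not the real implementation:

```python
class VersionedDict:
    """Copy-on-write dictionary versions with automatic eviction."""
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []                      # committed versions, oldest first

    def begin_update(self):
        # Work on a copy of the latest version, never in place, so
        # concurrent readers keep seeing a consistent snapshot.
        latest = self.versions[-1] if self.versions else {}
        return dict(latest)

    def commit(self, working_copy):
        self.versions.append(working_copy)      # working copy becomes a version
        while len(self.versions) > self.max_versions:
            self.versions.pop(0)                # evict the oldest versions

    def read_latest(self):
        return self.versions[-1]

store = VersionedDict(max_versions=2)
for value in ["d1", "d2", "d3"]:
    copy = store.begin_update()
    copy.setdefault(value, len(copy))
    store.commit(copy)
# Only the 2 newest versions remain; the latest holds all three codes.
```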
Precise Deduplication Workflow
DetermineConfigurationJob calculates the number of segments and sets the reducer count for the next job.
The second job builds the global dictionary: each mapper sends rows of the same dedup column to a reducer, which creates the dictionary and stores it on HDFS, using Zookeeper locks to avoid conflicts.
The third job ingests data: mappers load the dictionary, encode the dedup column to integers, reducers aggregate the integers into bitmaps, and write a segment.
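A single-process simulation of the second and third jobs (the real jobs are MapReduce stages, with the dictionary persisted on HDFS under ZooKeeper locks; keys and values here are made up for illustration):

```python
rows = [{"driver_id": d, "city": c}
        for d, c in [("d1", "sh"), ("d2", "sh"), ("d1", "bj"), ("d3", "bj")]]

# Job 2: build the global dictionary over the dedup column.
global_dict = {}
for row in rows:
    global_dict.setdefault(row["driver_id"], len(global_dict))

# Job 3: encode the dedup column and aggregate the codes into bitmaps,
# one bitmap per aggregation key (the city, in this toy example).
segments = {}
for row in rows:
    code = global_dict[row["driver_id"]]
    segments[row["city"]] = segments.get(row["city"], 0) | (1 << code)

# Exact distinct drivers across both cities: union, then cardinality.
distinct = bin(segments["sh"] | segments["bj"]).count("1")   # -> 3
```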
This pipeline preserves full data detail while keeping storage overhead low, enabling efficient exact deduplication.
High‑Cardinality Integer Deduplication Refactor
For driver‑level deduplication, the original dictionary‑based method caused excessive data expansion and long offline import times because of the massive cardinality of driver IDs. The refactor skips global dictionary construction and stores raw integer IDs directly as bitmaps.
Support ingestion of bitmap‑encoded strings from upstream; the data remains a bitmap after aggregation.
Support ingestion of integer array types; upstream converts driver IDs to a bitmap string, avoiding data splitting and dictionary lookup.
Benefits: the end‑to‑end pipeline time dropped from over two hours to about 40 minutes.
Drawbacks: raw integer bitmaps consume more disk space than encoded integers, and the solution currently only handles integer‑type high‑cardinality fields.
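The refactored path can be sketched like this; hex strings stand in for the serialized bitmap format (the real pipeline would use a Roaring bitmap encoding), and the function names are illustrative:

```python
def ids_to_bitmap_string(driver_ids):
    """Upstream side: set each raw integer ID directly as a bit position,
    with no dictionary lookup, then serialize the bitmap to a string."""
    bitmap = 0
    for driver_id in driver_ids:
        bitmap |= 1 << driver_id
    return format(bitmap, "x")

def merge_bitmap_strings(strings):
    """Ingestion/query side: deserialize, OR the bitmaps together, and
    take the cardinality for the exact distinct count."""
    merged = 0
    for s in strings:
        merged |= int(s, 16)
    return bin(merged).count("1")

upstream_a = ids_to_bitmap_string([10001, 10002])
upstream_b = ids_to_bitmap_string([10002, 10003])
merge_bitmap_strings([upstream_a, upstream_b])   # -> 3
```

Skipping the dictionary removes the two dictionary-related jobs from the pipeline, which is where the roughly 80-minute saving comes from, at the cost of larger uncompressed bitmaps when IDs are sparse.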
Conclusion
The article described how Apache Druid builds bitmap indexes and dimension dictionaries for efficient querying, and presented a Kuaishou‑inspired dictionary‑encoding plus bitmap solution for precise deduplication, including custom enhancements to support high‑cardinality integer dimensions.
Author: Zhang Fang, Senior Big‑Data Engineer at Huolala.