Boost ClickHouse Bitmap Queries 10x with BitBooster: Techniques & Results
This article explains how the BitBooster suite accelerates ClickHouse bitmap (BitMap) queries by up to tenfold, covering background, performance bottlenecks, single‑node and read optimizations, layout and instruction‑set enhancements, encoding dictionaries, multi‑node scaling, and real‑world benchmark results.
Recommendation systems demand strong personalization, driving the creation of sub‑second analytical engines. WeChat's big‑data platform built the WeOLAP sub‑second query engine on ClickHouse and introduced a series of performance optimizations called BitBooster.
1. Background
To achieve second‑level interactive analysis for scenarios such as audience estimation and retention analysis, Join operations are often replaced by BitMap calculations, yielding several‑fold speedups. In practice, however, ClickHouse’s BitMap operations sometimes degrade, causing severe tail latency.
BitBooster is the collective name for a continuously iterated set of optimizations to Bitmap queries. It can accelerate related analyses by up to 10×.
1.1 Performance First
Before deployment: Single‑group BitMap 64 merge took over 100 seconds.
After deployment: Real‑time access achieved, single‑group BitMap 64 merge took only 0.3 seconds, a performance increase of over 300×.
Note: The above case uses Bitmap uint64, where the effect is especially pronounced.
1.2 Business Scenarios
Audience Estimation: Filter ad‑delivery users, estimate hit counts, assist budgeting.
Behavior Analysis: Retention, differential analysis, etc.
Experiment Analysis: A/B testing before model rollout.
Red‑dot Delivery: Operational activities, impact assessment.
These online, highly interactive analyses typically require sub‑second response times. For audience estimation, users are selected based on attribute conditions, essentially performing fast set operations.
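Conceptually, audience estimation reduces to intersecting per‑tag bitmaps and counting the result. The sketch below illustrates this with ClickHouse’s built‑in bitmap functions; the `tag_bm` table and tag values are hypothetical, not from the original system.

```sql
-- Hypothetical table: one row per tag, with the tag's audience as a bitmap.
-- Estimating users matching both conditions is a bitmapAnd + cardinality:
SELECT bitmapCardinality(bitmapAnd(a.bm, b.bm)) AS estimated_audience
FROM tag_bm AS a, tag_bm AS b
WHERE a.tag = 'age_18_25' AND b.tag = 'city_shenzhen';
```

Because the entire computation stays in compressed bitmap form, no row‑level Join is needed.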
1.3 Why Is the Native Bitmap Slow?
“RoaringBitmap is excellent in performance and space utilization and is widely used in mature open‑source software (Spark, Druid, Kylin).”
ClickHouse’s kernel uses RoaringBitmap as the underlying Bitmap structure, which is highly compressed. In many cases RoaringBitmap is fast, but ClickHouse’s BitMap operations can suffer from data skew, insufficient read parallelism, sparse storage layout, lack of vectorized execution, and under‑utilized cluster resources.
2. Key Technologies
2.1 Single‑Node Computation Optimization
ClickHouse parallelizes SQL execution via a DAG and a thread pool. Each thread traverses the DAG, executing any ready node. However, a DAG node can be processed by only one thread at a time, so a slow Pipe can become a bottleneck.
Because Bitmap data can be highly skewed, some Pipes handle far more BitMap elements than others, leading to data‑skew‑induced stragglers. To mitigate this, we added a Repartition stage to the DAG for Bitmap‑related queries, redistributing work among Pipes when certain conditions are met.
The Repartition step yields a 10‑20% performance gain in skewed scenarios.
-- Query S1 example
CREATE TABLE test.bm_64 (
ds Date,
bm AggregateFunction(groupBitmap, UInt64),
id String
) ENGINE = MergeTree
PARTITION BY ds
ORDER BY id;
INSERT INTO test.bm_64 SELECT '2022-06-10', groupBitmapState(rand64()), 'rand' FROM numbers(10000000);
SELECT ds, bitmapCardinality(bm) FROM test.bm_64 WHERE id = 'rand' SETTINGS force_repartition_after_reading_from_merge_tree = 1;

2.2 Read Optimization
Profiling showed that two of three execution threads spent most of their time waiting, while one thread was busy reading BitMaps. ClickHouse reads data in Mark‑level units (default granularity 8192 rows). This coarse granularity limits parallelism for large BitMaps.
We introduced an asynchronous deserialization interface for large BitMaps, allowing other threads to handle element‑wise deserialization, which reduced thread wait time noticeably.
2.3 Layout Optimization
RoaringBitmap stores 32‑bit elements using a two‑level index (high 16 bits as key, low 16 bits in a Container). Sparse Containers (Array) lead to slower operations. For 64‑bit elements, an additional TreeMap layer makes the layout even sparser, causing severe performance degradation.
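The two‑level split can be made concrete with ClickHouse’s bit functions. This is purely illustrative of how RoaringBitmap indexes a 32‑bit element, not code from the optimization itself:

```sql
-- RoaringBitmap routes each 32-bit element by its high 16 bits (the
-- container key) and stores the low 16 bits inside that container.
SELECT
    number AS element,
    bitShiftRight(number, 16) AS container_key,   -- high 16 bits
    bitAnd(number, 0xFFFF)    AS container_value  -- low 16 bits
FROM numbers(3);
```

When elements are sparse, each container holds few values and falls back to the slower Array representation; with 64‑bit elements, the extra TreeMap level over these containers compounds the problem.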
Encoding the BitMap into a compact sequential representation dramatically improves both storage and computation speed. We added an EncodedDictionary that integrates with ClickHouse’s existing dictionary framework, exposing a new function bitmapDictEncode to transform an unencoded BitMap into a dense encoded one.
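A minimal usage sketch follows. The function name bitmapDictEncode comes from the article, but its exact signature is not documented there; the dictionary name and argument order shown here are assumptions.

```sql
-- Sketch only: assumes bitmapDictEncode(dict_name, bm) re-maps the raw
-- 64-bit identifiers in bm into a dense, sequential encoded bitmap via
-- the hypothetical dictionary 'uid_dict'.
SELECT bitmapCardinality(bitmapDictEncode('uid_dict', bm)) AS encoded_card
FROM test.bm_64
WHERE id = 'rand';
```

The encoded bitmap is dense, so its containers stay in the fast Bitset/Run representations and the TreeMap level is largely avoided.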
2.4 Instruction‑Set Optimization
By enabling AVX2/AVX‑512 vectorization, bitmap logical operations (AND, OR, XOR, ANDNOT) gain 3.5‑5× speedups. Compilation flags such as -DENABLE_AVX2=1 and -DENABLE_AVX2_FOR_SPEC_OP=1 activate these optimizations.
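For reference, these are the four logical operations the vectorized kernels accelerate, shown with ClickHouse’s standard bitmap functions on small literal bitmaps:

```sql
-- a = {1,2,3,4}, b = {3,4,5,6}
WITH
    bitmapBuild([1, 2, 3, 4]) AS a,
    bitmapBuild([3, 4, 5, 6]) AS b
SELECT
    bitmapCardinality(bitmapAnd(a, b))    AS and_card,    -- {3,4}     -> 2
    bitmapCardinality(bitmapOr(a, b))     AS or_card,     -- {1..6}    -> 6
    bitmapCardinality(bitmapXor(a, b))    AS xor_card,    -- {1,2,5,6} -> 4
    bitmapCardinality(bitmapAndnot(a, b)) AS andnot_card; -- {1,2}     -> 2
```

The same SQL runs unchanged whether or not the AVX2/AVX‑512 build flags are enabled; only the underlying container‑wise kernels differ.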
2.5 Multi‑Node Parallelism & Elastic Scaling
Sharding data by grouping rules, then encoding BitMaps and assigning groups enables distributed computation, reducing intermediate data transfer. Combined with WeOLAP’s compute‑storage separation, clusters can elastically scale BitMap processing during peak loads.
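One way to realize this colocation in ClickHouse is a Distributed table sharded by the grouping key, so all bitmaps for one group land on the same shard and set operations run locally. The cluster name below is hypothetical; it reuses the test.bm_64 table defined earlier:

```sql
-- Sketch: shard by hash of the grouping key so that bitmap operations
-- within a group execute on a single shard, avoiding the transfer of
-- large intermediate bitmaps between nodes.
CREATE TABLE test.bm_64_dist AS test.bm_64
ENGINE = Distributed('weolap_cluster', 'test', 'bm_64', cityHash64(id));
```

Each shard merges and intersects its own groups; only small final results (e.g., cardinalities) travel back to the initiator.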
3. Results
3.1 Encoding Throughput
Real‑time encoding of one million rows adds less than 1.8 seconds on cold start and under 20 ms after dictionary warm‑up.
3.2 Computation Performance
Benchmarks on the test dataset show Bitmap64 operations improving from 108 s to 26 s (≈4×) after all optimizations, while Bitmap32 achieves sub‑second latency.
3.3 Online Data Impact
In an article‑analysis scenario (Bitmap64, ~2 k BitMaps per group), single‑node acceleration matched the test results. In a retention‑analysis platform (Bitmap32, millions of BitMaps over 7‑14 days), similar speedups were observed.
4. Summary
Database performance tuning is a meticulous process; for a high‑performance engine like ClickHouse, iterative profiling, code review, and demos are essential. The most rewarding moments are the “Eureka” insights that finally explain the bottlenecks. Early ClickHouse deployments used Bitmap32 without noticeable issues; as business needs grew to UInt64‑based identifiers, Bitmap64 performance degraded, prompting the BitBooster optimizations described here. The WeOLAP team has contributed over 100 PRs to the ClickHouse community, positioning Tencent as a leading open‑source contributor.
WeChat Backend Team
Official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.