
Boost ClickHouse Bitmap Queries 10x with BitBooster: Techniques & Results

This article explains how the BitBooster suite accelerates ClickHouse bitmap (BitMap) queries by up to tenfold, covering background, performance bottlenecks, single‑node and read optimizations, layout and instruction‑set enhancements, encoding dictionaries, multi‑node scaling, and real‑world benchmark results.

WeChat Backend Team
Recommendation systems demand strong personalization, driving the creation of sub‑second analytical engines. WeChat's big‑data platform built the WeOLAP sub‑second query engine on ClickHouse and introduced a series of performance optimizations called BitBooster.

1. Background

To achieve second‑level interactive analysis for scenarios such as audience estimation and retention analysis, Join operations are often replaced by BitMap calculations, yielding several‑fold speedups. In practice, however, ClickHouse’s BitMap operations sometimes degrade, causing severe tail latency.

BitBooster is the collective name for a continuously iterated set of optimizations applied to Bitmap queries. It can accelerate related analyses by up to 10×.

1.1 Performance First

Before deployment: a single‑group Bitmap64 merge took over 100 seconds.

After deployment: real‑time access was achieved; the same single‑group Bitmap64 merge took only 0.3 seconds, a performance increase of over 300×.

Note: the above case uses Bitmap with UInt64 elements, where the effect is especially pronounced.

1.2 Business Scenarios

Audience Estimation: Filter ad‑delivery users, estimate hit counts, assist budgeting.

Behavior Analysis: Retention, differential analysis, etc.

Experiment Analysis: A/B testing before model rollout.

Red‑dot Delivery: Operational activities, impact assessment.

These online, highly interactive analyses typically require sub‑second response times. For audience estimation, users are selected based on attribute conditions, essentially performing fast set operations.
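The "fast set operations" framing can be made concrete with a small sketch. This is illustrative Python, not ClickHouse code: each attribute audience is a plain set of user ids standing in for a bitmap, and estimating the target audience is just intersection plus a cardinality count (the role ClickHouse's `bitmapAndCardinality` plays in SQL).

```python
# Conceptual sketch: sets of user ids stand in for bitmaps.
# All ids below are hypothetical.
likes_sports = {1, 2, 3, 5, 8, 13}   # audience for attribute A
age_18_25 = {2, 3, 5, 7, 11, 13}     # audience for attribute B
in_region = {3, 5, 13, 21}           # audience for attribute C

# Audience estimation = intersect the attribute audiences, count the result.
target = likes_sports & age_18_25 & in_region
print(len(target))  # estimated hit count: 3
```

The same query shape scales to billions of users only because bitmaps make the intersection and the count cheap, which is exactly what the rest of the article optimizes.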

1.3 Why Is Native Bitmap Slow?

“RoaringBitmap is excellent in performance and space utilization and is widely used in mature open‑source software (Spark, Druid, Kylin).”

ClickHouse’s kernel uses RoaringBitmap as the underlying Bitmap structure, which is highly compressed. In many cases RoaringBitmap is fast, but ClickHouse’s BitMap operations can suffer from data skew, insufficient read parallelism, sparse storage layout, lack of vectorized execution, and under‑utilized cluster resources.

2. Key Technologies

2.1 Single‑Node Computation Optimization

ClickHouse parallelizes SQL execution via a DAG and a thread pool. Each thread traverses the DAG, executing any ready node. However, a DAG node can be processed by only one thread at a time, so a slow Pipe can become a bottleneck.

Because Bitmap data can be highly skewed, some Pipes handle far more BitMap elements than others, leading to data‑skew‑induced stragglers. To mitigate this, we added a Repartition stage to the DAG for Bitmap‑related queries, redistributing work among Pipes when certain conditions are met.

The Repartition step yields a 10‑20% performance gain in skewed scenarios.

-- Query S1 example
CREATE TABLE test.bm_64 (
  ds Date,
  bm AggregateFunction(groupBitmap, UInt64),
  id String
) ENGINE = MergeTree
PARTITION BY ds
ORDER BY id;

INSERT INTO test.bm_64
SELECT '2022-06-10', groupBitmapState(rand64()), 'rand'
FROM numbers(10000000);

-- force_repartition_after_reading_from_merge_tree is a setting introduced
-- by the Repartition optimization described above, not an upstream default.
SELECT ds, bitmapCardinality(bm)
FROM test.bm_64
WHERE id = 'rand'
SETTINGS force_repartition_after_reading_from_merge_tree = 1;

2.2 Read Optimization

Profiling showed that two of three execution threads spent most of their time waiting, while one thread was busy reading BitMaps. ClickHouse reads data in Mark‑level units (default granularity 8192 rows). This coarse granularity limits parallelism for large BitMaps.

We introduced an asynchronous deserialization interface for large BitMaps, allowing other threads to handle element‑wise deserialization, which reduced thread wait time noticeably.
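The shape of that asynchronous interface can be sketched as follows. This is an analogy, not ClickHouse's actual API: one reader streams raw serialized blobs (standing in for Mark-level reads of large BitMaps) while a pool of worker threads does the element-wise deserialization in parallel, instead of the reader doing both jobs serially.

```python
# Sketch of the async-deserialization idea (assumed shape, not ClickHouse's
# real interface): overlap reading with parallel deserialization.
from concurrent.futures import ThreadPoolExecutor
import pickle

def read_blobs():
    # Stand-in for Mark-level reads: yield serialized bitmap blobs.
    for part in range(4):
        yield pickle.dumps(set(range(part * 1000, part * 1000 + 1000)))

def deserialize(blob):
    # Element-wise deserialization, offloaded from the reader thread.
    return pickle.loads(blob)

# Worker threads deserialize while the reader keeps producing blobs.
with ThreadPoolExecutor(max_workers=3) as pool:
    bitmaps = list(pool.map(deserialize, read_blobs()))

total = sum(len(b) for b in bitmaps)
print(total)  # 4000 elements across 4 deserialized bitmaps
```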

2.3 Layout Optimization

RoaringBitmap stores 32‑bit elements using a two‑level index (high 16 bits as key, low 16 bits in a Container). Sparse Containers (Array) lead to slower operations. For 64‑bit elements, an additional TreeMap layer makes the layout even sparser, causing severe performance degradation.
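The two-level split described above can be demonstrated in a few lines of Python (a simplified model of Roaring's layout, not the real data structure): the high 16 bits of each value select a container, and only the low 16 bits are stored inside it. Dense data packs into few containers; sparse data scatters across many tiny ones, which is the slow case.

```python
# Simplified model of RoaringBitmap's two-level 32-bit layout.
from collections import defaultdict

def roaring_layout(values):
    containers = defaultdict(set)
    for v in values:
        containers[v >> 16].add(v & 0xFFFF)  # key = high 16 bits
    return containers

dense = roaring_layout(range(1000))                      # contiguous ids
sparse = roaring_layout(range(0, 1000 << 16, 1 << 16))   # one id per 64K

# Dense data fits in one container; sparse data needs 1000 of them.
print(len(dense), len(sparse))  # 1 1000
```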

Encoding the BitMap into a compact sequential representation dramatically improves both storage and computation speed. We added an EncodedDictionary that integrates with ClickHouse’s existing dictionary framework, exposing a new function bitmapDictEncode to transform an unencoded BitMap into a dense encoded one.

2.4 Instruction‑Set Optimization

By enabling AVX2/AVX‑512 vectorization, bitmap logical operations (AND, OR, XOR, ANDNOT) gain 3.5‑5× speedups. Compilation flags such as -DENABLE_AVX2=1 and -DENABLE_AVX2_FOR_SPEC_OP=1 activate these optimizations.
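The source of the speedup is word-level parallelism: one AND/OR/XOR/ANDNOT instruction processes many bitmap words at once. This Python sketch shows the same idea one level down, operating on whole 64-bit words instead of individual elements; AVX2/AVX-512 simply widens each operation to 256 or 512 bits.

```python
# One bitwise op per 64-bit word covers 64 membership tests at once;
# SIMD widens this to 256/512 bits per instruction. Values are arbitrary.
a_words = [0xFFFF0000FFFF0000, 0x0000000000000FFF]
b_words = [0x00FFFF0000FFFF00, 0x0000000000000F0F]

# Intersection of two bitmaps = AND over their words.
and_words = [a & b for a, b in zip(a_words, b_words)]

# Cardinality = total population count of the result words.
cardinality = sum(bin(w).count("1") for w in and_words)
print(cardinality)  # 24
```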

2.5 Multi‑Node Parallelism & Elastic Scaling

Sharding data by grouping rules, then encoding BitMaps and assigning groups enables distributed computation, reducing intermediate data transfer. Combined with WeOLAP’s compute‑storage separation, clusters can elastically scale BitMap processing during peak loads.
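The multi-node scheme can be sketched under an assumed grouping rule (the article does not specify the actual rule; hashing by id modulo the shard count is used here only for illustration): each node owns a disjoint id range, set operations run locally, and only small per-shard results cross the network.

```python
# Sketch of shard-local bitmap computation. The grouping rule (id mod N)
# is an assumption for illustration only.
N_SHARDS = 4

def shard_of(user_id):
    return user_id % N_SHARDS

audience_a = set(range(0, 100))
audience_b = set(range(50, 150))

# Each shard intersects only its own slice; shards are disjoint,
# so per-shard cardinalities simply add up.
local_counts = []
for s in range(N_SHARDS):
    a_s = {u for u in audience_a if shard_of(u) == s}
    b_s = {u for u in audience_b if shard_of(u) == s}
    local_counts.append(len(a_s & b_s))

print(sum(local_counts))  # 50, same as the global intersection
```

Only the integer counts are combined at the end, which is the "reduced intermediate data transfer" the section refers to.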

3. Results

3.1 Encoding Throughput

Real‑time encoding of one million rows adds less than 1.8 seconds on cold start and under 20 ms after dictionary warm‑up.

3.2 Computation Performance

Benchmarks on the test dataset show Bitmap64 operations improving from 108 s to 26 s (≈4×) after all optimizations, while Bitmap32 achieves sub‑second latency.

3.3 Online Data Impact

In an article‑analysis scenario (Bitmap64, ~2 k BitMaps per group), single‑node acceleration matched the test results. In a retention‑analysis platform (Bitmap32, millions of BitMaps over 7‑14 days), similar speedups were observed.

4. Summary

Database performance tuning is a meticulous process; for a high‑performance engine like ClickHouse, iterative profiling, code review, and demos are essential. The most rewarding moments are the “Eureka” insights that finally explain the bottlenecks. Early ClickHouse deployments used Bitmap32 without noticeable issues; as business needs grew to UInt64‑based identifiers, Bitmap64 performance degraded, prompting the BitBooster optimizations described here. The WeOLAP team has contributed over 100 PRs to the ClickHouse community, positioning Tencent as a leading open‑source contributor.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Performance, Optimization, ClickHouse, Bitmap

Written by WeChat Backend Team

Official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.