
From BitMap to RoaringBitmap: Principles, Performance, and Big Data Applications

RoaringBitmap improves on the traditional BitMap by splitting the value space into lazily allocated containers and switching dynamically among array, bitmap, and run representations to compress sparse data. The resulting fast, exact set operations power big-data systems such as Kylin and ClickHouse, and drive Bilibili's user-visit and crowd-package pipelines with dramatically lower memory use and processing latency.

Bilibili Tech

When dealing with massive data, fast evaluation, computation, and intermediate storage are required. Probabilistic structures such as HyperLogLog and BloomFilter can quickly estimate cardinalities with small storage, but they do not provide exact counts.

BitMap addresses this need and has long been used in data processing and search engines. In Java, BitSet can replace HashSet for exact numeric deduplication, and Redis offers the setbit and getbit commands. However, BitMap suffers from two obvious problems:

Its internal long[] array grows with the maximum value stored, so a large value domain (e.g., Long.MAX_VALUE) drives memory consumption into the megabyte or even gigabyte range and can cause OOM.

Storing a sparse set over a domain of billions of values still requires allocating the full bitmap (e.g., covering the full 32-bit domain of ~4 billion values needs ~512 MB even for a handful of entries), which defeats the purpose of sparse storage.
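To make the cost concrete: a plain bitmap's footprint depends only on the largest value it must cover, not on how many values are actually set. A minimal Python sketch of the arithmetic behind the ~512 MB figure:

```python
def bitmap_bytes(max_value: int) -> int:
    """Memory (in bytes) a dense bitmap needs to cover values 0..max_value."""
    bits = max_value + 1          # one bit per possible value
    return (bits + 7) // 8        # round up to whole bytes

# Covering the full 32-bit domain costs 512 MB even if only one value is set.
print(bitmap_bytes(2**32 - 1) // (1024 * 1024))  # 512
```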

To compress sparse bitmaps, several algorithms have been proposed, including WAH, EWAH, Concise, and RoaringBitmap. The first three rely on run‑length encoding (RLE); RoaringBitmap is an improved version and is the focus of this article.

RoaringBitmap (RBM) has two versions for 32‑bit and 64‑bit integers. For the 32‑bit version, the 32‑bit space is split into high‑16‑bit container IDs and low‑16‑bit container contents. Containers are allocated lazily, only when needed.
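The split itself is simple bit arithmetic: the upper 16 bits select a container, and the lower 16 bits are the value stored inside it. An illustrative Python sketch (not the library's actual code):

```python
def split(x: int) -> tuple[int, int]:
    """Split an unsigned 32-bit value into (container key, in-container value)."""
    return x >> 16, x & 0xFFFF

def join(hi: int, lo: int) -> int:
    """Reassemble the original 32-bit value."""
    return (hi << 16) | lo

hi, lo = split(131074)   # 131074 = 2 * 65536 + 2
print(hi, lo)            # container 2, offset 2
```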

Containers come in four types:

BitmapContainer : stores one 16-bit chunk as a 65 536-bit bitmap, with a fixed 8 KB footprint.

ArrayContainer : stores sparse values in an array; it can hold up to 4 096 values with size (2+2*c)B (c = cardinality).

RunContainer : a later addition to the format that compresses consecutive sequences using RLE (e.g., (11,4) for the run 11-14). Size is (2+4*r)B where r is the number of runs.

SharedContainer : allows multiple RBMs to share the same container, reducing duplication during copy operations.

Insertion rules:

Single values are stored in an ArrayContainer .

Continuous ranges (via addRange ) create a RunContainer .

Irregular sequences trigger an optimization that chooses between ArrayContainer and RunContainer based on space.

If an ArrayContainer exceeds 4 096 entries, it is promoted to a BitmapContainer . Conversely, a BitmapContainer that drops below 4 096 entries may be demoted.
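The 4 096 threshold falls out of the size formulas above: below it, an array at 2 bytes per value is smaller than the fixed 8 KB bitmap; above it, the bitmap wins. A simplified Python sketch of the space-based choice (the real library weighs these options during runOptimize; this model ignores per-object overhead):

```python
BITMAP_BYTES = 8192        # fixed BitmapContainer footprint (65 536 bits)
ARRAY_THRESHOLD = 4096     # above this, 2 bytes/value exceeds 8 KB

def best_container(cardinality: int, runs: int) -> str:
    """Pick the smallest container type for the given stats (simplified model)."""
    sizes = {
        "array": 2 + 2 * cardinality if cardinality <= ARRAY_THRESHOLD else float("inf"),
        "bitmap": BITMAP_BYTES,
        "run": 2 + 4 * runs,
    }
    return min(sizes, key=sizes.get)

print(best_container(100, 100))     # sparse, scattered values -> array
print(best_container(60000, 1))     # one huge consecutive run  -> run
print(best_container(30000, 9000))  # dense but fragmented      -> bitmap
```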

Because of its excellent storage and computation characteristics, RBM is used in many open‑source big‑data projects such as Kylin, Doris, Iceberg, and ClickHouse. The article then presents two concrete B‑Station (Bilibili) use cases.

Application 1: Unified User‑Visit Mark Model

User‑device access metrics (new users, retention, frequency, MAU, etc.) are costly to compute and often scattered across different time windows. Three common solutions are compared:

Direct joins on the access table – flexible but high development cost.

External engines (e.g., ClickHouse Retention function) – fast for specific scenarios but requires data sync.

Transforming access logs into a generic bitmap (0/1) stored as an array. Each partition holds two elements representing 30‑day and 60‑day windows, enabling O(1) checks for recent activity.
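The idea in the third option can be sketched with plain integers as bit vectors: bit i records whether the device was active i days before the partition date, so "active in the last N days" is a single mask-and-test. A hypothetical Python sketch (function names are illustrative, not from the article):

```python
def mark_active(window: int, days_ago: int) -> int:
    """Set the bit for an access `days_ago` days before the partition date."""
    return window | (1 << days_ago)

def active_within(window: int, n_days: int) -> bool:
    """O(1) check: any activity in the most recent n_days?"""
    mask = (1 << n_days) - 1
    return (window & mask) != 0

w = mark_active(0, 0)            # visited today
w = mark_active(w, 29)           # also visited 29 days ago
print(active_within(w, 7))       # True: there was a visit today
print(active_within(w >> 1, 7))  # False: shifting right ages the window one day
```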

The bitmap‑based solution is further refined using RBM. By storing the entire history in an RBM structure, any day’s retention, new users, DAU, MAU, or frequency can be computed on‑the‑fly without pre‑aggregation. The article shows a diagram of the RBM‑based model and mentions custom UDFs that operate on the RBM.
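With one bitmap of active device IDs per day, the standard metrics reduce to set algebra: DAU is a cardinality, MAU is the cardinality of a union over the window, and next-day retention is an intersection. A Python sketch using plain sets as a stand-in for per-day RBMs (the real pipeline uses RBM UDFs; the dates and IDs are made up):

```python
# One set of active device IDs per day (stand-in for one RoaringBitmap per day).
day = {
    "2024-01-01": {1, 2, 3, 4},
    "2024-01-02": {2, 3, 5},
    "2024-01-03": {5, 6},
}

dau = len(day["2024-01-02"])                           # daily active users
mau = len(set().union(*day.values()))                  # union over the window
retained = len(day["2024-01-01"] & day["2024-01-02"])  # next-day retention
new_users = day["2024-01-02"] - day["2024-01-01"]      # not seen the day before

print(dau, mau, retained, sorted(new_users))  # 3 6 2 [5]
```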

Application 2: Fast Synchronization of Crowd Packages

Titan, B‑Station’s tag‑profile platform, uses RBM to store enumerated tags, accelerate set operations, and keep crowd packages for downstream analysis. The traditional pipeline syncs Hive tag tables to ClickHouse, materializes per‑value arrays, and finally creates an RBM table via a materialized view. This process is resource‑intensive for tags with billions of users.

A new approach creates RBM crowd packages directly in the offline engine and syncs the binary representation (base64) to ClickHouse, cutting the end‑to‑end latency from 50 minutes to 8 minutes for a 1 billion‑user crowd.
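The offline-to-ClickHouse handoff boils down to: build the bitmap offline, serialize it to bytes, base64-encode it, ship the string, and decode on the other side. A Python sketch of the round trip, using a placeholder byte blob in place of RBM's serialized form (the real pipeline serializes with the RoaringBitmap library):

```python
import base64

def encode_package(serialized_bitmap: bytes) -> str:
    """Base64-encode a serialized bitmap so it travels as a plain string column."""
    return base64.b64encode(serialized_bitmap).decode("ascii")

def decode_package(encoded: str) -> bytes:
    """Reverse step, mirroring base64Decode(encode) in the ClickHouse table."""
    return base64.b64decode(encoded)

blob = b"\x3a\x30\x00\x00\x01\x00"  # placeholder bytes, not a real RBM payload
text = encode_package(blob)
assert decode_package(text) == blob
print(text)
```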

For 64-bit RBM, ClickHouse 21.1+ uses CRoaring's roaring64map. The serialized format consists of:

Type flag (Byte) – distinguishes SmallSet (≤32 values) from RBM.

Data length (VarInt).

High‑32‑bit cardinality (Long).

Actual data (ByteArray of RoaringBitmap).
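A hedged sketch of writing that header layout. The varint here is unsigned LEB128, and the flag value, field order, and endianness are assumptions drawn only from the list above, not from the ClickHouse source:

```python
import struct

def write_varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as assumed for the data-length field."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def pack_rbm64(payload: bytes, high32_count: int) -> bytes:
    """Lay out: type flag (1 = RBM, assumed), varint payload length,
    high-32-bit cardinality (assumed little-endian Long), then the payload."""
    header = bytes([1]) + write_varint(len(payload))
    header += struct.pack("<q", high32_count)
    return header + payload

packed = pack_rbm64(b"\x00" * 4, high32_count=1)
print(packed[0], len(packed))  # 1 14
```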

Java's RoaringBitmap library provides Roaring64NavigableMap , which stores data in a TreeMap-like hierarchy keyed by the high 32 bits. The article includes a code snippet that obtains the internal high-bits map via reflection (the generic type below is reconstructed from context):

Method method = rbClass.getDeclaredMethod("getHighToBitmap");
method.setAccessible(true);
NavigableMap<Integer, BitmapDataProvider> highToBitmap =
    (NavigableMap<Integer, BitmapDataProvider>) method.invoke(rb);

It also provides a ClickHouse table definition for storing RBM crowd packages:

-- Create the ClickHouse crowd test table
CREATE TABLE new_titan.ads_titan_abtest_crowd_rbm_local on cluster bdp_tag_cluster (
  log_date Date COMMENT 'date',
  crowd_id UInt64 COMMENT 'crowd ID',
  type Int8  COMMENT 'crowd type, ABTest: 3',
  encode String COMMENT 'serialized bitmap from Hive',
  idx_bitmap AggregateFunction(groupBitmap, UInt64)
      MATERIALIZED base64Decode(encode),
  ctime DEFAULT now()
) Engine = ReplicatedAggregatingMergeTree()
PARTITION BY toYYYYMMDD(log_date)
PRIMARY KEY (crowd_id)
ORDER BY (crowd_id, log_date)
TTL log_date + INTERVAL 7 DAY
SETTINGS index_granularity = 5;

With this setup, a 1 billion‑user crowd package’s processing time dropped by 85 % while the heavy lifting (UDF conversion) accounted for only a few seconds of the total pipeline.

Conclusion

The article explains the fundamentals and performance advantages of RBM, compares it with traditional BitMap, and demonstrates how B‑Station leverages RBM in several big‑data scenarios to reduce cost, improve efficiency, and simplify downstream consumption. It also suggests further directions such as using RBM to compress Redis bitmaps, accelerate Flink state deduplication, and enable real‑time feature‑group existence checks.

Tags: ClickHouse, Hive, RoaringBitmap, Big Data, Data Structures, Bitmap Compression
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.