Databases 23 min read

Applying ClickHouse Bitmap and BSI Techniques for Real-Time Audience Selection in a Data Management Platform

By integrating ClickHouse bitmap structures, a dictionary service for dense ID mapping, and Bit‑Slice Indexes, Bilibili’s Data Management Platform now supports flexible, multi‑dimensional audience selection and profiling over petabyte‑scale data with minute‑level latency, cutting storage by over twenty‑fold and query times from hours to seconds.

Bilibili Tech
Bilibili Tech
Bilibili Tech
Applying ClickHouse Bitmap and BSI Techniques for Real-Time Audience Selection in a Data Management Platform

Business Background

DMP (Data Management Platform) provides the advertising department with Bilibili user data management, including user tag collection, tag market construction, audience package selection, audience profiling, and audience/feature targeting. The core functions are audience package selection and profiling, which demand high design, performance, scalability, and maintainability. ClickHouse bitmap capabilities are used to achieve real‑time audience estimation and minute‑level profiling.

1.1 Audience Selection

Audience selection allows users to define rules based on tags (e.g., gender, age, region) and retrieve the matching user set. The system must output the count and the audience package in real time. Challenges include flexible rule combinations over hundreds of tags and PB‑scale raw data, making direct Hive queries infeasible.

1.2 Audience Profiling

After an audience package is built, profiling calculates distributions of age, gender, region, interests, etc. This requires intersecting the audience package IDs with each tag’s ID set and computing percentages or TGI. The original offline Hive solution took ~2 hours per package; ClickHouse bitmap reduced this to minutes.

2 Bitmap‑Based Audience Selection

Bitmap is a data structure representing a set of integers as a sequence of bits. Bilibili user IDs are 64‑bit integers, so a bitmap can efficiently represent an audience package. By creating a bitmap for each tag value, audience selection becomes a series of bitmap AND/OR/NOT operations.

Technical challenges addressed:

Support flexible, multi‑dimensional tag rules over hundreds of tags and PB‑scale data.

Handle daily creation of thousands of audience packages.

Additional engineering solutions include:

Scalable tag selection supporting >300 tags and >1 PB data.

Bidirectional data flow between the offline warehouse and ClickHouse.

Optimized bitmap storage to reduce size.

DSL on top of ClickHouse for easier usage.

2.1 Bitmap Issues in Practice

Roaring Bitmap performance degrades with sparse, non‑continuous IDs, leading to large bitmap sizes and high CPU/I/O costs.

2.2 Non‑Discrete Tag Selection Limitations

Non‑discrete metrics (e.g., ad exposure counts) require aggregation over time windows. The original approach pre‑aggregates fixed windows (30, 60, 90 days), causing data redundancy and limited flexibility.

Example Hive‑style query for non‑discrete metrics:

SELECT id FROM (SELECT id, unit_id FROM ad_event_table WHERE
event='show' -- 曝光
AND unit_id IN (11111, 22222) -- 单元id列表
AND log_date >= '${yyyyMMdd, -360d}' -- 近一年
GROUP BY unit_id, id -- 按id + unit_id聚合
HAVING COUNT(1) > N -- 找出每个单元曝光次数大于N次的id ) t0
GROUP BY id -- 所有单元曝光次数> N的id进行去重

Running this directly on Hive would take minutes to hours.

ClickHouse query with pre‑aggregated bitmap tags:

SELECT groupBitmapOr(uid_index) AS `uid_index` -- id bitmap
FROM tag_index_table
WHERE (tag_name='ad_show') -- 广告曝光
AND (log_date='${yyyyMMdd}') -- 每天更新全量数据
AND (tag_value IN ('360D-11111','360D-22222')) -- 事先聚合好360天的单元曝光,把范围查询变成几个固定时间区间的点查
AND (metric>=N) -- 曝光指标 > N

While this meets second‑level latency, it still suffers from data redundancy and limited time‑range flexibility.

3 ClickHouse Dictionary Service

The dictionary service assigns sequential IDs to strings (e.g., device IDs, buvid), improving bitmap density and enabling non‑integer tag selection.

Data flow:

Original ID table → dictionary mapping (offline Hive export + RPC fallback) → bitmap aggregation.

Optimizations include:

90%+ offline mapping reduces online RPC pressure.

Shard IDs by high 48 bits to keep contiguous IDs in the same ClickHouse shard, saving ~30% storage and doubling speed.

4 BSI (Bit‑Slice Index) Overview

BSI stores a set of bitmaps, each representing one bit position of an integer metric. It enables efficient queries on continuous‑value metrics.

BSI implementation in ClickHouse:

CREATE TABLE test.bsi
(
`log_date` Date,
...
`ck_bucket` UInt32,
`bsi_agg` AggregateFunction(bsi_merge_agg, BSI)
)
ENGINE = AggregatingMergeTree
PARTITION BY log_date
ORDER BY ck_bucket
TTL ...

BSI functions such as bsi_build , bsi_filter , bsi_sum , bsi_range , bsi_topk , and aggregation functions bsi_add_agg , bsi_merge_agg are provided.

5 BSI + Dictionary Service in DMP

BSI is used to represent continuous metrics, allowing arbitrary time‑range audience selection without pre‑aggregation.

BSI table definition and query example:

CREATE TABLE tag_bitmap_bsi
(
`tag_name` String,
`tag_value` String,
`log_date` Date,
`sp_bucket` UInt32,
`sk_bucket` UInt32,
`ck_bucket` UInt32,
`bsi_agg` AggregateFunction(bsi_merge_agg, BSI)
)
ENGINE = ReplicatedAggregatingMergeTree(...)
PARTITION BY (toYYYYMMDD(log_date), tag_name)
ORDER BY (sp_bucket, tag_value, ck_bucket)
TTL ...
SELECT groupBitmapOr(bsi_ge(bsi_agg, N)) AS `uid_index` -- 180天内曝光次数> N次的id组成的bitmap
FROM
(
SELECT bsi_add_agg(bsi_agg) AS `bsi_agg`
FROM
(
SELECT tag_value, bsi_merge_aggMerge(bsi_agg) AS `bsi_agg` -- 一天之内的指标合并(去重)
FROM tag_bitmap_index_mapped_bsi
WHERE (tag_name = 'ad_show') AND (log_date > '${yyyyMMdd}' - INTERVAL 180 DAY -- 近180天 ) AND (tag_value IN ('11111', '22222'))
GROUP BY tag_value, log_date
)
GROUP BY tag_value -- 最终累加出180天内所有的指标
)

Performance gains:

Storage cost reduced by ~21× compared with pre‑aggregated bitmap.

Query latency improved from ~20 min to ~3 min for audience profiling; overall p50 latency ~500 ms, p90 ~5 s.

Write latency reduced by >2× for p90 and 1.5× for p50.

Conclusion

The integration of ClickHouse bitmap, dictionary service, and BSI dramatically enhances DMP capabilities, offering flexible, real‑time audience selection, lower storage overhead, and faster query response.

Big DataClickHouseBitmapaudience segmentationBSIData Management Platform
Bilibili Tech
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.