
How ByteHouse’s Bitmap Engine Supercharges Real‑Time Audience Segmentation

This article explains how ByteHouse leverages a native Bitmap data type and dictionary encoding to accelerate real‑time audience segmentation queries in advertising scenarios, achieving up to 50× performance gains over traditional array‑based models.

Volcano Engine Developer Services

Background

As traffic growth slows, advertisers are shifting from broad, high-volume campaigns to fine-grained marketing. That means selecting the most promising audiences from billions of users, which puts heavy pressure on data warehouse performance.

ClickHouse's high-performance, distributed architecture makes it a popular choice for large-scale analytics. ByteDance built ByteHouse, a cloud-native data warehouse based on open-source ClickHouse, to support real-time and offline analysis for advertising workloads.

Audience Segmentation Scenario

Audience selection is a core function of Customer Data Platforms (CDP). Analysts combine various tags to create target groups for precise ad delivery, often iterating many times to refine the best audience package. This leads to two main issues:

Offline pre‑computation cannot handle the massive number of possible tag combinations.

Real‑time queries can take minutes, which is too slow for analysts.

On a billion-user test set, ByteHouse answers these queries with a P99 latency under 10 seconds.

Data Model Evolution

Traditional user‑centric storage keeps one row per user with many columns (e.g., user_id, sex, age, tags). Filtering by tag combinations forces a full table scan, causing performance to degrade as users and tags grow.

Switching to a tag‑centric model retains only dimensions relevant to audience selection. Each tag stores an array of user IDs (active_users), dramatically reducing row count and data size.

In this model, selecting users for a tag combination becomes a set‑operation (intersection, union, difference), offering substantial speed improvements.
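The tag-centric lookup can be illustrated with a minimal Python sketch. The data, the tag names, and the helper `select_users` are made up for illustration, and built-in sets stand in for the per-tag user arrays; this is not how ByteHouse stores or evaluates the data.

```python
# Hypothetical toy data: the tag-centric model keeps one entry per tag,
# each holding the IDs of the users that carry the tag.
tag_to_users = {
    "tag_1": {2, 4, 6, 8},
    "tag_2": {1, 2, 3, 4},
    "tag_3": {4, 5, 6},
}

def select_users(include_all, exclude=()):
    """Users that carry every tag in include_all and no tag in exclude.

    include_all must name at least one tag.
    """
    result = set.intersection(*(tag_to_users[t] for t in include_all))
    for t in exclude:
        result -= tag_to_users[t]
    return result

both = select_users(["tag_1", "tag_2"])                 # intersection -> {2, 4}
either = tag_to_users["tag_1"] | tag_to_users["tag_2"]  # union -> {1, 2, 3, 4, 6, 8}
only_1 = select_users(["tag_1"], exclude=["tag_3"])     # difference -> {2, 8}
```

Only the entries for the tags named in the query are touched, so the cost depends on the size of those tag sets rather than on the total number of user rows.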

ByteHouse Bitmap Type

Using the Bitmap type, the storage schema changes to:

CREATE TABLE id_tags (
    tags String,
    active_users BitMap64
) Engine = CnchMergeTree() ORDER BY tags

Queries that previously required multiple sub‑queries can now be expressed with a single bitmap operation, e.g.:

SELECT bitmapCount('tag_1&tag_2') FROM id_tags

This reduces scanning to a single pass and yields 10‑50× performance gains in multi‑tag scenarios.
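To see why a bitmap turns a tag combination into a single cheap operation, here is a rough Python sketch that packs user IDs into one integer per tag, one bit per ID. The helpers `to_bitmap` and `cardinality` and the data are hypothetical; the sketch shows the general idea of bitwise set operations and makes no claim about ByteHouse's internal evaluation.

```python
def to_bitmap(user_ids):
    """Pack small non-negative user IDs into one Python int, one bit per ID."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap

def cardinality(bitmap):
    """Count the set bits, i.e. the number of users in the bitmap."""
    return bin(bitmap).count("1")

# Hypothetical data: one bitmap per tag.
bitmaps = {
    "tag_1": to_bitmap([2, 4, 6, 8]),
    "tag_2": to_bitmap([1, 2, 3, 4]),
}

# tag_1 & tag_2: one bitwise AND followed by a bit count.
count_both = cardinality(bitmaps["tag_1"] & bitmaps["tag_2"])    # 2
# tag_1 | tag_2: union is a bitwise OR.
count_either = cardinality(bitmaps["tag_1"] | bitmaps["tag_2"])  # 6
```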

Data Ingestion

Inserting data into a bitmap table is similar to inserting into a regular table; the array of user IDs is automatically converted to a BitMap64.

INSERT INTO id_tags VALUES ('tag_1', [2,4,6]), ('tag_2', [1,3,5])

ByteHouse also supports bulk imports via offline (TOS, LASFS) and streaming (Kafka, Flink) pipelines, all handling Bitmap data natively.

Related Functions

ByteHouse provides column functions such as bitmapColumnAnd for AND‑operations across bitmap columns and bitmapColumnCardinality for counting distinct elements.

Bitmap Engine Principles

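A plain, uncompressed bitmap needs one bit for every possible ID: 2³² bits (512 MB) to cover the 32-bit ID space, no matter how few users are actually present. That is impractical when every tag needs its own bitmap. Roaring Bitmaps compress sparse ID sets by splitting each 32-bit value into its high 16 bits, which select a bucket, and its low 16 bits, which are stored inside that bucket. Empty buckets are omitted, and each non-empty bucket uses either an array container (for sparse data) or a bitmap container (for dense data).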
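The container layout can be sketched in Python as follows. This is a simplified model, not ByteHouse's or any library's implementation: `ToyRoaring32` is a hypothetical class, a Python set stands in for the sorted 16-bit array of a real array container, run containers are left out, and the 4096-element switch point follows the standard Roaring design.

```python
ARRAY_LIMIT = 4096  # an array container holds at most this many values

class ToyRoaring32:
    """Simplified Roaring-style bitmap for 32-bit IDs (no run containers)."""

    def __init__(self):
        # high 16 bits -> container; buckets without values are never created
        self.containers = {}

    def add(self, value):
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.setdefault(high, set())
        if isinstance(container, set):
            container.add(low)
            if len(container) > ARRAY_LIMIT:
                # Dense bucket: switch to a fixed 65,536-bit bitmap (8 KB).
                bits = bytearray(8192)
                for v in container:
                    bits[v >> 3] |= 1 << (v & 7)
                self.containers[high] = bits
        else:
            container[low >> 3] |= 1 << (low & 7)

    def __contains__(self, value):
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.get(high)
        if container is None:
            return False
        if isinstance(container, set):
            return low in container
        return bool(container[low >> 3] & (1 << (low & 7)))

    def container_kinds(self):
        """Report which container type each non-empty bucket uses."""
        return {h: ("array" if isinstance(c, set) else "bitmap")
                for h, c in self.containers.items()}

rb = ToyRoaring32()
for uid in (7, 70_000, 3_000_000_000):
    rb.add(uid)   # three scattered IDs -> three small array containers
for uid in range(131_072, 131_072 + 5_000):
    rb.add(uid)   # 5,000 IDs in one bucket -> one bitmap container

# Size of a plain bitmap over the full 32-bit space, in MiB.
plain_bitmap_mib = 2**32 // 8 // 2**20  # 512
```

In this toy run only four of the 65,536 possible buckets exist, and just one of them pays the fixed 8 KB cost of a bitmap container.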

For 64-bit IDs, ByteHouse uses the high 32 bits as the key of a map and stores the low 32 bits in the Roaring bitmap associated with that key.
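A small Python sketch of this two-level split is shown below. `ToyBitmap64` is a hypothetical name, and a plain set stands in for each 32-bit Roaring bitmap; the point is only the high/low decomposition described above, not ByteHouse's actual data structure.

```python
from collections import defaultdict

MASK32 = 0xFFFFFFFF

class ToyBitmap64:
    """64-bit ID set: a map from the high 32 bits to a set of low 32 bits."""

    def __init__(self):
        # A plain set stands in for a 32-bit Roaring bitmap here.
        self.parts = defaultdict(set)

    def add(self, value):
        self.parts[value >> 32].add(value & MASK32)

    def __contains__(self, value):
        part = self.parts.get(value >> 32)  # .get does not create the key
        return part is not None and (value & MASK32) in part

    def __len__(self):
        return sum(len(p) for p in self.parts.values())

    def intersect(self, other):
        """Intersect key by key; keys present on one side only are skipped."""
        result = ToyBitmap64()
        for high in self.parts.keys() & other.parts.keys():
            common = self.parts[high] & other.parts[high]
            if common:
                result.parts[high] = common
        return result

a = ToyBitmap64()
for v in (5, 2**32 + 5, 2**40 + 1):
    a.add(v)
b = ToyBitmap64()
for v in (9, 2**32 + 5, 2**40 + 2):
    b.add(v)

common = a.intersect(b)  # contains only 2**32 + 5
```

Because set operations run per key, two bitmaps only compare the 32-bit parts whose high halves match.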

Dictionary Optimization

Because user IDs are not sequential, they scatter across many buckets and produce many sparse array containers, which slows set operations. ByteHouse applies dictionary encoding to map original IDs to dense internal values, improving bitmap compression and query speed.

CREATE TABLE id_tags (
    tags String,
    active_users BitMap64 BitEngineEncode
) Engine = CnchMergeTree() ORDER BY tags

The dictionary is maintained internally and updates asynchronously as the base table changes.
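The effect of dictionary encoding on bucket density can be sketched in Python. Everything here is illustrative: `ToyDictionary` and `buckets_used` are hypothetical helpers, the IDs are synthetic, and assigning internal IDs in arrival order is an assumption rather than a description of ByteHouse's internal dictionary.

```python
class ToyDictionary:
    """Maps arbitrary original IDs to consecutive internal IDs 0, 1, 2, ..."""

    def __init__(self):
        self.to_internal = {}
        self.to_original = []

    def encode(self, original_id):
        internal = self.to_internal.get(original_id)
        if internal is None:
            # First time this ID is seen: hand out the next dense value.
            internal = len(self.to_original)
            self.to_internal[original_id] = internal
            self.to_original.append(original_id)
        return internal

    def decode(self, internal_id):
        return self.to_original[internal_id]

def buckets_used(ids):
    """Number of distinct Roaring buckets (high 16 bits) the IDs fall into."""
    return len({i >> 16 for i in ids})

# Hypothetical sparse IDs, spaced one million apart.
original_ids = [1_000_000 * k + 17 for k in range(1, 1001)]
dictionary = ToyDictionary()
encoded_ids = [dictionary.encode(i) for i in original_ids]

sparse_buckets = buckets_used(original_ids)  # 1000: one bucket per ID
dense_buckets = buckets_used(encoded_ids)    # 1: all IDs share a bucket
```

After encoding, the same 1,000 users sit in a single bucket instead of a thousand one-element containers, and `decode` recovers the original IDs when results are returned.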

Conclusion

Audience analysis is fundamental to CDP functionality. By leveraging ByteHouse’s native Bitmap type, dictionary encoding, and optimized functions, real‑time audience queries become dramatically faster, enabling interactive big‑data analytics on the Volcano Engine platform.
