How ByteHouse’s Bitmap Engine Supercharges Real‑Time Audience Segmentation
This article explains how ByteHouse leverages a native Bitmap data type and dictionary encoding to accelerate real‑time audience segmentation queries in advertising scenarios, achieving up to 50× performance gains over traditional array‑based models.
Background
As traffic growth slows, advertisers are shifting from broad, high‑volume campaigns to fine‑grained marketing. That means selecting the most promising audiences from billions of users, which puts significant pressure on data warehouse performance.
ClickHouse’s high‑performance, distributed architecture makes it a popular choice for large‑scale analytics. ByteDance built ByteHouse, a cloud‑native data warehouse based on open‑source ClickHouse, to support real‑time and offline analysis for advertising workloads.
Audience Segmentation Scenario
Audience selection is a core function of Customer Data Platforms (CDP). Analysts combine various tags to create target groups for precise ad delivery, often iterating many times to refine the best audience package. This leads to two main issues:
Offline pre‑computation cannot handle the massive number of possible tag combinations.
Real‑time queries can take minutes, which is too slow for analysts.
On a billion‑user test set, ByteHouse answers these queries with a P99 latency under 10 seconds.
Data Model Evolution
Traditional user‑centric storage keeps one row per user with many columns (e.g., user_id, sex, age, tags). Filtering by tag combinations forces a full table scan, causing performance to degrade as users and tags grow.
Switching to a tag‑centric model retains only dimensions relevant to audience selection. Each tag stores an array of user IDs (active_users), dramatically reducing row count and data size.
In this model, selecting users for a tag combination becomes a set‑operation (intersection, union, difference), offering substantial speed improvements.
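As a rough illustration of the tag‑centric model, the Python sketch below stores one set of user IDs per tag and builds an audience with plain set operations. The tag names, the data, and the select_audience helper are hypothetical and are not part of ByteHouse.

```python
# Hypothetical tag-centric table: tag -> set of user IDs (illustrative data).
tag_users = {
    "female": {1, 2, 3, 5, 8},
    "age_18_24": {2, 3, 4, 8, 9},
    "churned": {3, 9},
}

def select_audience(include_all, include_any=(), exclude=()):
    """Users holding every tag in include_all, at least one tag in
    include_any (when given), and no tag in exclude."""
    result = set.intersection(*(tag_users[t] for t in include_all))
    if include_any:
        result &= set.union(*(tag_users[t] for t in include_any))
    for t in exclude:
        result -= tag_users[t]
    return result

audience = select_audience(["female", "age_18_24"], exclude=["churned"])
print(sorted(audience))  # [2, 8]
```

Each tag is read once and combined in memory, so the cost depends on the number of tags in the expression rather than on the number of user rows.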
ByteHouse Bitmap Type
Using the Bitmap type, the storage schema changes to:
CREATE TABLE id_tags (
tags String,
active_users BitMap64
) Engine = CnchMergeTree() ORDER BY tags
Queries that previously required multiple sub‑queries can now be expressed with a single bitmap operation, e.g.:
SELECT bitmapCount('tag_1&tag_2') FROM id_tags
This reduces scanning to a single pass and yields 10‑50× performance gains in multi‑tag scenarios.
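To show what an expression such as tag_1&tag_2 amounts to, here is a minimal Python sketch that uses an integer as an uncompressed bitmap: bit i is set when user i carries the tag. The helpers to_bitmap and bitmap_count and the sample IDs are assumptions made for illustration; they only model the idea and say nothing about how ByteHouse implements bitmapCount.

```python
def to_bitmap(user_ids):
    """Pack non-negative user IDs into one integer: bit i is set when user i has the tag."""
    bits = 0
    for uid in user_ids:
        bits |= 1 << uid
    return bits

def bitmap_count(bits):
    """Number of set bits, i.e. the audience size."""
    return bin(bits).count("1")

# Hypothetical tag data, chosen only for illustration.
tag_1 = to_bitmap([1, 2, 4, 6, 9])
tag_2 = to_bitmap([2, 3, 6, 9, 10])

both = tag_1 & tag_2        # users holding both tags
print(bitmap_count(both))   # 3 (users 2, 6 and 9)
```

The intersection is a single bitwise AND over two values, which is why a multi‑tag query no longer needs one sub‑query per tag.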
Data Ingestion
Inserting data into a bitmap table is similar to inserting into a regular table; the array of user IDs is automatically converted to a Bitmap64.
INSERT INTO id_tags VALUES ('tag_1', [2,4,6]), ('tag_2', [1,3,5])
ByteHouse also supports bulk imports via offline (TOS, LASFS) and streaming (Kafka, Flink) pipelines, all handling Bitmap data natively.
Related Functions
ByteHouse provides column functions such as bitmapColumnAnd for AND‑operations across bitmap columns and bitmapColumnCardinality for counting distinct elements.
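The Python sketch below models the semantics described above, under the assumption that "across bitmap columns" means folding over every row of a bitmap column and that "counting distinct elements" means the size of the union. The function names column_and and column_cardinality are hypothetical stand‑ins, not the ByteHouse functions themselves.

```python
from functools import reduce

# Hypothetical rows of a bitmap column: one set of user IDs per tag row.
rows = [
    {1, 2, 3, 4},
    {2, 3, 4, 5},
    {3, 4, 6},
]

def column_and(bitmaps):
    """Fold AND over every row: users present in all bitmaps."""
    return reduce(lambda acc, b: acc & b, bitmaps)

def column_cardinality(bitmaps):
    """Distinct users across every row: size of the union (assumed semantics)."""
    return len(set().union(*bitmaps))

print(sorted(column_and(rows)))   # [3, 4]
print(column_cardinality(rows))   # 6
```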
Bitmap Engine Principles
A plain, uncompressed bitmap over a 32‑bit ID space needs 2³² bits (512 MB) per bitmap, no matter how few users it contains, which is impractical. Roaring Bitmaps compress sparse ID spaces by splitting the 32‑bit space into 2¹⁶ buckets keyed by the high 16 bits, each holding the low 16 bits of its IDs. Empty buckets are omitted, and each remaining bucket uses either an array container (for sparse data) or a bitmap container (for dense data).
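The following Python sketch walks through that layout. It assumes the standard Roaring threshold of 4096 values per array container (above that, a fixed 8 KB bitmap container is smaller), and it ignores the run containers that Roaring implementations also offer. The container_layout function is illustrative only.

```python
ARRAY_MAX = 4096  # a bucket with more values than this switches to a bitmap container

def container_layout(ids):
    """Group 32-bit IDs by their high 16 bits and pick a container type per bucket."""
    buckets = {}
    for x in ids:
        buckets.setdefault(x >> 16, set()).add(x & 0xFFFF)
    layout = {}
    for key, lows in buckets.items():
        kind = "array" if len(lows) <= ARRAY_MAX else "bitmap"
        # array: 2 bytes per value; bitmap: fixed 2**16 bits = 8192 bytes
        size = 2 * len(lows) if kind == "array" else 8192
        layout[key] = (kind, size)
    return layout

sparse = [7, 70_000, 1_000_000]            # three IDs in three different buckets
dense = range(131_072, 131_072 + 10_000)   # 10,000 consecutive IDs inside bucket 2

print(2**32 // 8 // 2**20)      # 512 (MiB for a flat 32-bit bitmap)
print(container_layout(sparse)) # {0: ('array', 2), 1: ('array', 2), 15: ('array', 2)}
print(container_layout(dense))  # {2: ('bitmap', 8192)}
```

Three scattered IDs cost a few bytes of payload instead of 512 MB, while a dense run is capped at 8 KB per bucket.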
For 64‑bit IDs, ByteHouse maps the high 32 bits to a map key and stores the low 32 bits in a Roaring bitmap.
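A minimal Python sketch of this two‑level layout follows. The class Bitmap64Sketch is hypothetical, and an ordinary Python set stands in for each 32‑bit Roaring bitmap; it is meant to show the key split, not ByteHouse's actual data structure.

```python
MASK32 = 0xFFFFFFFF

class Bitmap64Sketch:
    """Map from the high 32 bits to a set of low 32 bits.
    The set stands in for a 32-bit Roaring bitmap."""

    def __init__(self, ids=()):
        self.parts = {}
        for x in ids:
            self.add(x)

    def add(self, x):
        self.parts.setdefault(x >> 32, set()).add(x & MASK32)

    def __contains__(self, x):
        return (x & MASK32) in self.parts.get(x >> 32, ())

    def __and__(self, other):
        out = Bitmap64Sketch()
        # Only keys present on both sides can contribute to the intersection.
        for high in self.parts.keys() & other.parts.keys():
            common = self.parts[high] & other.parts[high]
            if common:
                out.parts[high] = common
        return out

    def __len__(self):
        return sum(len(lows) for lows in self.parts.values())

a = Bitmap64Sketch([5, (1 << 32) + 5, (1 << 32) + 9])
b = Bitmap64Sketch([5, (1 << 32) + 9, (3 << 32) + 1])
print(len(a & b))  # 2 (IDs 5 and (1 << 32) + 9)
```

Because set operations only touch map keys that both operands share, whole 32‑bit ranges can be skipped without inspecting their contents.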
Dictionary Optimization
Because user IDs are scattered rather than sequential, the bitmaps fragment into many sparsely filled array containers, which slows set operations. ByteHouse applies dictionary encoding to map original IDs to dense internal values, improving bitmap compression and query speed.
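The effect of dictionary encoding can be sketched in a few lines of Python. The encode and bucket_count helpers and the synthetic IDs are assumptions for illustration; the real dictionary is internal to ByteHouse and its format is not described here.

```python
def encode(ids, dictionary):
    """Assign each unseen original ID the next dense integer, then return the encoded IDs."""
    encoded = []
    for x in ids:
        if x not in dictionary:
            dictionary[x] = len(dictionary)
        encoded.append(dictionary[x])
    return encoded

def bucket_count(ids):
    """Number of distinct 16-bit buckets (high 16 bits) the IDs fall into."""
    return len({x >> 16 for x in ids})

# Hypothetical scattered user IDs: one every 100,000 across the 32-bit space.
original = list(range(0, 4_000_000_000, 100_000))
dictionary = {}
dense = encode(original, dictionary)

print(len(original))           # 40000
print(bucket_count(original))  # 40000 (every ID lands in its own bucket)
print(bucket_count(dense))     # 1 (all encoded IDs share one bucket)
```

For intersections to stay meaningful, every bitmap that takes part in a query has to be encoded with the same dictionary, and results must be decoded back to original IDs when the user list itself is needed.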
CREATE TABLE id_tags (
tags String,
active_users BitMap64 BitEngineEncode
) Engine = CnchMergeTree() ORDER BY tags
The dictionary is maintained internally and updates asynchronously as the base table changes.
Conclusion
Audience analysis is fundamental to CDP functionality. By leveraging ByteHouse’s native Bitmap type, dictionary encoding, and optimized functions, real‑time audience queries become dramatically faster, enabling interactive big‑data analytics on the Volcano Engine platform.
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
