How ClickHouse Powers Billion‑User Tagging with Efficient Bitmap Storage
This article explains how ClickHouse’s columnar storage, compression, and bitmap functions enable fast, scalable processing of billions of user tags and groups in a CDP, covering data storage design, bitmap generation, and distributed query optimization.
1) ClickHouse Overview
To address the challenges of storing and processing billions of tag and group data in a CDP, we use ClickHouse (CK), an open‑source high‑performance analytical SQL database developed by Yandex. It is a column‑oriented DBMS designed for OLAP workloads.
Complete DBMS features: DML, DDL, permission control, backup, recovery, distributed computing.
Columnar storage and compression reduce the amount of data scanned and lower I/O.
Relational model with standard SQL makes it easy to learn and integrate.
Data sharding and distributed query execution improve performance.
2) Solving Data Storage Issues
Source tables reside in a data warehouse, but traditional query engines (Hive, Spark, Presto) cannot meet our analytical needs. ClickHouse stores both raw source tables and generated bitmap data, leveraging its columnar storage to achieve lower storage cost per GB and efficient use of SSD and memory.
3) Solving Bitmap Storage Issues
ClickHouse does not have a native bitmap type, but it provides the AggregateFunction data type (e.g., groupBitmap) to store bitmap states. Example table definition:
CREATE TABLE cdp.tag ON CLUSTER DEFAULT (
`code` String,
`value` String,
`version` String,
`offset_bitmap` AggregateFunction(groupBitmap, UInt64)
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/cdp/{shard}/group_1', '{replica}')
PARTITION BY (code) ORDER BY (code)
SETTINGS storage_policy = 'default', use_minimalistic_part_header_in_zookeeper = 1, index_granularity = 8192;Thus both source data and processed bitmap can be stored in ClickHouse, with data pipelines moving data from the warehouse to CK.
4) Solving Bitmap Generation
ClickHouse offers many bitmap functions (see its documentation). Using a source table tag.source, we can generate bitmap tables for education, age, and gender tags:
# Education tag
INSERT INTO cdp.tag_bitmap1
SELECT 'tag_education' AS code,
value AS value,
'version1' AS version,
groupBitmapState(offset) AS offset_bitmap
FROM tag.source
WHERE value IN ('本科','硕士')
GROUP BY code, value;
# Age tag
INSERT INTO cdp.tag_bitmap2
SELECT 'tag_age' AS code,
value AS value,
'version2' AS version,
groupBitmapState(offset) AS offset_bitmap
FROM tag.source
WHERE value IN ('20-24','25-28')
GROUP BY code, value;
# Gender tag
INSERT INTO cdp.tag_bitmap3
SELECT 'tag_gender' AS code,
value AS value,
'version3' AS version,
groupBitmapState(offset) AS offset_bitmap
FROM tag.source
WHERE value IN ('男','女')
GROUP BY code, value;After creating the three tag bitmaps, we can compute the target group (e.g., females with a bachelor's degree aged 20‑35) with:
WITH (
SELECT groupBitmapOrStateOrDefault(offset_bitmap) FROM cdp.tag_bitmap1 WHERE value IN ('本科')
) AS bitmap1,
(
SELECT groupBitmapOrStateOrDefault(offset_bitmap) FROM cdp.tag_bitmap2 WHERE value IN ('20-24','25-28')
) AS bitmap2,
(
SELECT groupBitmapOrStateOrDefault(offset_bitmap) FROM cdp.tag_bitmap3 WHERE value IN ('女')
) AS bitmap3
SELECT bitmapAnd(bitmapAnd(bitmap1, bitmap2), bitmap3) AS bitmap;5) Solving Performance Bottlenecks
If source tables are extremely large, bitmap generation can become slow or time‑out. Deploying ClickHouse in a multi‑shard, multi‑replica cluster (at least two replicas per shard) and using Zookeeper for metadata ensures high performance and high availability. A three‑node Zookeeper ensemble is recommended.
Cluster configuration can be inspected via the system table system.clusters .
ClickHouse distinguishes between distributed tables (logical views that route queries to all nodes) and local tables (actual storage on each node). To avoid overloading a single node, queries can be executed on each node via JDBC and aggregated in the application, with careful handling of aggregate functions such as SUM/COUNT for averages.
6) Current Status and Outlook
All tag and group versions are now stored in ClickHouse, and the latest successful version is used by default. This explains why previously successful groups remain usable even after subsequent processing failures.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
