Using ClickHouse for Efficient Tag Bitmap Storage and Group Computation in a CDP
This article explains how ClickHouse’s columnar storage, bitmap functions, and distributed architecture can be leveraged to store billions of tag bitmaps, combine them efficiently, and support fast group calculations for customer data platforms, while addressing data‑warehouse integration, storage format, and performance challenges.
The article begins by describing the background of a Customer Data Platform (CDP) where billions of user tags and groups are stored, and raw warehouse data must be transformed into business‑ready source tables such as gender, education, age, and purchase behavior.
It then outlines three key problems: converting many source tables into tag bitmaps, deciding where to store those bitmap files for easy reuse, and performing fast bitmap combination to generate target audiences (e.g., females with a bachelor’s degree aged 20‑35).
Solution – ClickHouse Overview
ClickHouse, an open‑source high‑performance OLAP columnar DBMS, is introduced. Its main features—full SQL support, column‑oriented storage with strong compression, distributed sharding, and built‑in bitmap functions—make it suitable for the CDP use case.
Data Storage
Instead of row‑based storage in Hive/Spark/Presto, the article shows how ClickHouse stores both raw source data and processed bitmaps. Example table‑creation SQL is provided:
CREATE TABLE cdp.tag ON CLUSTER DEFAULT (
`code` String,
`value` String,
`version` String,
`offset_bitmap` AggregateFunction(groupBitmap, UInt64)
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/cdp/{shard}/group_1', '{replica}')
PARTITION BY (code) ORDER BY (code)
SETTINGS storage_policy = 'default', use_minimalistic_part_header_in_zookeeper = 1, index_granularity = 8192;Both raw rows and bitmap aggregates can be persisted in ClickHouse, with data pipelines feeding the warehouse data into the cluster.
Bitmap Generation
Using ClickHouse’s bitmap functions, three separate INSERT statements generate bitmaps for education, age, and gender tags:
# Education tag
INSERT INTO cdp.tag_bitmap1
SELECT 'tag_education' AS code, value AS value, 'version1' AS version,
groupBitmapState(offset) AS offset_bitmap
FROM tag.source
WHERE value IN ('本科','硕士')
GROUP BY code, value;
# Age tag
INSERT INTO cdp.tag_bitmap2
SELECT 'tag_age' AS code, value AS value, 'version2' AS version,
groupBitmapState(offset) AS offset_bitmap
FROM tag.source
WHERE value IN ('20-24','25-28')
GROUP BY code, value;
# Gender tag
INSERT INTO cdp.tag_bitmap3
SELECT 'tag_gender' AS code, value AS value, 'version3' AS version,
groupBitmapState(offset) AS offset_bitmap
FROM tag.source
WHERE value IN ('男','女')
GROUP BY code, value;After the three bitmaps are created, a final query combines them to obtain the desired audience (bachelor’s degree, age 20‑35, female):
WITH (
SELECT groupBitmapOrStateOrDefault(offset_bitmap) FROM cdp.tag_bitmap1 WHERE value IN ('本科')
) AS `bitmap1`,
(
SELECT groupBitmapOrStateOrDefault(offset_bitmap) FROM cdp.tag_bitmap2 WHERE value IN ('20-24','25-28')
) AS `bitmap2`,
(
SELECT groupBitmapOrStateOrDefault(offset_bitmap) FROM cdp.tag_bitmap3 WHERE value IN ('女')
) AS `bitmap3`
SELECT bitmapAnd(bitmapAnd(`bitmap1`, `bitmap2`), `bitmap3`) AS bitmap;Performance Optimisation
To avoid long processing times on massive source tables, the article recommends deploying ClickHouse in a multi‑shard, multi‑replica configuration with Zookeeper‑managed metadata, ensuring high performance and high availability.
It also explains the distinction between distributed tables (logical views that route queries to all shards) and local tables (actual data storage on each node), and advises aggregating results in the application when necessary.
Current Status and Outlook
All tag and group versions are now stored in ClickHouse, guaranteeing that the latest successful version is used for any subsequent computation. The article concludes with references to Chinese and English ClickHouse documentation and a link to a related JD Tech case study.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.