Databases 10 min read

Using ClickHouse for Efficient Tag Bitmap Storage and Group Computation in a CDP

This article explains how ClickHouse’s columnar storage, bitmap functions, and distributed architecture can be leveraged to store billions of tag bitmaps, combine them efficiently, and support fast group calculations for customer data platforms, while addressing data‑warehouse integration, storage format, and performance challenges.

JD Tech Talk
JD Tech Talk
JD Tech Talk
Using ClickHouse for Efficient Tag Bitmap Storage and Group Computation in a CDP

The article begins by describing the background of a Customer Data Platform (CDP) where billions of user tags and groups are stored, and raw warehouse data must be transformed into business‑ready source tables such as gender, education, age, and purchase behavior.

It then outlines three key problems: converting many source tables into tag bitmaps, deciding where to store those bitmap files for easy reuse, and performing fast bitmap combination to generate target audiences (e.g., females with a bachelor’s degree aged 20‑35).

Solution – ClickHouse Overview

ClickHouse, an open‑source high‑performance OLAP columnar DBMS, is introduced. Its main features—full SQL support, column‑oriented storage with strong compression, distributed sharding, and built‑in bitmap functions—make it suitable for the CDP use case.

Data Storage

Instead of row‑based storage in Hive/Spark/Presto, the article shows how ClickHouse stores both raw source data and processed bitmaps. Example table‑creation SQL is provided:

CREATE TABLE cdp.tag ON CLUSTER DEFAULT (
  `code` String,
  `value` String,
  `version` String,
  `offset_bitmap` AggregateFunction(groupBitmap, UInt64)
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/cdp/{shard}/group_1', '{replica}')
PARTITION BY (code) ORDER BY (code)
SETTINGS storage_policy = 'default', use_minimalistic_part_header_in_zookeeper = 1, index_granularity = 8192;

Both raw rows and bitmap aggregates can be persisted in ClickHouse, with data pipelines feeding the warehouse data into the cluster.

Bitmap Generation

Using ClickHouse’s bitmap functions, three separate INSERT statements generate bitmaps for education, age, and gender tags:

# Education tag
INSERT INTO cdp.tag_bitmap1
SELECT 'tag_education' AS code, value AS value, 'version1' AS version,
       groupBitmapState(offset) AS offset_bitmap
FROM tag.source
WHERE value IN ('本科','硕士')
GROUP BY code, value;

# Age tag
INSERT INTO cdp.tag_bitmap2
SELECT 'tag_age' AS code, value AS value, 'version2' AS version,
       groupBitmapState(offset) AS offset_bitmap
FROM tag.source
WHERE value IN ('20-24','25-28')
GROUP BY code, value;

# Gender tag
INSERT INTO cdp.tag_bitmap3
SELECT 'tag_gender' AS code, value AS value, 'version3' AS version,
       groupBitmapState(offset) AS offset_bitmap
FROM tag.source
WHERE value IN ('男','女')
GROUP BY code, value;

After the three bitmaps are created, a final query combines them to obtain the desired audience (bachelor’s degree, age 20‑35, female):

WITH (
  SELECT groupBitmapOrStateOrDefault(offset_bitmap) FROM cdp.tag_bitmap1 WHERE value IN ('本科')
) AS `bitmap1`,
(
  SELECT groupBitmapOrStateOrDefault(offset_bitmap) FROM cdp.tag_bitmap2 WHERE value IN ('20-24','25-28')
) AS `bitmap2`,
(
  SELECT groupBitmapOrStateOrDefault(offset_bitmap) FROM cdp.tag_bitmap3 WHERE value IN ('女')
) AS `bitmap3`
SELECT bitmapAnd(bitmapAnd(`bitmap1`, `bitmap2`), `bitmap3`) AS bitmap;

Performance Optimisation

To avoid long processing times on massive source tables, the article recommends deploying ClickHouse in a multi‑shard, multi‑replica configuration with Zookeeper‑managed metadata, ensuring high performance and high availability.

It also explains the distinction between distributed tables (logical views that route queries to all shards) and local tables (actual data storage on each node), and advises aggregating results in the application when necessary.

Current Status and Outlook

All tag and group versions are now stored in ClickHouse, guaranteeing that the latest successful version is used for any subsequent computation. The article concludes with references to Chinese and English ClickHouse documentation and a link to a related JD Tech case study.

ClickHouseData WarehouseBitmapOLAPcolumnar storageCDP
JD Tech Talk
Written by

JD Tech Talk

Official JD Tech public account delivering best practices and technology innovation.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.