How ClickHouse Powers Real-Time User Profiling for Millions of Users
This article explains how Suning built a massive user‑tag platform using ClickHouse and RoaringBitmap to enable second‑level user profiling for over 600 million users, covering tag generation, storage, query architecture, and performance optimizations for marketing campaigns.
Business Scenario
To run a marketing campaign such as a Double 11 promotion, Suning needs to select a target audience of about ten million users from a pool of over 600 million, then analyze their profiles to ensure they match the desired characteristics.
Tag Selection and Estimation
Users are filtered by combining attribute tags (e.g., gender, age, membership level) and behavior tags (e.g., browsing, adding to cart, purchasing). The platform estimates the size of the resulting audience in real time.
Example tag logic: users aged 25‑36 AND with "smart home" characteristics, EXCLUDING those who spent less than 10 CNY in the past 30 days.
{ {用户年龄 25-36} ∩ {智能家居人群} } - {30天消费小于10元}Technical Challenges
Large number of audience packages.
Each package contains a massive user base.
Real‑time computation of results is difficult.
Previous Solution (ElasticSearch)
The original system stored tags in a wide table in ElasticSearch, requiring batch insertion and later association before querying, which caused delays when upstream data pipelines lagged.
Why ClickHouse?
ClickHouse is an open‑source columnar database optimized for OLAP queries, offering superior query performance. Suning integrated RoaringBitmap into ClickHouse to provide efficient bitmap operations for tag storage and analysis.
Bitmap Integration
Each tag is stored as a bitmap of user IDs, enabling fast set operations (union, intersection, difference) while reducing storage space.
New Architecture Based on ClickHouse
The architecture consists of a ClickHouse Manager for cluster and metadata management, Spark jobs that generate and import tag data from HDFS into ClickHouse, and a Proxy layer that translates user queries into ClickHouse SQL.
Tag tables use a row‑to‑column design where each tag name becomes a partition and the uv column stores a bitmap of user IDs.
CREATE TABLE ch_label_string (
labelname String, -- tag name
labelvalue String, -- tag value
uv AggregateFunction(groupBitmap, UInt64) -- user ID bitmap
) ENGINE = AggregatingMergeTree()
PARTITION BY labelname
ORDER BY (labelname, labelvalue)
SETTINGS index_granularity = 128;Data is sharded by user ID, allowing parallel import at ~1.5 million rows/second per node. Queries run on a single shard, and results are summed across shards, exploiting ClickHouse’s distributed parallelism.
Query Flow
Using a WITH clause, the system builds bitmap objects for the selected audience, then applies bitmap functions for set operations. Distributed proxy tables route queries to the appropriate shards without storing data themselves.
-- Example query
CREATE TABLE ch_agent_user (agentname String) ENGINE = MergeTree() PARTITION BY agentname ORDER BY (agentname) SETTINGS index_granularity = 8192;
CREATE TABLE ch_agent_dist_user AS ch_agent_user ENGINE = Distributed('cluster_test','test','ch_agent_user',cityHash64(agentname));
SELECT sum(user_number) AS user_number
FROM ch_agent_dist_user
RIGHT JOIN (
WITH (
SELECT groupBitmapState(userid) AS users0 FROM ch_label_string WHERE labelname = 'T'
) AS users0
SELECT 'agent' AS agentname, bitmapCardinality(users0) AS user_number
) USING (agentname) SETTINGS enable_scalar_subquery_optimization = 0;User Profiling Evolution
Three architecture versions were implemented:
V1 – Wide Table: A single table with many feature columns; suffered from schema changes and slow group‑by on large audiences.
V2 – Bitmap: Each feature value stored as a bitmap; improved performance but still slow for very large audiences due to disk deserialization.
V3 – Join Table Engine: Tags stored in ClickHouse’s Join engine, keeping bitmaps in memory; reduced average query time to 1‑2 seconds for tens of millions of users.
Summary
Integrating RoaringBitmap into ClickHouse and leveraging the Join table engine dramatically improved the tag platform’s data analysis capabilities, achieving sub‑2‑second query latency, parallel tag generation, near‑real‑time updates, and reduced hardware costs compared to the previous ElasticSearch solution.
Future Plans
Upgrade to 64‑bit RoaringBitmap.
Introduce a smarter caching layer to boost query concurrency.
Support offline generation of ClickHouse data files for faster tag readiness.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Suning Technology
Official Suning Technology account. Explains cutting-edge retail technology and shares Suning's tech practices.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
