Applying ClickHouse for Real‑Time Advertising Audience Estimation at ByteDance
This article details how ByteDance leverages ClickHouse to power large‑scale advertising audience estimation, profiling, and statistical analysis, describing the challenges of massive data, strict latency requirements, and the evolution from a simple tag‑uid table to a bitmap‑based architecture with extensive parallel and cache optimizations.
ByteDance's advertising platform processes billions of users and uses ClickHouse as the core engine for online analysis, covering audience estimation, profiling, and statistical analysis.
Audience estimation requires fast set operations (intersection, union, complement) on large user groups, with response time under 5 seconds.
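As a minimal illustration of these set operations, the sketch below uses plain Python sets with made-up uids. The tag names and the `all_users` universe are assumptions for illustration only; groups at production scale would not be held as in-memory Python sets.

```python
# Hypothetical audiences: each tag maps to the set of uids carrying it.
all_users = set(range(1, 11))      # toy universe of 10 users
sports = {1, 2, 3, 4, 5}
travel = {4, 5, 6, 7}
gaming = {2, 9}

both = sports & travel             # intersection: users with both tags
either = sports | travel           # union: users with at least one tag
not_gaming = all_users - gaming    # complement relative to the universe

# Audience estimate for "sports AND travel AND NOT gaming".
estimate = len(both & not_gaming)
print(estimate)
```

The estimate is just the cardinality of the resulting set, which is why the cost of the whole feature is dominated by how quickly those set operations can run.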
Challenges include massive data volume, complex queries, and strict latency.
ClickHouse was chosen over Druid, Elasticsearch and Spark for its speed on wide tables and flexible architecture.
Version 1 stores tag‑uid pairs in a two‑column table and translates set operations into SQL with sub‑queries; optimizations focus on parallel execution and fast distinct counting.
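One way to picture the translation step is a small recursive function that turns a tag expression into nested sub-queries. This is only a sketch, not ByteDance's translator: the nested-tuple expression encoding, the `uid_query` and `estimate` helpers, and the sample rows are hypothetical, and SQLite stands in for ClickHouse so the example runs locally. The SQL it generates is also shaped slightly differently from the article's hand-written query.

```python
import sqlite3

def uid_query(expr):
    """Build SQL returning the uids matched by a tag expression.

    expr is either an integer tag_id or a tuple (op, left, right)
    with op in {"and", "or"}; this encoding is an assumption.
    """
    if isinstance(expr, int):
        return f"SELECT DISTINCT uid FROM tag_uid_map WHERE tag_id = {expr}"
    op, left, right = expr
    lhs, rhs = uid_query(left), uid_query(right)
    if op == "and":   # intersection expressed as an IN sub-query
        return f"SELECT uid FROM ({lhs}) WHERE uid IN ({rhs})"
    if op == "or":    # union of the two sub-queries
        return f"SELECT uid FROM ({lhs}) UNION SELECT uid FROM ({rhs})"
    raise ValueError(f"unsupported operator: {op}")

def estimate(conn, expr):
    """Count the distinct uids matching expr."""
    sql = f"SELECT count(DISTINCT uid) FROM ({uid_query(expr)})"
    return conn.execute(sql).fetchone()[0]

# Toy two-column table: tags 1, 2, 3 play the roles of A, B, C.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag_uid_map (tag_id INTEGER, uid INTEGER)")
rows = (
    [(1, u) for u in (1, 2, 3, 4)]
    + [(2, u) for u in (2, 5)]
    + [(3, u) for u in (3, 4, 6)]
)
conn.executemany("INSERT INTO tag_uid_map VALUES (?, ?)", rows)

print(estimate(conn, ("and", 1, ("or", 2, 3))))
```

Each extra operator adds another level of sub-query, which hints at why this layout gets expensive as expressions grow.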
For the expression A & (B | C), the Version 1 query is:
SELECT count(DISTINCT uid)
FROM tag_uid_map
WHERE tag_id = A
  AND uid IN (
    SELECT DISTINCT uid
    FROM tag_uid_map
    WHERE (tag_id = B) OR (tag_id = C)
)
Version 2 replaces row-level detail storage with a Bitmap64 column backed by RoaringBitmap, which reduces storage space and simplifies queries. Further optimizations include data sharding, parallel bitmap computation, cache layers, and low‑level instruction acceleration.
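The combination of bitmaps and sharding can be sketched as follows. This is an illustration under stated assumptions, not the production design: Python integers stand in for RoaringBitmap, the shard count, the `uid % NUM_SHARDS` routing rule, and all function names are hypothetical, and a thread pool only mimics the structure of parallel execution (CPython threads do not speed up CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # hypothetical shard count

def build_shards(tag_to_uids):
    """Route every uid to one shard and keep one int bitmap per (shard, tag)."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for tag, uids in tag_to_uids.items():
        for uid in uids:
            shard = shards[uid % NUM_SHARDS]
            # Within a shard, uid // NUM_SHARDS is a unique bit position.
            shard[tag] = shard.get(tag, 0) | (1 << (uid // NUM_SHARDS))
    return shards

def count_on_shard(shard):
    """Evaluate A & (B | C) on one shard and return its cardinality."""
    a, b, c = (shard.get(tag, 0) for tag in ("A", "B", "C"))
    return bin(a & (b | c)).count("1")

def estimate_sharded(shards):
    """Run every shard independently, then add up the per-shard counts."""
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        return sum(pool.map(count_on_shard, shards))

shards = build_shards({"A": {1, 2, 3, 4}, "B": {2, 5}, "C": {3, 4, 6}})
print(estimate_sharded(shards))
```

Because each uid lives in exactly one shard, the per-shard results are disjoint and can simply be summed, so no bitmaps need to be merged across shards for a count.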
Extensive engineering changes to the read‑execute model, block size, secondary indexes, and caching dramatically cut query latency, storage size, and resource usage, achieving sub‑5‑second response for most queries.
Future work will target deeper computation and data optimizations, smarter caching, and richer expression support.
DataFunTalk