Big Data 28 min read

How ClickHouse Powers YiBei's Scalable Advertising Data Platform

This article details YiBei's advertising data platform built on ClickHouse, covering business requirements, why ClickHouse was chosen over Druid, storage engine and compression choices, real‑time and offline ingestion pipelines, partitioning, Zookeeper bottlenecks, atomic data replacement, and testing and release strategies for a high‑throughput, low‑latency ad analytics system.

dbaplus Community

Jan 12, 2022

How ClickHouse Powers YiBei's Scalable Advertising Data Platform

Business Scenario Overview

YiBei operates a first‑party seller advertising platform where merchants pay per click or per sale to boost product visibility and revenue. Sellers need real‑time reports on click‑through rates, sales conversion, and other metrics across multiple ad placements such as home page, search, product pages, and mobile interfaces.

Why ClickHouse?

Compared with Druid, ClickHouse offers superior columnar compression, partition‑key optimization, full SQL support (joins, sub‑queries, primary‑key ordering), and a simpler MPP architecture that better utilizes CPU resources. Operationally, ClickHouse requires fewer node types and less complex component management, reducing maintenance overhead.

Benchmark tests show ClickHouse ingesting billions of rows in roughly 50 minutes per day, while Druid can take several hours depending on cluster load.

ClickHouse vs Druid performance comparison

Table Engine and Primary‑Key Design

YiBei uses ClickHouse's ReplicatedMergeTree engine to handle massive user‑behavior streams while providing high availability through replication. Data is pre‑aggregated on insert, reducing storage and query load.

Primary keys are kept minimal to fit in memory; non‑essential dimensions are removed from the key to improve query speed.

Compression and LowCardinality Optimizations

ClickHouse's default LZ4 compression is complemented by higher‑ratio algorithms such as ZSTD and LZ4HC for different column types. In production, LZ4HC level 6 saves ~30 % storage but reduces ingestion speed by ~60 %.

LowCardinality encoding on low‑cardinality columns (e.g., enums) cuts storage by up to 75 % in extreme tests.

Real‑Time Data Ingestion

Kafka topics are sharded per seller ID; each topic partition exceeds the number of ClickHouse replicas to allow parallel consumption. The pipeline writes directly from Kafka to local ClickHouse tables, achieving million‑level rows per second with sub‑second latency.

Offline Data Ingestion and Replacement

Batch jobs run on Spark via a LIVY server, generating pre‑partitioned files on HDFS. ClickHouse replica nodes pull these files, load them into temporary tables, and then replace whole partitions using the ATTACH / DETACH API, ensuring the old data remains invisible until the swap completes.

Partition Data Replacement and Atomicity

Version columns (A, B) and an active flag stored in a ClickHouse dictionary track the two most recent data versions. Swapping versions is performed by atomically updating the active column, providing instant visibility changes and rollback capability.

Zookeeper Pressure and Mitigation

Both real‑time and offline pipelines rely on Zookeeper for coordination, leading to thousands of client connections and session timeouts under heavy load. The solution involved reducing unnecessary global operations, splitting Zookeeper clusters by shard, and tuning operationTimeout to alleviate contention.

Data Quality Assurance

Quality checks occur at multiple stages: Spark output validation, ClickHouse client import verification, and a dedicated monitoring platform that alerts on anomalies, enabling rapid rollback of problematic versions.

Query Architecture

External APIs expose a unified data access layer. Queries are routed to ClickHouse for most workloads; high‑QPS or heavy‑aggregation requests are delegated to auxiliary stores via the query service layer.

Testing and Release Process

Major version upgrades employ a dual‑source strategy: traffic is gradually shifted from the old data source to the new one while mirrored queries validate consistency. This approach ensures safe rollouts without service disruption.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Real-time Advertising Storage Optimization Lambda architecture

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.