ClickHouse: Principles, Architecture, and Deployment at Youzan
The article explains ClickHouse’s high‑performance columnar OLAP design, its vectorized execution, sparse primary‑key indexes and MergeTree engines, contrasts it with ROLAP/MOLAP approaches, and details Youzan’s large‑scale deployment—including dual‑replica clusters, ingestion pipelines, routing architecture, current challenges, and future container‑based expansion plans.
This article introduces ClickHouse, a high‑performance open‑source OLAP engine developed by Yandex, and outlines its core principles, the related OLAP components at Youzan, and the practical deployment of ClickHouse in the Youzan data platform.
ClickHouse is a C++‑based columnar store designed for online analytical processing. It originated from Yandex Metrica, a traffic‑analysis product similar to Google Analytics.
OLAP engines can be classified into three types: ROLAP (relational OLAP) that aggregates raw data on‑the‑fly (e.g., Presto, Impala), MOLAP (multidimensional OLAP) that pre‑aggregates data during ingestion (e.g., Kylin, Druid), and Hybrid OLAP, a mix of both.
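The ROLAP/MOLAP trade‑off can be sketched on a toy event table: ROLAP scans raw rows at query time, while MOLAP answers from cells pre‑aggregated at ingestion. The table, functions, and cube layout below are purely illustrative, not any engine's real API.

```python
# Toy (region, day, revenue) fact rows.
rows = [
    ("us", "2020-01-01", 10),
    ("us", "2020-01-01", 5),
    ("eu", "2020-01-01", 7),
    ("us", "2020-01-02", 3),
]

# ROLAP style (Presto/Impala/ClickHouse): aggregate raw rows at query time.
def rolap_sum(rows, region):
    return sum(rev for r, _, rev in rows if r == region)

# MOLAP style (Kylin/Druid): pre-aggregate during ingestion, then look up.
cube = {}
for region, day, rev in rows:
    cube[(region, day)] = cube.get((region, day), 0) + rev

def molap_sum(cube, region):
    return sum(v for (r, _), v in cube.items() if r == region)

print(rolap_sum(rows, "us"))  # scans 4 raw rows
print(molap_sum(cube, "us"))  # reads pre-aggregated cells instead
```

Both return the same answer; the difference is where the aggregation cost is paid — at query time (ROLAP) or at ingestion time (MOLAP).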
Both ClickHouse and Druid evolved from a MySQL → KV‑pre‑compute → engine pipeline, but ClickHouse chose the ROLAP model (while still supporting materialized views), whereas Druid adopted MOLAP.
ClickHouse achieves its speed through a vectorized execution engine, code‑generation that reduces virtual‑function overhead, sparse primary‑key indexes, multi‑core vertical scaling, and distributed horizontal scaling.
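The vectorized‑execution idea can be sketched conceptually: operators consume fixed‑size blocks of column values, so dispatch overhead is paid once per block rather than once per row (and the tight inner loop is SIMD‑friendly in a real C++ engine). Block size and function names below are illustrative only.

```python
# Toy block-based execution: filter + sum over blocks of a column.
BLOCK_SIZE = 4

def blocks(column, size=BLOCK_SIZE):
    # split a column into fixed-size blocks
    for i in range(0, len(column), size):
        yield column[i:i + size]

def filter_block(block, predicate):
    return [v for v in block if predicate(v)]

def sum_blocks(column, predicate):
    total = 0
    for block in blocks(column):                      # one dispatch per block,
        total += sum(filter_block(block, predicate))  # tight loop inside
    return total

print(sum_blocks(list(range(10)), lambda v: v % 2 == 0))  # 0+2+4+6+8 = 20
```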
Key features of ClickHouse include broad SQL support, columnar storage with high compression, real‑time data ingestion, multi‑threaded and multi‑shard parallel processing, a vectorized engine with generated code, and primary‑key sparse indexes plus secondary Bloom‑filter indexes.
Limitations include the lack of low‑latency updates and deletes, weaker point‑lookup performance due to the sparse index, and the absence of transaction support.
The engine’s performance is further boosted by low‑latency ROLAP queries, materialized views for selective pre‑aggregation, and numerous low‑level optimizations such as specialized hash tables for different data types (see the Aggregator example in the original article).
Typical application scenarios include user‑behavior analysis, real‑time log monitoring, and real‑time data‑warehouse queries.
ClickHouse’s internal workflow starts with parsing an incoming SQL statement into an AST, then the Interpreter converts the AST into a QueryPlan and builds a Pipeline. Processors (Source/Transform/Sink) are linked via QueryPlanStep to form a QueryPipeline. The expression subsystem works as follows: ExpressionAnalyzer → ActionsVisitor → ExpressionActions → ExpressionStep → ExpressionTransform.
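The Source/Transform/Sink processor chain can be sketched as a minimal pull pipeline; the class names loosely mirror the QueryPipeline idea but are illustrative, not ClickHouse's real classes.

```python
# Minimal Source -> Transform -> Sink processor chain.
class Source:
    def __init__(self, rows):
        self.rows = rows
    def pull(self):
        yield from self.rows            # emit rows downstream

class Transform:
    def __init__(self, upstream, fn):
        self.upstream, self.fn = upstream, fn
    def pull(self):
        for row in self.upstream.pull():
            yield self.fn(row)          # apply the expression to each row

class Sink:
    def __init__(self, upstream):
        self.upstream = upstream
    def run(self):
        return list(self.upstream.pull())

pipeline = Sink(Transform(Source([1, 2, 3]), lambda x: x * 10))
print(pipeline.run())  # [10, 20, 30]
```

In the real engine the Transform stage would be built from ExpressionActions produced by the ExpressionAnalyzer, and processors exchange column blocks rather than single rows.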
MergeTree, the default table engine, stores data in sorted parts similar to LSM trees but without a MemTable or WAL. Parts are merged in the background; therefore batch writes are required to avoid “Too Many Parts” errors.
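A toy model makes the parts mechanism concrete: every INSERT batch becomes its own immutable sorted part, and a background merge combines them — which is why many tiny inserts lead to "Too Many Parts". This is an illustrative sketch, not MergeTree's on‑disk format.

```python
import heapq

parts = []

def insert_batch(rows):
    # every INSERT batch is written as its own sorted, immutable part
    parts.append(sorted(rows))

def background_merge():
    # k-way merge of sorted parts into a single larger part
    global parts
    parts = [list(heapq.merge(*parts))]

insert_batch([5, 1, 9])
insert_batch([3, 7])
insert_batch([2])
print(len(parts))   # 3 parts: many small inserts accumulate parts
background_merge()
print(parts[0])     # [1, 2, 3, 5, 7, 9]
```

Because there is no MemTable or WAL in front of this process, batching on the client side is the only way to keep the part count low.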
The primary‑key index is a sparse index with a default granularity of 8192 rows. Each column has accompanying *.mrk and *.bin files; queries perform a binary search on the sparse index to locate the required data blocks.
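The sparse‑index lookup described above can be sketched with one mark per granule (granularity 4 here instead of the default 8192, for readability): a binary search over the marks narrows the read to a single granule instead of the whole column. The helper names are illustrative.

```python
import bisect

GRANULARITY = 4
sorted_keys = list(range(0, 40))        # column data, sorted by primary key
marks = sorted_keys[::GRANULARITY]      # sparse index: first key of each granule

def granule_range_for(key):
    # binary-search the sparse marks for the granule that may contain `key`
    g = bisect.bisect_right(marks, key) - 1
    start = g * GRANULARITY
    return start, start + GRANULARITY   # row range to actually read

start, end = granule_range_for(17)
print(start, end)                       # scan rows 16..20 instead of all 40
```

This is also why point lookups pay a penalty: even a single‑row query must read and filter an entire granule.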
ClickHouse provides several MergeTree variants, each with specific semantics:
- ReplacingMergeTree – deduplicates rows with the same key on merge (can be forced with OPTIMIZE … FINAL).
- CollapsingMergeTree – asynchronously deletes rows marked with a sign column.
- VersionedCollapsingMergeTree – adds a version column to support out‑of‑order inserts.
- AggregatingMergeTree – incremental aggregation, typically for materialized‑view data.
- ReplicatedXXXMergeTree – adds replication to any of the above for high availability.
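ReplacingMergeTree's semantics — duplicates by key are collapsed only when parts are merged (or OPTIMIZE … FINAL is forced), keeping the highest version — can be sketched as follows. The part layout and function name are illustrative.

```python
def merge_replacing(parts):
    # keep, per key, only the row with the highest version
    latest = {}
    for part in parts:
        for key, version, value in part:
            if key not in latest or version > latest[key][0]:
                latest[key] = (version, value)
    return sorted((k, v, val) for k, (v, val) in latest.items())

part1 = [("user1", 1, "a"), ("user2", 1, "b")]
part2 = [("user1", 2, "a2")]   # a newer version of user1, in a later part

print(merge_replacing([part1, part2]))
# [('user1', 2, 'a2'), ('user2', 1, 'b')]
```

Until the merge happens, a plain SELECT can still see both versions of user1 — which is exactly why OPTIMIZE … FINAL is needed when immediate deduplication matters.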
Youzan’s OLAP stack evolved as follows: Presto was introduced in 2018 for ad‑hoc HDFS queries; Druid was added in 2019 for real‑time analytics; Kylin was adopted for precise offline bitmap queries; ClickHouse entered production around 2020 to handle massive real‑time detail queries.
Currently Youzan operates two ClickHouse clusters with 15 shards (dual‑replica), ingesting roughly 400 billion rows per day at 250‑300 MB/s, achieving an average query latency of about 400 ms.
Cluster deployment includes an LVS load balancer, an Apisix gateway (based on OpenResty) for rate‑limiting, circuit‑breaking, and logging, distributed tables for query routing, shard‑and‑replication for high availability, and Zookeeper for coordination.
Data ingestion is handled in two ways: offline writes use a modified WaterDrop (Spark) job that acquires a partition‑level lock, writes to a temporary table, and atomically swaps it; real‑time writes use Flink SQL or Flink JAR jobs that stream Kafka data directly into ClickHouse tables.
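The offline write path — lock, write to a staging table, atomic swap — can be sketched as below. The table names, the lock, and the swap primitive are all illustrative; they are not WaterDrop's or ClickHouse's real API.

```python
import threading

tables = {"target": [1, 2, 3]}          # the live table readers query
partition_lock = threading.Lock()

def offline_load(new_rows):
    with partition_lock:                # 0. take the partition-level lock
        tables["target_tmp"] = list(new_rows)        # 1. write to temp table
        tables["target"], tables["target_tmp"] = (
            tables["target_tmp"], tables["target"])  # 2. atomic swap
        del tables["target_tmp"]        # 3. drop the old data

offline_load([10, 20])
print(tables["target"])   # [10, 20]
```

The point of the pattern is that readers never observe a half‑loaded partition: they see either the old data or the new data, never a mixture.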
Technical details worth noting:
- The JDBC HTTP client is preferred for better proxy compatibility.
- Batch writes are mandatory to keep the number of parts manageable.
- Writes target local tables while reads go through distributed tables, reducing the load on Zookeeper.
- Data sharding can be random, round‑robin, or hash‑based (the latter maps rows to specific shards by a hash key, so rows with the same key are co‑located).
- When querying distributed tables, the distributed_group_by_no_merge setting lets each shard complete its aggregation locally and skip the merge step on the initiating node — safe when the data is hash‑sharded by the grouping key.
- Bitmap‑based tag pre‑computation and the sequenceMatch() function are used for DMP/CDP user segmentation.
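The three sharding strategies can be sketched in a few lines; the shard count, hash function, and row shape are illustrative, not Youzan's actual configuration.

```python
import random
import zlib

SHARDS = 4
_rr = 0

def shard_random(_row):
    # random: spreads load evenly, no locality guarantees
    return random.randrange(SHARDS)

def shard_round_robin(_row):
    # round-robin: even distribution of write batches
    global _rr
    s = _rr % SHARDS
    _rr += 1
    return s

def shard_hash(row, key="user_id"):
    # hash-based: the same key always lands on the same shard,
    # enabling shard-local GROUP BY / deduplication by that key
    return zlib.crc32(str(row[key]).encode()) % SHARDS

row = {"user_id": 42, "event": "click"}
print(shard_hash(row) == shard_hash(row))  # deterministic -> True
```

Hash‑based sharding is what makes settings like distributed_group_by_no_merge safe: every group lives entirely on one shard.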
Clients issue SQL queries via the JDBC client, which sends HTTP requests to Apisix; the gateway routes the request to the appropriate ClickHouse shard, typically querying a distributed table.
Youzan contributed a fix to the clickhouse‑copier tool for handling non‑partitioned tables (Fix #15235).
Key application scenarios include DMP/CDP tag‑based audience selection using ClickHouse bitmap pre‑computation, real‑time user‑segmentation, and SCRM features such as state back‑tracking, flexible dimensions, dynamic selection, and cross‑store de‑duplication.
Current pain points are data rebalancing after scaling, sub‑optimal join performance (especially for large‑scale distributed joins), and the inability to efficiently update individual rows.
Future plans involve containerizing ClickHouse deployments for better elasticity, expanding its usage to more business lines (real‑time data‑warehouse, reporting), enhancing platform features (multi‑tenant isolation, rate‑limiting, monitoring), maintaining a dual‑track with Druid for stability, and exploring newer OLAP technologies such as Apache Doris, TiDB+TiFlash, Iceberg/Hudi data lakes, GPU‑accelerated engines, and cloud‑native solutions like Snowflake.
For more information, readers can contact the Youzan big‑data team at [email protected] or refer to the listed external links.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.