How ClickHouse Powers Real-Time Self-Service Analytics at Scale
Facing massive daily data volumes and complex, ad‑hoc analytical needs, Zhaozhuan’s engineering team evaluated multiple OLAP engines and chose ClickHouse, then built a four‑layer self‑service analytics platform, detailing architecture, use‑cases, performance tuning, large‑scale joins, and future roadmap challenges.
1. Self‑Service Analytics OLAP Technology Selection
Zhaozhuan generates massive daily data from business operations and event tracking, requiring flexible, real‑time multidimensional analysis that traditional offline pre‑computation cannot satisfy. The team evaluated OLAP engines based on performance, flexibility, and complexity.
1.1 Background
The platform must support high‑volume data, precise deduplication, and complex models such as retention and funnel analysis.
1.2 OLAP Selection Considerations
Performance : data scale (hundreds of millions to billions) and query latency (ms‑level to second‑level).
Flexibility : ability to query both aggregated results and raw data, ingest offline and real‑time streams, and handle high‑concurrency ad‑hoc queries.
Complexity : simple architecture, low learning curve, easy operations, strong extensibility.
Based on these criteria, the team compared MOLAP engines (Kylin, Druid) and ROLAP engines (Impala, Presto). ClickHouse was selected for its columnar storage, vectorized execution, and strong query performance despite lacking full transaction support.
2. Zhaozhuan Platform Self‑Service Analytics Scenario
2.1 System Overview
The platform consists of two core functions: event data management (metadata, quality monitoring, alerts) and self‑service analytics (multidimensional, multi‑metric analysis, user profiling, AB testing).
2.2 System Architecture
The architecture is divided into four layers:
Data Ingestion Layer : Business data from MySQL and event logs from LOG files. Flink‑CDC streams MySQL changes to Kafka; Flume agents collect logs and forward them to Kafka (real‑time) and HDFS (offline).
Data Storage Layer : Kafka and HDFS store raw streams; ClickHouse ingests cleaned data into wide tables.
Data Service Layer : Unified HTTP service exposing SQL‑like client tools.
Data Application Layer : Self‑service analytics platform and user‑profile platform built on ClickHouse, providing event analysis, KPI calculations, retention, LTV, funnel, behavior analysis, and user segmentation.
2.3 ClickHouse Business Scenarios
Interactive Report Queries : Build wide tables for user behavior features; support second‑level responses for multi‑dimensional, multi‑metric calculations.
User Profile System : Construct feature tables for detailed user inspection and audience selection.
AB Testing : Use functions like grouparray to deliver experiment results within seconds.
Monitoring System : Real‑time metrics from Flink are written to ClickHouse and visualized via Grafana.
3. ClickHouse Optimization Practices
3.1 Memory Optimization
When query memory exceeds a server’s limit, ClickHouse throws an out‑of‑memory exception. Strategies include:
Use approximate functions for COUNT and DISTINCT to reduce memory usage.
Adjust max_bytes_before_external_group_by and max_bytes_before_external_sort for large GROUP BY / ORDER BY operations.
3.2 Performance Tuning Parameters
Key settings tuned at Zhaozhuan: max_concurrent_queries: increased from 100 to 150 to match cluster capacity. max_memory_usage: set to ~80% of total RAM per query. max_memory_usage_for_all_queries: 80‑90% of total RAM across all queries. max_memory_usage_for_user & max_bytes_before_external_sort: trigger disk spill for heavy GROUP BY / ORDER BY. background_pool_size: raised from 16 to 32 to accelerate background merges.
3.3 Billion‑Row Joins
Data is pre‑partitioned by join key, ensuring that rows with the same key reside on the same node. This allows local joins without cross‑node data movement, dramatically improving join performance for large datasets.
4. ClickHouse Future Planning and Outlook
4.1 Practical Pain Points
High‑concurrency capability is limited; official QPS recommendation is ~100 queries/s.
Lack of transactional DDL and distributed transaction support; heavy reliance on ZooKeeper for replication state.
Missing row‑level UPDATE/DELETE for real‑time use cases.
No automatic re‑balance; manual data migration required during scaling.
4.2 Future Directions
Platform‑as‑a‑service and fault‑standardization: multi‑tenant isolation, rate‑limiting, fine‑grained monitoring and alerting.
Containerized deployment for compute‑storage separation and elastic scaling with automatic data re‑balancing.
Intelligent service architecture: introduce Doris for high‑concurrency scenarios and dynamically select the optimal engine.
Kernel‑level optimizations: real‑time write consistency, distributed transaction support, and removal of ZooKeeper dependency.
5. Summary
To handle massive data volumes while delivering sub‑second query performance across diverse analytical scenarios, Zhaozhuan adopted ClickHouse as the core engine. Although ClickHouse excels in columnar storage and vectorized execution, it lacks built‑in elasticity, high‑concurrency robustness, and full transactional features, prompting ongoing efforts to containerize the service, improve scaling mechanisms, and integrate complementary engines for future cloud‑native deployments.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
