How ClickHouse Powers Real-Time Self-Service Analytics at Scale
This article examines why ClickHouse was chosen as the OLAP engine for a massive self‑service analytics platform, describes the system architecture, shares concrete memory and performance tuning parameters, and outlines current challenges and future roadmap for large‑scale real‑time data analysis.
1. OLAP Technology Selection for Self‑Service Analytics
Background
The platform generates massive event data (exposures, clicks, displays) that require precise deduplication, retention, funnel calculations and other complex models. Offline pre‑aggregation cannot keep up with the need for ad‑hoc, multi‑dimensional reports, leading to long development cycles and stale data.
Selection Criteria
Three dimensions were evaluated: performance, flexibility, and complexity.
Performance – data volume (hundreds of millions to billions) and query latency (ms‑/s‑level).
Flexibility – support for both aggregated and raw queries, ingestion of offline and real‑time data, and high‑concurrency ad‑hoc queries.
Complexity – simple architecture, low operational overhead, easy onboarding, and strong extensibility.
Open‑source OLAP candidates were compared:
MOLAP (Kylin, Druid) – pre‑computed results give sub‑second latency but lack flexibility.
MPP‑based ROLAP (Impala, Presto) – real‑time ingestion and flexible queries, but performance depends heavily on resources.
ClickHouse – columnar storage, vectorized execution, millisecond‑level response, but limited distributed join and no full transaction support.
Doris – easy ops and transactional support, yet missing built‑in funnel/retention functions.
Considering the criteria, ClickHouse was selected as the core analysis engine.
2. The Gauss Self‑Service Analytics Platform
System Overview
The platform consists of two main functions: point‑event metadata management and self‑service analytics that provide multi‑dimensional, multi‑metric cross‑analysis, user‑profile tagging, audience selection, and AB‑test capabilities.
Architecture
Four layers are defined:
Data Ingestion Layer – business data from MySQL and log events are captured via Flink‑CDC to Kafka and Flume, then streamed to real‑time (Kafka) and offline (HDFS) paths.
Data Storage Layer – Kafka and HDFS store raw streams; ClickHouse stores cleaned wide tables for fast querying.
Data Service Layer – unified HTTP service exposing SQL‑based client tools.
Data Application Layer – the Gauss analytics UI and a user‑profile service built on top of ClickHouse.
ClickHouse Use Cases in the Platform
Behavior analysis – materialized views and aggregate tables provide real‑time metrics for activity pages.
AB‑test analysis – Flink streams Kafka events to ClickHouse, enabling real‑time experiment result calculation.
3. ClickHouse Optimization Practices
3.1 Memory Optimization
Memory‑intensive queries (e.g., COUNT/DISTINCT, GROUP BY, ORDER BY) can exceed server limits. Mitigations include using approximate functions and tuning max_bytes_before_external_group_by and max_bytes_before_external_sort.
3.2 Performance‑Tuning Parameters
max_concurrent_queries– raised from 100 to 150 to match cluster capacity. max_memory_usage – set to ~80% of total RAM per query. max_memory_usage_for_all_queries – set to 80‑90% of total RAM for the node. max_memory_usage_for_user & max_bytes_before_external_sort – trigger disk‑based spill for large GROUP BY/ORDER BY. background_pool_size – increased from 16 to 32 to accelerate background merges.
3.3 Billion‑Row JOIN
Data is pre‑partitioned by join key, ensuring that matching rows reside on the same node. Local joins are performed first, followed by a final aggregation, eliminating costly cross‑node distributed joins.
4. Future Plans and Outlook for ClickHouse
4.1 Current Pain Points
Limited high‑concurrency capability (officially ~100 QPS).
No transactional DDL or distributed transaction support; heavy reliance on ZooKeeper.
Lack of row‑level UPDATE/DELETE for real‑time use cases.
No automatic re‑balance; manual data re‑distribution required during scaling.
4.2 Roadmap
Platform‑level services and standardized fault handling, including multi‑tenant isolation, rate‑limiting, and fine‑grained monitoring.
Containerized deployment for compute‑storage separation and elastic scaling with automatic data re‑balancing.
Intelligent service architecture – dynamically selecting ClickHouse or Doris based on workload characteristics.
Kernel‑level enhancements: real‑time write consistency, distributed transaction support, and removal of ZooKeeper dependency.
5. Summary
The article presents the OLAP ecosystem landscape, details the Gauss platform’s architecture, shares concrete ClickHouse tuning techniques for massive data workloads, and discusses existing limitations and future directions needed to achieve cloud‑native elasticity and full feature parity.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
