How ClickHouse Powers Real-Time OLAP Monitoring at Suning Big Data Platform
This article explains how Suning's big‑data center leverages ClickHouse’s columnar OLAP engine and a full‑chain monitoring platform to achieve real‑time query tracing, slow‑query analysis, cluster health checks, and resource‑level alerts across diverse business scenarios.
Introduction
ClickHouse is a high‑performance OLAP analysis engine known for its columnar storage, LSM‑Tree engine, vectorized execution, asynchronous merge and mutation mechanisms, and MPP+SMP concurrency, making it especially strong in single‑table analysis and colocate joins.
At Suning Big Data, ClickHouse is widely used for high‑cardinality data analysis, precise deduplication, interactive analysis, wide‑table queries, time‑series storage, and materialized view aggregation, supporting business scenarios such as new/old buyer identification, repurchase, retention, real‑time user profiling, and audience segmentation.
Suning Full‑Chain Monitoring Platform
The platform provides end‑to‑end visibility from the front‑end request to the OLAP engine, generating unique identifiers (serialId, traceId, olapId) for each HTTP request, metric‑service call, and OLAP query, forming a tree‑shaped request chain.
Key design steps include:
Back‑end report system creates a unique traceId and hierarchical spanId for each request and passes them to the metric service.
The metric service calculates a query priority, generates a new spanId, and forwards traceId and spanId to the OLAP layer.
Spark RDD links queryId from Druid, PostgreSQL, and ClickHouse with Spark stage and job IDs.
OLAP generates further spanIds and propagates traceId downstream.
Each OLAP engine maps traceId to its internal execution path for detailed timing analysis.
Integrating ClickHouse into Full‑Chain Monitoring
ClickHouse monitoring covers node, shard, parent/child query relationships, query latency, memory usage, CPU usage, row counts, MergeTree status, and thread participation.
To propagate traceId to ClickHouse, the system customizes the query_id field, mapping a user‑defined query_id to the internally generated one, enabling end‑to‑end tracing.
ClickHouse-client --port 1***5 --time --format=Null --query="select count() from aggr_member" --query_id="suning20200706"
echo 'select count() from aggr***member' | curl 'http://localhost:8**3/?query_id=suning2020&query=' --data-binary @-ClickHouse Slow‑Query Monitoring
Real‑time slow‑query monitoring visualizes per‑node status, including query hierarchy, execution scripts, latency, rows read, memory consumption, peak memory, thread count, user, profile events, and settings.
Historical slow‑query monitoring relies on the system.query_log table, which must be enabled by setting <log_queries>1</log_queries> in users.xml. The table records query success, error, and pre‑execution events, storing detailed node‑level metrics for each query.
MergeTree Monitoring
MergeTree tables store data in parts; background merges combine small parts while balancing merge speed and part count. Monitoring focuses on MergeTree exceptions and generates alerts when anomalies occur.
Slow‑Query Attribution
Root causes of slow queries include missing page cache, high‑cardinality aggregation or sorting, suboptimal join order, excessive CPU/memory usage, and lack of resource isolation. Optimizations such as hash‑sharding member IDs with RoaringBitmap and using ClickHouse‑CTE WITH clauses can dramatically improve performance.
Cluster and Node State Monitoring
Metrics include query success/failure counts, node connection numbers, CPU and memory usage, file handles, partition usage, and replica delay. Alerts are triggered via HTTP API checks that return 200 OK for healthy nodes.
Chproxy, the ClickHouse HTTP proxy and load balancer, is managed by ClickHouse Manager, which handles lifecycle operations and configuration synchronization via Zookeeper. Chproxy monitors request queue size, client connections, total queries, cancellations, rejections, cache hits, and health status, all integrated into Suning’s full‑chain monitoring platform.
Advantages and Outlook
The full‑chain monitoring platform extends monitoring depth and breadth, providing end‑to‑end visibility across data application, SparkSQL parsing, OLAP routing, and acceleration layers. It unifies query latency across multiple OLAP engines via a common queryId, enabling proactive alerts and streamlined troubleshooting. Future work includes integrating additional OLAP engines such as Doris, Dremio, and RocksDB, and further refining intra‑engine execution timing analysis.
Suning Technology
Official Suning Technology account. Explains cutting-edge retail technology and shares Suning's tech practices.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
