Big Data 14 min read

How ClickHouse Powers Real-Time OLAP Monitoring at Suning Big Data Platform

This article explains how Suning's big‑data center leverages ClickHouse’s columnar OLAP engine and a full‑chain monitoring platform to achieve real‑time query tracing, slow‑query analysis, cluster health checks, and resource‑level alerts across diverse business scenarios.

Suning Technology

Sep 4, 2020

How ClickHouse Powers Real-Time OLAP Monitoring at Suning Big Data Platform

Introduction

ClickHouse is a high‑performance OLAP analysis engine known for its columnar storage, LSM‑Tree engine, vectorized execution, asynchronous merge and mutation mechanisms, and MPP+SMP concurrency, making it especially strong in single‑table analysis and colocate joins.

At Suning Big Data, ClickHouse is widely used for high‑cardinality data analysis, precise deduplication, interactive analysis, wide‑table queries, time‑series storage, and materialized view aggregation, supporting business scenarios such as new/old buyer identification, repurchase, retention, real‑time user profiling, and audience segmentation.

Suning Full‑Chain Monitoring Platform

The platform provides end‑to‑end visibility from the front‑end request to the OLAP engine, generating unique identifiers (serialId, traceId, olapId) for each HTTP request, metric‑service call, and OLAP query, forming a tree‑shaped request chain.

Key design steps include:

Back‑end report system creates a unique traceId and hierarchical spanId for each request and passes them to the metric service.

The metric service calculates a query priority, generates a new spanId, and forwards traceId and spanId to the OLAP layer.

Spark RDD links queryId from Druid, PostgreSQL, and ClickHouse with Spark stage and job IDs.

OLAP generates further spanIds and propagates traceId downstream.

Each OLAP engine maps traceId to its internal execution path for detailed timing analysis.

Integrating ClickHouse into Full‑Chain Monitoring

ClickHouse monitoring covers node, shard, parent/child query relationships, query latency, memory usage, CPU usage, row counts, MergeTree status, and thread participation.

To propagate traceId to ClickHouse, the system customizes the query_id field, mapping a user‑defined query_id to the internally generated one, enabling end‑to‑end tracing.

ClickHouse-client --port 1***5 --time --format=Null --query="select count() from aggr_member" --query_id="suning20200706"
echo 'select count() from aggr***member' | curl 'http://localhost:8**3/?query_id=suning2020&query=' --data-binary @-

ClickHouse Slow‑Query Monitoring

Real‑time slow‑query monitoring visualizes per‑node status, including query hierarchy, execution scripts, latency, rows read, memory consumption, peak memory, thread count, user, profile events, and settings.

Historical slow‑query monitoring relies on the system.query_log table, which must be enabled by setting <log_queries>1</log_queries> in users.xml. The table records query success, error, and pre‑execution events, storing detailed node‑level metrics for each query.

MergeTree Monitoring

MergeTree tables store data in parts; background merges combine small parts while balancing merge speed and part count. Monitoring focuses on MergeTree exceptions and generates alerts when anomalies occur.

Slow‑Query Attribution

Root causes of slow queries include missing page cache, high‑cardinality aggregation or sorting, suboptimal join order, excessive CPU/memory usage, and lack of resource isolation. Optimizations such as hash‑sharding member IDs with RoaringBitmap and using ClickHouse‑CTE WITH clauses can dramatically improve performance.

Cluster and Node State Monitoring

Metrics include query success/failure counts, node connection numbers, CPU and memory usage, file handles, partition usage, and replica delay. Alerts are triggered via HTTP API checks that return 200 OK for healthy nodes.

Chproxy, the ClickHouse HTTP proxy and load balancer, is managed by ClickHouse Manager, which handles lifecycle operations and configuration synchronization via Zookeeper. Chproxy monitors request queue size, client connections, total queries, cancellations, rejections, cache hits, and health status, all integrated into Suning’s full‑chain monitoring platform.

Advantages and Outlook

The full‑chain monitoring platform extends monitoring depth and breadth, providing end‑to‑end visibility across data application, SparkSQL parsing, OLAP routing, and acceleration layers. It unifies query latency across multiple OLAP engines via a common queryId, enabling proactive alerts and streamlined troubleshooting. Future work includes integrating additional OLAP engines such as Doris, Dremio, and RocksDB, and further refining intra‑engine execution timing analysis.

Cluster, node, shard connection monitoring

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

monitoring Performance ClickHouse cluster OLAP

Written by

Suning Technology

Official Suning Technology account. Explains cutting-edge retail technology and shares Suning's tech practices.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.