Why ClickHouse Beats Elasticsearch for Microservice Governance Data at Scale

The article examines the data‑storage problems caused by rapid microservice growth, explains why traditional Hadoop/Spark stacks were rejected, presents benchmark comparisons that show ClickHouse’s superior performance and compression, and details practical ClickHouse deployment, schema design, sharding, TTL, indexing, and monitoring integrations for real‑time analytics.

dbaplus Community

Jan 27, 2022

1. Data Storage Challenges in Microservice Governance

As the microservice architecture at Haodf expands, service‑to‑service calls grow more complex and raise a series of questions: how to analyze unreasonable dependencies offline, how to locate issues quickly through trace analysis, how to control storage costs for billions of log entries, how to extract valuable metrics from massive data, how to alert in real time, which storage engine to choose for long‑term metrics, and how to build friendly visualizations.

Three core data‑analysis pillars are identified: Metrics, Tracing, and Logging. Tracing provides service topology and performance insights (APM). Logging includes both tracing logs and runtime logs (e.g., Nginx, Tomcat, custom logs). Metrics are derived mainly from tracing and logging data.

Initially, Haodf relied on an ELK stack for log storage and monitoring, but growing data volume caused Elasticsearch storage pressure, disk I/O bottlenecks, and increased real‑time query load. Offline Spark jobs became inflexible and costly, and the emerging cloud‑native environment further strained the legacy setup.

2. The Crossroads: Rejecting Hadoop/Spark

Cost, operational overhead, and team skill gaps led the team to reject Hadoop/Spark. Hadoop‑based stacks are heavyweight: HBase and Hive require many physical machines, which drives up hardware, O&M, and staffing costs. Moreover, the system‑architecture team prefers Go while the big‑data ecosystem is Java‑centric, a mismatch that, per Conway’s Law, would create communication barriers.

3. ClickHouse Emerges as the Preferred OLAP Engine

Benchmarking a table with 133 columns on 10 million, 100 million, and 1 billion rows across 43 SQL queries showed ClickHouse outperforming Vertica (2.63×), InfiniDB (17×), MonetDB (27×), Hive (126×), MySQL (429×), and Greenplum (10×) at the 100 million row scale.

Key advantages of ClickHouse include:

Complete DBMS functionality (DDL/DML, permissions, backup, distributed management).

Columnar storage with high compression (up to 8:1 in Yandex.Metrica production).

Vectorized execution using SIMD, multi‑threaded query processing, and a multi‑master architecture in which every node can serve reads and writes.

Rich table‑engine ecosystem (MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, GraphiteMergeTree, etc.).

Native sharding and distributed tables for horizontal scaling.

A sparse primary index plus secondary (data‑skipping) indexes, which let queries skip blocks of rows that cannot match a filter.

4. Practical ClickHouse Deployment in Haodf

Engine Selection : Distributed + MergeTree is used for most logs; GraphiteMergeTree handles time‑series metrics; ReplicatedMergeTree provides high availability.

Table Naming Convention : Local tables prefixed with local_ (e.g., local_metrics), distributed tables use the logical name (e.g., metrics).

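TTL Management : Tables include a Timestamp column and set a table‑level TTL (e.g., TTL Timestamp + toIntervalDay(30)) so that expired data is purged automatically. Expired rows are removed during background merges, not at the exact moment of expiry; the MergeTree setting merge_with_ttl_timeout (86400 s by default in this deployment; the default has changed across ClickHouse releases) sets the minimum delay between TTL merges.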
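
As a rough illustration of the TTL rule, the Python sketch below computes when a row becomes eligible for deletion under TTL Timestamp + toIntervalDay(30). The helper names are hypothetical, and the treatment of the exact boundary instant is an assumption. ClickHouse evaluates the expression itself during background merges, so physical deletion can lag behind eligibility.

```python
from datetime import datetime, timedelta

TTL_DAYS = 30  # mirrors toIntervalDay(30) in the table definition


def expires_at(row_timestamp: datetime, ttl_days: int = TTL_DAYS) -> datetime:
    """Moment from which the row is eligible for TTL deletion."""
    return row_timestamp + timedelta(days=ttl_days)


def is_expired(row_timestamp: datetime, now: datetime, ttl_days: int = TTL_DAYS) -> bool:
    """True once the TTL expression is at or before `now` (boundary is an assumption)."""
    return expires_at(row_timestamp, ttl_days) <= now


if __name__ == "__main__":
    written = datetime(2022, 1, 1)
    print(expires_at(written))                        # eligibility moment
    print(is_expired(written, datetime(2022, 2, 1)))  # checked one month later
```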

<log_nshards_0replicas>
    <shard>
        <replica>
            <host>clickhouse1</host>
            <port>9000</port>
            <user>user</user>
            <password>password</password>
        </replica>
    </shard>
    <shard>
        <replica>
            <host>clickhouse2</host>
            <port>9000</port>
            <user>user</user>
            <password>password</password>
        </replica>
    </shard>
</log_nshards_0replicas>

Partitioning & Sharding : Only MergeTree‑family engines support PARTITION BY. Time‑based partitions (daily or monthly) align with log TTL and let queries with a time filter skip irrelevant partitions. Sharding is implemented via the Distributed engine, which stores no data itself and routes inserts and queries to the local tables on each shard.
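
To make the routing step concrete, here is a minimal Python sketch of how a Distributed table can map a sharding‑key value to a shard: the value is taken modulo the total shard weight, and each shard owns a contiguous range of remainders sized by its weight. The function name is hypothetical, and the equal weights in the usage lines assume a two‑shard cluster with no explicit weight set.

```python
import random
from typing import List


def pick_shard(sharding_value: int, weights: List[int]) -> int:
    """Return the index of the shard that owns `sharding_value`.

    The remainder of sharding_value / total weight is mapped onto
    contiguous ranges, one per shard, sized by that shard's weight.
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive integers")
    remainder = sharding_value % sum(weights)
    upper = 0
    for index, weight in enumerate(weights):
        upper += weight
        if remainder < upper:
            return index
    raise AssertionError("unreachable: remainder is always below the total weight")


if __name__ == "__main__":
    # Emulates a rand() sharding key spread over two equally weighted shards.
    counts = [0, 0]
    for _ in range(10_000):
        counts[pick_shard(random.getrandbits(32), [1, 1])] += 1
    print(counts)
```

A random sharding key spreads rows evenly but gives no locality. Keying on a column such as a trace identifier would keep related rows on one shard, at the risk of skew.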

Indexing : Primary key on timestamp for ordered scans; sparse index granularity set to 8192 rows; secondary (skip) indexes support fast filtering on high‑cardinality columns.
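
The sparse index can be pictured as one entry (a "mark") per granule of index_granularity rows. The following Python sketch, with hypothetical helper names and integer keys standing in for timestamps, shows how many marks a table needs and how a range filter narrows the scan to a few granules. It is a simplified model, not ClickHouse's actual implementation.

```python
import bisect
from typing import List, Tuple

INDEX_GRANULARITY = 8192  # rows per granule, as in the table settings


def mark_count(total_rows: int, granularity: int = INDEX_GRANULARITY) -> int:
    """Number of index entries: one per granule, rounded up."""
    return (total_rows + granularity - 1) // granularity


def build_marks(sorted_keys: List[int], granularity: int) -> List[int]:
    """Keep only the key of the first row of every granule."""
    return sorted_keys[::granularity]


def granules_to_read(marks: List[int], lo: int, hi: int) -> Tuple[int, int]:
    """Half-open range [first, last) of granules that may hold keys in [lo, hi]."""
    # The granule before the first mark >= lo may still end with matching keys.
    first = max(bisect.bisect_left(marks, lo) - 1, 0)
    # Every granule whose first key is <= hi may contain matching keys.
    last = bisect.bisect_right(marks, hi)
    return first, last


if __name__ == "__main__":
    print(mark_count(1_000_000_000))          # marks needed for one billion rows
    marks = build_marks(list(range(100)), 10)
    print(granules_to_read(marks, 25, 47))    # granules to scan for keys 25..47
```

Because only the first key of each granule is stored, the index stays small enough to keep in memory. The cost is that a lookup always reads at least one whole granule.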

Configuration Parameters (examples):

max_concurrent_queries = 500 (raised from the default of 100 to accommodate heavy aggregation workloads).

Compression: default LZ4, achieving ~1/10 the storage size of Elasticsearch.

User management: the built‑in default user ships with an empty password; it is replaced by users whose passwords are stored as password_double_sha1_hex hashes.

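Quota limits: execution_time = 5 caps the total query execution time, in seconds, that the quota allows per interval (3600 s in the example below); a per‑query timeout is set separately with max_execution_time, and memory can be bounded with max_memory_usage.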

<web>
    <interval>
        <duration>3600</duration>
        <queries>500</queries>
        <errors>100</errors>
        <result_rows>100</result_rows>
        <read_rows>2000</read_rows>
        <execution_time>5</execution_time>
    </interval>
</web>
<web_user>
    <password_double_sha1_hex>***</password_double_sha1_hex>
    <networks incl="networks" replace="replace">
        <ip>::/0</ip>
    </networks>
    <profile>default</profile>
    <quota>web</quota>
    <access_management>1</access_management>
</web_user>
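
The password_double_sha1_hex value is the hexadecimal form of SHA‑1 applied twice to the password: once to the password bytes, then again to the raw digest. A minimal Python sketch for generating it, assuming UTF‑8 encoded passwords (the function name and sample password are hypothetical):

```python
import hashlib


def double_sha1_hex(password: str) -> str:
    """Hash the password with SHA-1, then hash the raw 20-byte digest again."""
    first = hashlib.sha1(password.encode("utf-8")).digest()
    return hashlib.sha1(first).hexdigest()


if __name__ == "__main__":
    # Paste the printed value into <password_double_sha1_hex>.
    print(double_sha1_hex("change-me"))
```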

Visualization : Grafana with the ClickHouse data source visualizes real‑time dashboards, APM trace analysis, and alert metrics.

4.1 Example Table Definitions

CREATE TABLE {database}.{local_table} ON CLUSTER {cluster}
(
    `date` Date,
    `timestamp` DateTime,
    ...
)
ENGINE = MergeTree
PARTITION BY date
ORDER BY timestamp
TTL timestamp + toIntervalDay(30)
SETTINGS index_granularity = 8192;

-- The Distributed table must point at the local table on each shard, not at itself.
CREATE TABLE {database}.{table} ON CLUSTER {cluster} AS {database}.{local_table}
ENGINE = Distributed({cluster}, {database}, {local_table}, rand());

4.2 APM Trace Query Example

SELECT (intDiv(toUInt32(timestamp), 1) * 1000) AS t,
       quantile(0.99)(consume) AS p99
FROM apm.trace_logs
WHERE timestamp >= toDateTime(1605147889)
  AND app_name = 'app'
GROUP BY t
ORDER BY t;
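
To show what this query computes, the Python sketch below groups (timestamp, consume) pairs into one‑second buckets, expressed in milliseconds as Grafana expects, and takes the 99th percentile of each bucket. It uses an exact nearest‑rank percentile, which is an assumption made for clarity; ClickHouse's quantile function is approximate, so results can differ slightly. The function name and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def p99_per_second(rows: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Group (unix_seconds, consume) rows by second; return {bucket_ms: p99}."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, consume in rows:
        # Mirrors intDiv(toUInt32(timestamp), 1) * 1000 in the SQL above.
        buckets[(ts // 1) * 1000].append(consume)

    result: Dict[int, float] = {}
    for t in sorted(buckets):
        values = sorted(buckets[t])
        # Nearest-rank percentile: ceil(0.99 * n), computed with integers.
        rank = (99 * len(values) + 99) // 100
        result[t] = values[rank - 1]
    return result


if __name__ == "__main__":
    sample = [(1605147889, float(v)) for v in range(1, 101)]
    sample.append((1605147890, 5.0))
    print(p99_per_second(sample))
```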

4.3 Materialized View for Middleware Logs

CREATE MATERIALIZED VIEW apm.local_middleware_logs (
    `date` Date,
    `timestamp` DateTime,
    `trace_id` String,
    `event` String,
    ...
) ENGINE = MergeTree
PARTITION BY date
ORDER BY timestamp
TTL timestamp + toIntervalDay(6)
AS SELECT date, timestamp, trace_id, event, ...
FROM apm.local_biz_logs
WHERE event IN ('mysql','rabbitmq');

-- The Distributed table reads from the local materialized view on each shard.
CREATE TABLE apm.middleware_logs ON CLUSTER {cluster} AS apm.local_middleware_logs
ENGINE = Distributed({cluster}, apm, local_middleware_logs, rand());

5. Use Cases and Outcomes

Full‑Site Availability Dashboard : Real‑time queries over logs from Kong, tracing, custom business logs, and front‑end error logs. Example schema:

CREATE TABLE default.local_tests ON CLUSTER nshards_1replicas (
    `id` Int64,
    `diseaseid` Int64,
    `ctime` DateTime,
    ...
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/local_tests', '{replica}')
PARTITION BY toYYYYMMDD(ctime)
ORDER BY (ctime, diseaseid)
TTL ctime + toIntervalDay(7)
SETTINGS index_granularity = 8192;

Distributed table default.tests provides a sharded view for query routing.

APM Analytics : Tracing logs stored in Distributed + MergeTree enable fast P99 latency calculations and service‑dependency graphs using ClickHouse’s rich function set and sampling capabilities.

Alert Metric Storage : Metrics are written to a GraphiteMergeTree table, which lets ClickHouse serve as a Prometheus remote‑write target for long‑term retention. Example metric record:

date: 2021-11-25
name: app_request_qpm
tags: ['app_name=demo','ttl=forever']
val: 336608
ts: 2021-11-26 23:00:00
updated: 2021-11-26 23:00:22

Querying recent 5‑minute traffic:

SELECT sum(val) AS total
FROM metrics
WHERE name = 'app_request_qpm'
  AND has(tags, 'app_name=demo')
  AND date = today() AND ts > now() - 300;

6. Outlook

Automate ClickHouse cluster management to reduce manual O&M and improve failover.

Expand ClickHouse usage to more business domains and integrate with PaaS platforms.

Deepen understanding of ClickHouse internals to move away from black‑box operations.

Apply chaos engineering to validate stability of the ClickHouse middleware.

7. Conclusion

In the era of big data, OLAP databases like ClickHouse become essential for microservice governance, offering real‑time, multi‑dimensional analytics with low storage cost and high query performance. The case study demonstrates how ClickHouse replaced Elasticsearch and Spark, delivering sub‑second query latency on billions of rows while supporting diverse monitoring, APM, and alerting scenarios.
