Using ClickHouse for Real‑Time Log Analytics and Data Storage in Microservice Governance at Haodf
The article describes how Haodf's SRE team replaced Elasticsearch with ClickHouse to handle massive microservice logs, achieve low‑latency queries, reduce storage costs, and support real‑time monitoring, tracing, and metric analysis through columnar OLAP features, sharding, TTL, and materialized views.
Haodf's SRE team faced performance bottlenecks as microservice governance generated massive logs that Elasticsearch could no longer query in real time. The growing volume of tracing, logging, and metric data required a storage solution that could support high‑throughput writes, low‑latency reads, and efficient compression.
After evaluating Hadoop/Spark, HBase, and other big‑data stacks, the team chose ClickHouse because it offers columnar storage, vectorized execution, multi‑node parallelism, and a rich function library, making it ideal for OLAP workloads on structured log data.
Key ClickHouse Advantages
Complete DBMS features: DDL/DML, permissions, backup, distributed management.
Columnar storage with high compression (up to 8:1 in production).
Vectorized engine and SIMD for high CPU utilization.
Multiple table engines (MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, GraphiteMergeTree, etc.) and support for replicated and distributed tables.
Multi‑master architecture eliminates single‑point failures.
ClickHouse also has limitations: no transactions, a poor fit for high‑concurrency OLTP workloads, and weak support for key‑value access patterns that need frequent point reads and updates.
Typical Use Cases at Haodf
Real‑time service availability dashboards (KongLog, TraceLog, LocalLog, etc.).
APM analysis of service latency, error rates, and request paths.
Metric storage for Prometheus remote write, using GraphiteMergeTree.
Long‑term archival of important metrics with no TTL expiry (kept indefinitely), relying on roll‑up policies to control storage growth.
All scenarios require high write throughput (billions of rows per day), low‑latency queries (sub‑second), and efficient storage.
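As a hedged sketch of the Prometheus metric path mentioned above, a GraphiteMergeTree table generally looks like the following. The `Path`/`Time`/`Value`/`Version` columns are the schema the engine expects; the database and table names here are assumptions, not taken from the article:

```sql
-- Sketch only: database/table names are illustrative.
-- GraphiteMergeTree requires a <graphite_rollup> section in the server
-- config that defines retention/precision rules for aging samples.
CREATE TABLE metrics.local_graphite ON CLUSTER {cluster}
(
    `Path` String,        -- metric name, e.g. service.latency.p99
    `Time` DateTime,      -- sample time
    `Value` Float64,      -- sample value
    `Version` UInt32      -- highest version wins during merges
)
ENGINE = GraphiteMergeTree('graphite_rollup')
PARTITION BY toYYYYMM(Time)
ORDER BY (Path, Time);
```

The roll‑up rules (precision decreasing with age) live in the server config under the named `graphite_rollup` section, which is how old samples are thinned without a TTL delete.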
Schema Design and Table Creation
CREATE TABLE {database}.{local_table} ON CLUSTER {cluster}
(
`date` Date,
`timestamp` DateTime,
...
)
ENGINE = MergeTree
PARTITION BY date
ORDER BY timestamp
TTL timestamp + toIntervalDay(30)
SETTINGS index_granularity = 8192;
CREATE TABLE {database}.{table} ON CLUSTER {cluster}
AS {database}.{local_table}
ENGINE = Distributed({cluster}, {database}, {local_table}, rand());
For tracing logs, a materialized view extracts level‑1 traces to reduce scan size:
CREATE MATERIALIZED VIEW apm.local_entrances
(
`date` Date,
`timestamp` DateTime,
`trace_id` String,
...
)
ENGINE = MergeTree
PARTITION BY date
ORDER BY timestamp
TTL timestamp + toIntervalDay(6)
AS
SELECT date, timestamp, trace_id, ...
FROM apm.local_trace_logs
WHERE level = '1';
Distributed tables expose the data to queries, while local tables store the actual data.
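To illustrate how this pays off, a dashboard‑style query over the materialized view might look like the following. The distributed table name `apm.entrances` is an assumption (a Distributed table over `apm.local_entrances`, per the pattern above):

```sql
-- Count level-1 (entrance) traces per minute over the last hour.
SELECT
    toStartOfMinute(timestamp) AS minute,
    count() AS entrance_traces
FROM apm.entrances
WHERE date = today()                       -- prunes partitions (PARTITION BY date)
  AND timestamp >= now() - INTERVAL 1 HOUR -- narrows the scan via ORDER BY timestamp
GROUP BY minute
ORDER BY minute;
```

Filtering on the partition key skips whole partitions, and the `timestamp` predicate exploits the sort order, so the view keeps dashboard scans far smaller than reading the full trace log.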
Sharding and Replication Strategy
The production cluster uses a zero‑replica sharding mode for cost efficiency, with the option to add replicas for high availability. Shard configuration is stored in /etc/metrika.xml, and ZooKeeper macros are used for dynamic placement.
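A per‑node macros block of the kind the article alludes to might look like this (the macro names and values are illustrative; Haodf's actual macro layout is not given):

```xml
<!-- /etc/metrika.xml on node clickhouse1 (illustrative values) -->
<macros>
    <shard>01</shard>
    <replica>clickhouse1</replica>
</macros>
```

Table DDL can then reference `{shard}` and `{replica}` so the same `CREATE TABLE ... ON CLUSTER` statement places data correctly on every node.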
<log_nshards_0replicas>
<shard>
<replica>
<host>clickhouse1</host>
<port>9000</port>
<user>user</user>
<password>password</password>
</replica>
</shard>
<shard>
<replica>
<host>clickhouse2</host>
<port>9000</port>
<user>user</user>
<password>password</password>
</replica>
</shard>
</log_nshards_0replicas>
Performance Benchmarks
In single‑node benchmarks on a 100 million‑row dataset, ClickHouse outperformed Vertica (2.63×), Greenplum (10×), InfiniDB (17×), MonetDB (27×), Hive (126×), and MySQL (429×), confirming its suitability for Haodf's scale.
Operational Settings
max_concurrent_queries set to 500 to handle high QPS.
lz4 compression shrinks on‑disk storage to roughly 10 % of the equivalent Elasticsearch footprint.
User management with hashed passwords and role‑based access.
A query execution‑time limit (execution_time = 5 s) and per‑query memory limits to prevent overload.
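As a hedged sketch, the settings above map onto ClickHouse's standard config files roughly as follows. The limit values come from the article except the memory cap, which is illustrative; the file layout is the stock one and may differ in Haodf's deployment:

```xml
<!-- config.xml: server-wide cap on simultaneously executing queries -->
<max_concurrent_queries>500</max_concurrent_queries>

<!-- users.xml: per-profile limits applied to each query -->
<profiles>
    <default>
        <max_execution_time>5</max_execution_time>       <!-- seconds -->
        <max_memory_usage>10000000000</max_memory_usage> <!-- bytes; illustrative cap -->
    </default>
</profiles>
```

Server‑wide concurrency caps go in config.xml, while per‑query limits such as execution time and memory belong to user profiles, so different roles can get different budgets.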
Future Work
Plans include building a ClickHouse management platform, expanding usage to more business units, improving observability of the ClickHouse cluster, and applying chaos engineering for reliability testing.
Overall, ClickHouse proved to be a high‑performance, cost‑effective solution for real‑time analytics and long‑term storage of microservice governance data at Haodf.
HaoDF Tech Team
HaoDF Online tech practice and sharing—join us to discuss and help create quality healthcare through technology.