Using ClickHouse for Real‑Time Log Analytics and Data Storage in Microservice Governance at Haodf
The article describes how Haodf's SRE team replaced Elasticsearch with ClickHouse to handle massive microservice logs, achieve low‑latency queries, reduce storage costs, and support real‑time monitoring, tracing, and metric analysis through columnar OLAP features, sharding, TTL, and materialized views.
Haodf's SRE team faced performance bottlenecks as microservice governance generated massive logs that Elasticsearch could no longer query in real time. The growing volume of tracing, logging, and metric data required a storage solution that could support high‑throughput writes, low‑latency reads, and efficient compression.
After evaluating Hadoop/Spark, HBase, and other big‑data stacks, the team chose ClickHouse because it offers columnar storage, vectorized execution, multi‑node parallelism, and a rich function library, making it ideal for OLAP workloads on structured log data.
Key ClickHouse Advantages
Complete DBMS features: DDL/DML, permissions, backup, distributed management.
Columnar storage with high compression (up to 8:1 in production).
Vectorized engine and SIMD for high CPU utilization.
Multiple table engines (MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, GraphiteMergeTree, etc.) and support for replicated and distributed tables.
Multi‑master architecture eliminates single‑point failures.
ClickHouse also has limitations: no transactions, a poor fit for high‑concurrency OLTP workloads, and weak support for key‑value access patterns that need frequent point reads and updates.
Typical Use Cases at Haodf
Real‑time service availability dashboards (KongLog, TraceLog, LocalLog, etc.).
APM analysis of service latency, error rates, and request paths.
Metric storage for Prometheus remote write, using GraphiteMergeTree.
Long‑term archival of important metrics with no TTL expiry (kept indefinitely), relying on roll‑up policies to control storage growth.
All scenarios require high write throughput (billions of rows per day), low‑latency queries (sub‑second), and efficient storage.
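As a hedged sketch of the Prometheus metric path mentioned above, a GraphiteMergeTree table generally looks like the following. The `Path`/`Time`/`Value`/`Version` columns are the schema the engine expects; the database and table names here are assumptions, not taken from the article:

```sql
-- Sketch only: database/table names are illustrative.
-- GraphiteMergeTree requires a <graphite_rollup> section in the server
-- config that defines retention/precision rules for aging samples.
CREATE TABLE metrics.local_graphite ON CLUSTER {cluster}
(
    `Path` String,        -- metric name, e.g. service.latency.p99
    `Time` DateTime,      -- sample time
    `Value` Float64,      -- sample value
    `Version` UInt32      -- highest version wins during merges
)
ENGINE = GraphiteMergeTree('graphite_rollup')
PARTITION BY toYYYYMM(Time)
ORDER BY (Path, Time);
```

The roll‑up rules (precision decreasing with age) live in the server config under the named `graphite_rollup` section, which is how old samples are thinned without a TTL delete.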
Schema Design and Table Creation
CREATE TABLE {database}.{local_table} ON CLUSTER {cluster}
(
`date` Date,
`timestamp` DateTime,
...
)
ENGINE = MergeTree
PARTITION BY date
ORDER BY timestamp
TTL timestamp + toIntervalDay(30)
SETTINGS index_granularity = 8192;
CREATE TABLE {database}.{table} ON CLUSTER {cluster}
AS {database}.{local_table}
ENGINE = Distributed({cluster}, {database}, {local_table}, rand());
For tracing logs, a materialized view extracts level‑1 traces to reduce scan size:
CREATE MATERIALIZED VIEW apm.local_entrances
(
`date` Date,
`timestamp` DateTime,
`trace_id` String,
...
)
ENGINE = MergeTree
PARTITION BY date
ORDER BY timestamp
TTL timestamp + toIntervalDay(6)
AS
SELECT date, timestamp, trace_id, ...
FROM apm.local_trace_logs
WHERE level = '1';
Distributed tables expose the data to queries, while local tables store the actual data.
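To illustrate how this pays off, a dashboard‑style query over the materialized view might look like the following. The distributed table name `apm.entrances` is an assumption (a Distributed table over `apm.local_entrances`, per the pattern above):

```sql
-- Count level-1 (entrance) traces per minute over the last hour.
SELECT
    toStartOfMinute(timestamp) AS minute,
    count() AS entrance_traces
FROM apm.entrances
WHERE date = today()                       -- prunes partitions (PARTITION BY date)
  AND timestamp >= now() - INTERVAL 1 HOUR -- narrows the scan via ORDER BY timestamp
GROUP BY minute
ORDER BY minute;
```

Filtering on the partition key skips whole partitions, and the `timestamp` predicate exploits the sort order, so the view keeps dashboard scans far smaller than reading the full trace log.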
Sharding and Replication Strategy
The production cluster uses a zero‑replica sharding mode for cost efficiency, with the option to add replicas for high availability. Shard configuration is stored in /etc/metrika.xml, and ZooKeeper macros are used for dynamic placement.
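A per‑node macros block of the kind the article alludes to might look like this (the macro names and values are illustrative; Haodf's actual macro layout is not given):

```xml
<!-- /etc/metrika.xml on node clickhouse1 (illustrative values) -->
<macros>
    <shard>01</shard>
    <replica>clickhouse1</replica>
</macros>
```

Table DDL can then reference `{shard}` and `{replica}` so the same `CREATE TABLE ... ON CLUSTER` statement places data correctly on every node.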
<log_nshards_0replicas>
<shard>
<replica>
<host>clickhouse1</host>
<port>9000</port>
<user>user</user>
<password>password</password>
</replica>
</shard>
<shard>
<replica>
<host>clickhouse2</host>
<port>9000</port>
<user>user</user>
<password>password</password>
</replica>
</shard>
</log_nshards_0replicas>
Performance Benchmarks
In single‑node benchmarks on a 100 million‑row dataset, ClickHouse outperformed Vertica (2.63×), Greenplum (10×), InfiniDB (17×), MonetDB (27×), Hive (126×), and MySQL (429×), confirming its suitability for Haodf's scale.
Operational Settings
max_concurrent_queries set to 500 to handle high QPS.
lz4 compression shrinks on‑disk storage to roughly 10 % of the equivalent Elasticsearch footprint.
User management with hashed passwords and role‑based access.
A query execution‑time limit (execution_time = 5 s) and per‑query memory limits to prevent overload.
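As a hedged sketch, the settings above map onto ClickHouse's standard config files roughly as follows. The limit values come from the article except the memory cap, which is illustrative; the file layout is the stock one and may differ in Haodf's deployment:

```xml
<!-- config.xml: server-wide cap on simultaneously executing queries -->
<max_concurrent_queries>500</max_concurrent_queries>

<!-- users.xml: per-profile limits applied to each query -->
<profiles>
    <default>
        <max_execution_time>5</max_execution_time>       <!-- seconds -->
        <max_memory_usage>10000000000</max_memory_usage> <!-- bytes; illustrative cap -->
    </default>
</profiles>
```

Server‑wide concurrency caps go in config.xml, while per‑query limits such as execution time and memory belong to user profiles, so different roles can get different budgets.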
Future Work
Plans include building a ClickHouse management platform, expanding usage to more business units, improving observability of the ClickHouse cluster, and applying chaos engineering for reliability testing.
Overall, ClickHouse proved to be a high‑performance, cost‑effective solution for real‑time analytics and long‑term storage of microservice governance data at Haodf.
HaoDF Tech Team
HaoDF Online tech practice and sharing—join us to discuss and help create quality healthcare through technology.