
How We Built a Scalable Cloud‑Native Log Center with ClickHouse

This article details a courier company's evolution from a simple file‑based logging system to a cloud‑native log center, describing the limitations of the original architecture, the migration to an ELK stack, subsequent challenges, and the final redesign using ClickHouse for high compression, low cost, and improved query performance.


Background

After thirteen years of development, the courier company now serves over 250 million registered C‑end users, more than 1.3 million professional couriers and site operators, and over 2.5 million B‑end customers. Daily query volume exceeds 400 million calls and daily order volume exceeds 300 thousand. Data at this scale and complexity requires a real‑time, scalable log center with strong search and analysis capabilities to support performance monitoring, business insights, and issue troubleshooting.

Initial Architecture

1. Original Architecture

The early log system relied on raw log files stored on individual machines, leading to several problems:

Inconvenient access: each log file required logging into different machines.

High I/O pressure: running tail or cat against large files generated heavy disk I/O and could trigger failures on production machines.

Large files caused slow queries, disk alerts, and space exhaustion.

The unstructured format made logs nearly impossible to read or analyze.

Multi‑node NFS mounts performed poorly and risked log loss.

2. ELK Stack

In 2017 the team adopted an Elasticsearch‑based ELK stack, storing logs as JSON, gaining full‑text search, and using Kibana for visualization. This addressed many of the earlier issues, providing faster search, better aggregation, and a friendlier UI.

Challenges with ELK

After a period of use, new problems emerged as log‑driven analysis and alerting grew:

Cost: Elasticsearch's low compression ratio made six‑month retention require enormous storage.

Throughput bottleneck: write‑time tokenization and indexing in Elasticsearch limited ingestion speed.

High memory consumption: the JVM heap and in‑memory index structures demand large amounts of RAM.

Lifecycle maintenance: older ES versions lacked TTL, requiring manual data expiration.

Limited aggregation capabilities for advanced analytics.

In 2020 the team evaluated ClickHouse against Elasticsearch and chose ClickHouse for the next‑generation log storage due to its superior compression and query performance.

New Architecture

The new design replaces Elasticsearch with ClickHouse while keeping a similar pipeline:

Collection layer: Logstash is replaced by ilogtail, offering higher performance and lower resource usage.

Processing layer: ilogtail adds data desensitization and multiline splitting.

Storage layer: ClickHouse replaces Elasticsearch.

Visualization: Kibana is replaced by ClickVisual, supplemented by Grafana for dashboards.

ClickVisual advantages: flexible SQL queries, log auditing, and alert policies.

Kibana advantages: built‑in BI functions for log analysis.

Results of the New Architecture

In tests with one billion log entries, ClickHouse's higher compression ratio yielded dramatically lower disk usage, allowing six‑month retention where Elasticsearch could only afford one month. Optimizations include:

Most fields use ZSTD compression.

LowCardinality types reduce storage and improve performance.

Delta+ZSTD compression for continuous time fields.

Hot‑cold tiering: recent data on SSD, older data on HDD, automatic cleanup after six months.

Example table creation script (image omitted) demonstrates the schema.
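Since the original script was shared only as an image, the following is a hedged sketch of what such a schema might look like; the database, table, column names, and storage policy (`ssd_to_hdd`) are illustrative assumptions, not the author's actual definitions. It combines the optimizations listed above: ZSTD codecs on most fields, LowCardinality types, Delta+ZSTD on the time column, and a TTL that moves data from SSD to HDD and deletes it after six months:

```sql
-- Illustrative only: names and storage policy are assumptions.
CREATE TABLE logs.app_log_local
(
    `timestamp` DateTime CODEC(Delta, ZSTD(1)),  -- Delta+ZSTD suits monotonic time
    `app_name`  LowCardinality(String),          -- few distinct values
    `level`     LowCardinality(String),
    `trace_id`  String CODEC(ZSTD(1)),
    `message`   String CODEC(ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (app_name, timestamp)
TTL timestamp + INTERVAL 7 DAY TO DISK 'hdd',    -- hot data stays on SSD
    timestamp + INTERVAL 6 MONTH DELETE          -- automatic cleanup after six months
SETTINGS storage_policy = 'ssd_to_hdd';          -- assumes a tiered storage policy exists
```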

ClickVisual Visualization

ClickVisual is an open‑source lightweight platform for log query, analysis, and alerting, supporting ClickHouse as a backend. It offers query panels, index statistics, proxy authentication, and real‑time alerting, plus direct SQL aggregation.
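As an example of the direct SQL aggregation ClickVisual exposes, a per‑minute error count over the last hour might look like this (table and column names are illustrative, matching no particular schema from the article):

```sql
-- Illustrative names: logs.app_log_local, level, timestamp are assumptions.
SELECT
    toStartOfMinute(timestamp) AS minute,
    count() AS errors
FROM logs.app_log_local
WHERE level = 'ERROR'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;
```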

Further Optimizations

1. Log Query Optimization

ClickHouse’s high compression and query speed enable time‑partitioned searches for small tables, but large volumes still require tuning. Optimizations include:

TraceID queries using tokenbf_v1 index and hasToken for fast hits.

Inverted index for unstructured logs to avoid slow LIKE scans.

Projection feature for common aggregation scenarios.
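The first and third optimizations above can be sketched in SQL. This is an illustrative example, not the team's actual statements; the table, column, and projection names are assumptions. Note that hasToken can use a tokenbf_v1 bloom‑filter skip index to prune granules, whereas a `LIKE '%…%'` scan cannot:

```sql
-- Add a token bloom-filter skip index for TraceID lookups
-- (params: bloom filter size in bytes, hash functions, seed).
ALTER TABLE logs.app_log_local
    ADD INDEX idx_trace_id trace_id TYPE tokenbf_v1(30720, 2, 0) GRANULARITY 1;

-- hasToken hits the index; trace IDs without separator characters work best.
SELECT timestamp, message
FROM logs.app_log_local
WHERE hasToken(trace_id, 'f81d4fae7dec11d0a76500a0c91e6bf6')
  AND timestamp >= now() - INTERVAL 1 HOUR;

-- A projection pre-aggregates a common query shape at insert/merge time.
ALTER TABLE logs.app_log_local
    ADD PROJECTION p_level_count
    (SELECT app_name, level, count() GROUP BY app_name, level);
ALTER TABLE logs.app_log_local MATERIALIZE PROJECTION p_level_count;
```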

2. Local vs Distributed Tables

For high‑frequency writes, the team writes to local tables rather than distributed tables: distributed‑table inserts add ZooKeeper and network overhead, and can trigger “Too many parts” errors and consistency problems.
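The pattern reads roughly as follows; the cluster name, database, and table names are illustrative assumptions. Writers insert directly into each shard's local table, while reads fan out through a Distributed table:

```sql
-- Illustrative: log_cluster and the logs.* names are assumptions.
CREATE TABLE logs.app_log_all AS logs.app_log_local
ENGINE = Distributed(log_cluster, logs, app_log_local, rand());

-- Writes (e.g. from the collector) go straight to the shard-local table,
-- avoiding the Distributed engine's extra buffering and fan-out on INSERT:
INSERT INTO logs.app_log_local (timestamp, app_name, level, trace_id, message)
VALUES (now(), 'order-service', 'INFO', 'abc123', 'order created');

-- Reads still use the Distributed table to query all shards at once:
SELECT count() FROM logs.app_log_all WHERE level = 'ERROR';
```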

3. ClickHouse Limiting Strategies

To prevent runaway queries, several limits are configured in users.xml:

max_memory_usage: maximum RAM a single query may use.

max_memory_usage_for_user: total RAM across all of one user's running queries.

max_memory_usage_for_all_queries: total RAM across all queries on the server.

max_rows_to_read: maximum number of rows a query may scan.

max_result_rows: maximum number of rows in a result set.

max_bytes_to_read: maximum number of uncompressed bytes a query may scan.
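A users.xml profile with these limits might look like the fragment below. The values are illustrative, not the team's production settings, and should be tuned to cluster capacity (the root element is `<yandex>` on older ClickHouse versions):

```xml
<!-- Illustrative values only; tune to your hardware. -->
<clickhouse>
    <profiles>
        <default>
            <max_memory_usage>10000000000</max_memory_usage>                    <!-- ~10 GB per query -->
            <max_memory_usage_for_user>20000000000</max_memory_usage_for_user>  <!-- ~20 GB per user -->
            <max_memory_usage_for_all_queries>50000000000</max_memory_usage_for_all_queries>
            <max_rows_to_read>1000000000</max_rows_to_read>     <!-- abort scans past 1e9 rows -->
            <max_result_rows>100000</max_result_rows>           <!-- cap result-set size -->
            <max_bytes_to_read>100000000000</max_bytes_to_read> <!-- cap bytes scanned -->
        </default>
    </profiles>
</clickhouse>
```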

Conclusion

The article shares the courier company's cloud‑native practice of building a log center, highlighting how the new ClickHouse‑based architecture improved stability, observability, and cost efficiency, while enabling deeper user behavior analysis and supporting rapid product iteration.

Tags: cloud native, observability, ClickHouse, ELK, log management, data compression
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career, growing together.
