How We Built a Scalable Real‑Time Log Center with ClickHouse and ELK

Facing massive data volumes, the team at Kuaidi100 redesigned its logging platform in stages: from plain log files to an ELK stack, and finally to a ClickHouse‑based architecture. The result is real‑time, scalable, cost‑effective log collection, analysis, and alerting, with the storage, performance, and maintenance problems of the earlier designs addressed along the way.

dbaplus Community

Project Background

Kuaidi100 serves over 2.5 billion end‑users and more than 1.3 million couriers and partners, generating billions of daily queries and hundreds of thousands of orders. A real‑time, highly scalable log center with strong search and analytics was required for operational insight and rapid issue resolution.

Initial Architecture

The original solution stored plain log files on individual machines. Major drawbacks were:

Manual login to multiple servers to view logs.

High I/O pressure when using tail or cat for searching.

Large log files causing slow queries, disk alerts and storage exhaustion.

Unstructured log formats with virtually no readability or analysability.

Performance‑degrading NFS mounts and risk of log loss.

ELK Stack Adoption

In 2017 the team migrated to an ELK (Elasticsearch‑Logstash‑Kibana) architecture, using JSON‑formatted logs, full‑text search and Kibana visualizations, which dramatically improved log accessibility and search speed.

ELK architecture diagram

Challenges with ELK

After months of use several limitations emerged:

High storage cost due to low compression; six‑month retention required massive disk space.

Write‑throughput bottlenecks caused by Elasticsearch tokenization.

Excessive memory consumption.

Complex TTL management and manual data expiration.

Insufficient aggregation performance for growing analytical needs.

Migration to ClickHouse

In 2020 the team evaluated ClickHouse as a replacement for Elasticsearch. Benchmarks showed superior compression (ZSTD) and query speed, leading to the decision to adopt ClickHouse for log storage.

New Architecture Overview

The redesigned pipeline consists of four layers:

Collection Layer: Replaced Logstash with ilogtail, offering higher performance and lower resource usage.

Processing Layer: ilogtail adds data masking, multiline splitting and other useful functions.

Storage Layer: Switched from Elasticsearch to ClickHouse, benefiting from high compression and fast reads.

Visualization: Replaced Kibana with ClickVisual (supplemented by Grafana) for log querying and alerting.
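
To make the processing‑layer functions concrete, here is a minimal Python sketch of data masking and multiline splitting. It is not ilogtail's implementation or configuration; the phone‑number pattern, the timestamp pattern that marks the start of a record, and the sample lines are all assumptions chosen for illustration.

```python
import re

# Assumed patterns: an 11-digit mobile number, and a date prefix that
# marks the first line of a new log record.
PHONE = re.compile(r"(?<!\d)(1\d{2})\d{4}(\d{4})(?!\d)")
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} ")


def mask_phone(text: str) -> str:
    """Replace the middle four digits of an 11-digit number with asterisks."""
    return PHONE.sub(r"\1****\2", text)


def split_records(lines):
    """Split a stream of raw lines into records.

    A line that does not start with a date is treated as a continuation
    (for example a stack-trace line) of the previous record.
    """
    records = []
    for line in lines:
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += "\n" + line
    return records


raw = [
    "2024-01-01 10:00:00 ERROR order failed for 13812345678",
    "java.lang.IllegalStateException: boom",
    "    at com.example.Order.submit(Order.java:42)",
    "2024-01-01 10:00:01 INFO retry scheduled",
]
records = [mask_phone(r) for r in split_records(raw)]
```

Here the three lines of the stack trace end up in one record, and the phone number in it is masked before the record leaves the collector.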

New architecture diagram

Performance Results

Testing with 1 billion log entries demonstrated:

ClickHouse’s ZSTD compression reduced disk usage enough to retain six months of data on the same hardware that previously held only one month with Elasticsearch.

Kafka consumption speed increased markedly (benchmark charts shown below).

Disk usage comparison
Kafka consumption speed

Storage Optimizations

Key techniques applied to the ClickHouse tables:

Use ZSTD compression for most fields.

Apply LowCardinality types to reduce size and improve performance.

Delta + ZSTD compression for continuous timestamp fields.

Hot‑cold tiering: recent data on SSD, older data on HDD, automatic cleanup after six months.

Example table definition (image):

ClickHouse table DDL
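
Since the original DDL is only available as an image, the following Python helper builds a table definition that applies the four techniques listed above. It is a sketch, not the team's actual schema: the table name, column names, the seven‑day hot window, the `hot_cold` storage policy and its `cold` volume are all assumptions, and the storage policy would have to be defined in the server configuration beforehand.

```python
def build_log_table_ddl(table: str = "logs.app_log",
                        hot_days: int = 7,
                        retain_months: int = 6) -> str:
    """Return a CREATE TABLE statement for a compressed, tiered log table.

    All identifiers are hypothetical. The 'hot_cold' policy is assumed to
    have an SSD-backed default volume and an HDD-backed volume named 'cold'.
    """
    return f"""
CREATE TABLE IF NOT EXISTS {table}
(
    ts       DateTime64(3)          CODEC(Delta, ZSTD(1)),
    service  LowCardinality(String),
    level    LowCardinality(String),
    host     LowCardinality(String),
    trace_id String                 CODEC(ZSTD(1)),
    message  String                 CODEC(ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (service, level, ts)
TTL toDateTime(ts) + INTERVAL {hot_days} DAY TO VOLUME 'cold',
    toDateTime(ts) + INTERVAL {retain_months} MONTH DELETE
SETTINGS storage_policy = 'hot_cold'
""".strip()


ddl = build_log_table_ddl()
print(ddl)
```

The timestamp column uses Delta followed by ZSTD because consecutive log timestamps differ by small amounts, low‑cardinality fields such as service and level use LowCardinality, and the two TTL rules express the move to the cold volume and the six‑month cleanup.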

ClickVisual Visualization

ClickVisual is an open‑source, lightweight platform that natively supports ClickHouse. It provides:

Visual query panels with hit‑count histograms and raw log view.

Log index statistics.

Proxy authentication for easy integration.

Real‑time alerting based on ClickHouse logs.

ClickVisual UI

It also offers a raw SQL query interface for ad‑hoc aggregation.

ClickVisual SQL query
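
As an illustration of the kind of ad‑hoc aggregation one might type into such a SQL interface, the Python helper below builds a query that counts error logs per minute and service. The table name and the `ts`, `service` and `level` columns are hypothetical, and the query is a sketch rather than one taken from the original article.

```python
import re

# Accept only plain identifiers such as "app_log" or "logs.app_log".
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def errors_per_minute_sql(table: str, hours: int = 1) -> str:
    """Build a ClickHouse query counting ERROR logs per minute and service."""
    if not _IDENT.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if hours < 1:
        raise ValueError("hours must be at least 1")
    return "\n".join([
        "SELECT toStartOfMinute(ts) AS minute, service, count() AS errors",
        f"FROM {table}",
        "WHERE level = 'ERROR'",
        f"  AND ts >= now() - INTERVAL {hours} HOUR",
        "GROUP BY minute, service",
        "ORDER BY minute, errors DESC",
    ])


query = errors_per_minute_sql("logs.app_log", hours=3)
print(query)
```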

Further Optimizations

Additional refinements addressed specific query scenarios:

Trace‑ID queries using tokenbf_v1 index with hasToken for fast lookup.

Inverted indexes for unstructured logs, dramatically speeding up LIKE searches.

Projection feature for common aggregations.
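
The three refinements above could look roughly like the statements below, collected in Python for convenience. This is a sketch under several assumptions: the table and column names are hypothetical, the bloom‑filter parameters are placeholder values rather than tuned ones, and the inverted index is an experimental feature whose type name and enabling setting have changed between ClickHouse releases, so check the documentation for your version before using it.

```python
import re

TABLE = "logs.app_log"  # hypothetical table with a raw-text `message` column

STATEMENTS = [
    # Token bloom filter over the raw message:
    # tokenbf_v1(filter size in bytes, number of hash functions, seed).
    f"ALTER TABLE {TABLE} ADD INDEX idx_msg_token message "
    "TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4",
    f"ALTER TABLE {TABLE} MATERIALIZE INDEX idx_msg_token",
    # Experimental inverted index; in some releases it must be enabled first,
    # e.g. with SET allow_experimental_inverted_index = 1 (version dependent).
    f"ALTER TABLE {TABLE} ADD INDEX idx_msg_inv message "
    "TYPE inverted(0) GRANULARITY 1",
    # Projection that pre-aggregates a common GROUP BY.
    f"ALTER TABLE {TABLE} ADD PROJECTION p_level_count "
    "(SELECT service, level, count() GROUP BY service, level)",
    f"ALTER TABLE {TABLE} MATERIALIZE PROJECTION p_level_count",
]


def trace_lookup_sql(trace_id: str, limit: int = 100) -> str:
    """Build a hasToken lookup for a trace ID embedded in the message.

    hasToken matches whole tokens, so the ID must consist of ASCII
    letters and digits only (no hyphens or other separators).
    """
    if not re.fullmatch(r"[0-9A-Za-z]+", trace_id):
        raise ValueError("trace_id must be a single alphanumeric token")
    return (
        f"SELECT ts, service, message FROM {TABLE} "
        f"WHERE hasToken(message, '{trace_id}') "
        f"ORDER BY ts LIMIT {int(limit)}"
    )


lookup = trace_lookup_sql("4bf92f3577b34da6a3ce929d0e0e4736")
```

Because hasToken works on whole tokens, a trace ID formatted with hyphens would need to be stored or searched in a separator‑free form for this lookup to use the index.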

ClickHouse Configuration Limits

To prevent runaway queries and OOM situations, the team tuned limits in users.xml:

max_memory_usage – maximum memory per query.

max_memory_usage_for_user – maximum memory per user.

max_memory_usage_for_all_queries – maximum memory for all concurrent queries.

max_rows_to_read – maximum rows a query may read.

max_result_rows – limit on rows returned.

max_bytes_to_read – maximum uncompressed bytes a query may read.
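
A small Python sketch of how such limits might be rendered into a users.xml profile follows. Every value is a placeholder, not the team's setting, and the `default` profile and `clickhouse` root tag are assumptions (older installations use a different root tag). max_memory_usage_for_all_queries is left out because recent ClickHouse releases appear to handle the server‑wide memory limit in the server configuration instead; verify this against your version.

```python
import xml.etree.ElementTree as ET

GIB = 1024 ** 3

# Placeholder limits for illustration only; tune them for your workload.
LIMITS = {
    "max_memory_usage": 10 * GIB,           # per query
    "max_memory_usage_for_user": 20 * GIB,  # per user
    "max_rows_to_read": 1_000_000_000,
    "max_bytes_to_read": 100 * GIB,
    "max_result_rows": 100_000,
}


def render_profile(limits: dict, profile: str = "default",
                   root_tag: str = "clickhouse") -> str:
    """Render settings as a <profiles><profile>...</profile></profiles> fragment."""
    root = ET.Element(root_tag)
    prof = ET.SubElement(ET.SubElement(root, "profiles"), profile)
    for name, value in limits.items():
        ET.SubElement(prof, name).text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


xml_text = render_profile(LIMITS)
print(xml_text)
```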

Conclusion

The migration from a file‑based system to ELK and finally to a ClickHouse‑driven log center delivered real‑time, scalable, and cost‑effective logging for Kuaidi100. The new platform sped up issue location, improved system stability, and provided richer insight into user behavior, enabling better product and operational decisions.
