How We Built a Scalable Real‑Time Log Center with ClickHouse and ELK
Facing massive data volumes, the team at Kuaidi100 redesigned its logging platform, moving from a file‑based system to an ELK stack and finally to a ClickHouse‑based architecture. The result is real‑time, scalable, cost‑effective log collection, analysis, and alerting that addresses the storage, performance, and maintenance problems of the earlier designs.
Project Background
Kuaidi100 serves over 2.5 billion end‑users and more than 1.3 million couriers and partners, generating billions of daily queries and hundreds of thousands of orders. A real‑time, highly scalable log center with strong search and analytics was required for operational insight and rapid issue resolution.
Initial Architecture
The original solution stored plain log files on individual machines. Major drawbacks were:
Manual login to multiple servers to view logs.
High I/O pressure when using tail or cat for searching.
Large log files causing slow queries, disk alerts and storage exhaustion.
Unstructured log formats that were hard to read and nearly impossible to analyze.
Performance‑degrading NFS mounts and risk of log loss.
ELK Stack Adoption
In 2017 the team migrated to an ELK (Elasticsearch‑Logstash‑Kibana) architecture, using JSON‑formatted logs, full‑text search and Kibana visualizations, which dramatically improved log accessibility and search speed.
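As an illustration of what "JSON‑formatted logs" means on the application side, here is a minimal sketch using Python's standard logging module. The field names (ts, level, logger, message, trace_id) and the build_logger helper are assumptions made for this example; the article does not describe the team's actual log schema or language.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical field: supplied via extra={"trace_id": ...} when logging.
            "trace_id": getattr(record, "trace_id", None),
        }
        if record.exc_info:
            payload["stack"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("order-service")
    log.info("order created", extra={"trace_id": "a1b2c3"})
```

One JSON object per line lets a collector ship each record without custom parsing, and every key can become a searchable field downstream.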
Challenges with ELK
After months of use, several limitations emerged:
High storage cost due to low compression; six‑month retention required massive disk space.
Write‑throughput bottlenecks caused by Elasticsearch tokenization.
Excessive memory consumption.
Complex TTL management and manual data expiration.
Insufficient aggregation performance for growing analytical needs.
Migration to ClickHouse
In 2020 the team evaluated ClickHouse as a replacement for Elasticsearch. Benchmarks showed superior compression (ZSTD) and query speed, leading to the decision to adopt ClickHouse for log storage.
New Architecture Overview
The redesigned pipeline consists of four layers:
Collection Layer: Replaced Logstash with iLogtail, which offers higher performance and lower resource usage.
Processing Layer: iLogtail adds data masking, multiline splitting and other processing functions.
Storage Layer: Switched from Elasticsearch to ClickHouse, benefiting from high compression and fast reads.
Visualization: Replaced Kibana with ClickVisual (supplemented by Grafana) for log querying and alerting.
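The two processing steps named above, data masking and multiline splitting, can be sketched in a few lines of Python. This only illustrates the ideas and is not how the collector implements or configures them. The regular expressions (an 11‑digit phone number, a line‑leading timestamp) and the function names are assumptions made for the example.

```python
import re
from typing import Iterable, Iterator, List

# Assumed patterns: an 11-digit mobile number and a line-leading timestamp.
PHONE = re.compile(r"(?<!\d)(\d{3})\d{4}(\d{4})(?!\d)")
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def mask_phone(text: str) -> str:
    """Replace the middle four digits of an 11-digit number with asterisks."""
    return PHONE.sub(r"\1****\2", text)


def split_records(lines: Iterable[str]) -> Iterator[str]:
    """Group physical lines into logical log records.

    A line that starts with a timestamp opens a new record; any other line
    (for example a stack-trace frame) is appended to the current record.
    """
    buffer: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) and buffer:
            yield "\n".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "\n".join(buffer)


if __name__ == "__main__":
    raw = [
        "2024-01-01 10:00:00 ERROR send sms to 13812345678 failed",
        "java.lang.RuntimeException: timeout",
        "    at com.example.Sms.send(Sms.java:42)",
        "2024-01-01 10:00:01 INFO ok",
    ]
    for record in split_records(raw):
        print(mask_phone(record))
        print("---")
```

Doing this at the collection edge keeps sensitive values out of storage entirely and ensures a stack trace arrives as one searchable record rather than many unrelated lines.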
Performance Results
Testing with 1 billion log entries demonstrated:
ClickHouse’s ZSTD compression reduced disk usage enough to retain six months of data on the same hardware that previously held only one month with Elasticsearch.
Kafka consumption speed increased markedly (the original benchmark charts are not reproduced here).
Storage Optimizations
Key techniques applied to the ClickHouse tables:
Use ZSTD compression for most fields.
Apply LowCardinality types to fields with few distinct values to reduce size and improve performance.
Use Delta + ZSTD codecs for monotonically increasing timestamp fields.
Hot‑cold tiering: recent data on SSD, older data on HDD, automatic cleanup after six months.
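To see why delta encoding helps timestamp columns, the following sketch compresses the same synthetic column twice: once as raw values and once as differences between neighbours. It uses zlib from the Python standard library as a stand‑in for ZSTD, and the timestamps are made‑up data, so the byte counts it prints say nothing about the team's real compression ratios.

```python
import random
import struct
import zlib


def pack_u32(values):
    """Serialize integers as little-endian unsigned 32-bit values."""
    return struct.pack(f"<{len(values)}I", *values)


def delta_encode(values):
    """Keep the first value, then store each difference from its predecessor."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def compressed_sizes(timestamps):
    """Return (plain, delta) compressed sizes in bytes for the same column."""
    plain = len(zlib.compress(pack_u32(timestamps)))
    delta = len(zlib.compress(pack_u32(delta_encode(timestamps))))
    return plain, delta


def sample_timestamps(n=10_000, start=1_700_000_000, seed=42):
    """Synthetic, non-decreasing second-resolution timestamps (made-up data)."""
    rng = random.Random(seed)
    ts, current = [], start
    for _ in range(n):
        current += rng.randint(0, 2)
        ts.append(current)
    return ts


if __name__ == "__main__":
    plain, delta = compressed_sizes(sample_timestamps())
    print(f"plain: {plain} bytes, delta-encoded: {delta} bytes")
```

Consecutive log timestamps differ by small amounts, so the delta stream is dominated by tiny, repetitive values that a general‑purpose compressor handles far better than the raw integers.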
Example table definition (image):
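A table that combines the techniques listed above might look like the following sketch, written as a small Python helper that renders the DDL. It is not the team's schema. The table name, column names, the 7‑day hot window, the 'hot_cold' storage policy and the 'cold' volume are all assumptions, and the storage policy would need to be defined in the server configuration before this statement could run.

```python
def build_log_table_ddl(
    table: str = "logs.app_log",        # hypothetical database.table
    hot_days: int = 7,                  # assumed SSD retention before moving to HDD
    retention_months: int = 6,          # six-month cleanup, as described in the article
    storage_policy: str = "hot_cold",   # assumed policy name from server config
    cold_volume: str = "cold",          # assumed HDD volume name in that policy
) -> str:
    """Return a CREATE TABLE statement for a hypothetical log table."""
    return f"""
CREATE TABLE IF NOT EXISTS {table}
(
    ts        DateTime               CODEC(Delta, ZSTD(1)),
    service   LowCardinality(String),
    level     LowCardinality(String),
    host      LowCardinality(String),
    trace_id  String                 CODEC(ZSTD(1)),
    message   String                 CODEC(ZSTD(1))
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (service, ts)
TTL ts + INTERVAL {hot_days} DAY TO VOLUME '{cold_volume}',
    ts + INTERVAL {retention_months} MONTH DELETE
SETTINGS storage_policy = '{storage_policy}'
""".strip()


if __name__ == "__main__":
    print(build_log_table_ddl())
```

The two TTL clauses express the tiering: the first moves parts to the cold volume once they age out of the hot window, and the second deletes rows after the retention period, which removes the manual expiration work described for Elasticsearch.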
ClickVisual Visualization
ClickVisual is an open‑source, lightweight platform that natively supports ClickHouse. It provides:
Visual query panels with hit‑count histograms and raw log view.
Log index statistics.
Proxy authentication for easy integration.
Real‑time alerting based on ClickHouse logs.
It also offers a raw SQL query interface for ad‑hoc aggregation.
Further Optimizations
Additional refinements addressed specific query scenarios:
Trace‑ID queries use a tokenbf_v1 skip index together with hasToken for fast lookup.
Inverted indexes on unstructured logs dramatically speed up LIKE searches.
Projections precompute common aggregations.
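The idea behind a token Bloom filter index can be shown with a toy model: each block of rows gets a small Bloom filter over its tokens, and a lookup skips every block whose filter proves the token is absent. The sketch below is a simplified illustration, not ClickHouse's implementation; the tokenizer, hash function, filter size and block layout are all assumptions. In ClickHouse itself the index is declared with TYPE tokenbf_v1(size_in_bytes, number_of_hash_functions, seed) and queried with hasToken(column, 'token').

```python
import hashlib
import re
from typing import Iterable, List

TOKEN = re.compile(r"[A-Za-z0-9]+")  # simplification: ASCII alphanumeric runs only


def tokens(text: str) -> List[str]:
    """Split text into tokens on any non-alphanumeric character."""
    return TOKEN.findall(text)


class TokenBloom:
    """Toy Bloom filter over the tokens of one block of rows."""

    def __init__(self, size_bits: int = 8192, hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.hashes = hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, token: str) -> Iterable[int]:
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add_text(self, text: str) -> None:
        for tok in tokens(text):
            for pos in self._positions(tok):
                self.bits |= 1 << pos

    def might_contain(self, token: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def build_index(blocks: List[List[str]]) -> List[TokenBloom]:
    """Build one filter per block of rows."""
    index = []
    for rows in blocks:
        bloom = TokenBloom()
        for row in rows:
            bloom.add_text(row)
        index.append(bloom)
    return index


def search(blocks, index, token):
    """Return (matching rows, number of blocks actually scanned)."""
    hits, scanned = [], 0
    for rows, bloom in zip(blocks, index):
        if not bloom.might_contain(token):
            continue  # the filter proves the token is absent: skip the block
        scanned += 1
        hits.extend(r for r in rows if token in tokens(r))
    return hits, scanned


if __name__ == "__main__":
    blocks = [
        ["trace_id=abc123 order created", "trace_id=def456 paid"],
        ["trace_id=zzz999 shipped"],
    ]
    print(search(blocks, build_index(blocks), "zzz999"))
```

A Bloom filter never yields false negatives, so skipping is always safe; occasional false positives only cost an extra block scan. Because matching is per token, the searched value has to be a whole token, which is why this approach suits trace IDs that contain no separator characters.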
ClickHouse Configuration Limits
To prevent runaway queries and OOM situations, the team tuned limits in users.xml:
max_memory_usage: maximum memory per query.
max_memory_usage_for_user: maximum memory per user.
max_memory_usage_for_all_queries: maximum memory for all concurrent queries.
max_rows_to_read: maximum rows a query may read.
max_result_rows: limit on rows returned.
max_bytes_to_read: maximum uncompressed bytes a query may read.
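As a rough sketch of how such limits sit in a settings profile, the helper below renders a users.xml fragment from a dictionary. The numeric values are placeholders chosen for the example, since the article does not give the team's actual limits. The root element is `<clickhouse>` here, while older installations use `<yandex>`. max_memory_usage_for_all_queries is left out because newer ClickHouse releases appear to treat it as obsolete in favor of the server‑level max_server_memory_usage setting; verify this against the deployed version.

```python
import xml.etree.ElementTree as ET

# Hypothetical limits; the article does not state the team's actual values.
EXAMPLE_LIMITS = {
    "max_memory_usage": 10 * 1024**3,           # bytes per query
    "max_memory_usage_for_user": 20 * 1024**3,  # bytes per user
    "max_rows_to_read": 1_000_000_000,          # rows a query may read
    "max_bytes_to_read": 100 * 1024**3,         # uncompressed bytes a query may read
    "max_result_rows": 100_000,                 # rows a query may return
}


def render_profile(limits: dict, profile: str = "default") -> str:
    """Render a users.xml fragment with one settings profile."""
    root = ET.Element("clickhouse")
    profiles = ET.SubElement(root, "profiles")
    node = ET.SubElement(profiles, profile)
    for name, value in limits.items():
        ET.SubElement(node, name).text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_profile(EXAMPLE_LIMITS))
```

Keeping the limits in one dictionary makes it easy to review them together and to render different profiles, for example a stricter one for ad‑hoc dashboard users.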
Conclusion
The migration from a file‑based system to ELK and finally to a ClickHouse‑driven log center delivered real‑time, scalable, and cost‑effective logging for Kuaidi100. The new platform improved issue‑location speed and system stability, and provided richer insights into user behavior, enabling better product and operational decisions.
dbaplus Community