How ClickHouse Outperformed Elasticsearch for Ctrip’s Log Analytics: Speed, Cost & Ops

Facing growing log volumes and high Elasticsearch costs, Ctrip migrated its 200 TB of daily logs to ClickHouse. The move delivered up to 38× faster queries, 60% lower server resources, and simpler operations, built on columnar storage, sharding, and custom dashboards. This article covers the deployment, tuning, and common pitfalls.

dbaplus Community

Background and Motivation

Elasticsearch (ES), a Lucene-based distributed full-text search engine, was used by Ctrip to process over 200 TB of daily logs across more than 500 servers. As log volume grew, the company faced rising hardware costs, high write latency, slow or failed queries, and increasing operational overhead.

Why ClickHouse?

ClickHouse is a high‑performance column‑oriented distributed DBMS. Tests showed several advantages:

Write throughput: 50-200 MB/s per server (over 600 k records/s), more than 5× the ES write speed, which reduces write rejections and latency.

Query speed: 2-30 GB/s when data is served from the page cache, 5-30× faster than ES in tests.

Cost efficiency: compressed data occupies 1/3 to 1/30 of the space ES needs, with lower disk I/O and less memory and CPU usage; the estimated server cost reduction is 50%.

Stability: better load balancing via shards, built-in query limits to avoid OOM, and native hot-cold partitioning.

SQL syntax: simpler than the ES DSL, with a lower learning curve.

ClickHouse High‑Availability Deployment

The cluster uses multiple shards with 2 replicas each, coordinated by ZooKeeper. Two cluster sizes (6-node and 20-node) support different log scales. For cross-IDC deployment, a full ClickHouse cluster is deployed in each data center, and a distributed table spans all IDC clusters, enabling seamless cross-region queries.
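As a rough illustration of the layout described above, the sketch below generates the DDL for a replicated local table and a distributed table on top of it. The database, table, column, and cluster names are hypothetical, the ZooKeeper path is only one common convention, and the `{shard}` and `{replica}` macros are assumed to be defined in each server's configuration. It is not Ctrip's actual schema.

```python
def local_table_ddl(db: str, table: str, cluster: str) -> str:
    """DDL for the per-shard replicated table (all names are illustrative)."""
    return (
        f"CREATE TABLE {db}.{table}_local ON CLUSTER {cluster} (\n"
        "    timestamp DateTime,\n"
        "    app String,\n"
        "    message String\n"
        ")\n"
        # {shard} and {replica} are ClickHouse macros, kept literal here.
        f"ENGINE = ReplicatedMergeTree('/clickhouse/tables/{{shard}}/{db}/{table}_local', '{{replica}}')\n"
        "PARTITION BY toYYYYMMDD(timestamp)\n"
        "ORDER BY (app, timestamp)"
    )


def distributed_table_ddl(db: str, table: str, cluster: str) -> str:
    """DDL for the distributed table that fans queries out to every shard."""
    return (
        f"CREATE TABLE {db}.{table} ON CLUSTER {cluster} "
        f"AS {db}.{table}_local\n"
        f"ENGINE = Distributed({cluster}, {db}, {table}_local, rand())"
    )


print(local_table_ddl("logs", "app_log", "log_cluster"))
print(distributed_table_ddl("logs", "app_log", "log_cluster"))
```

For a cross-IDC view, the cluster name passed to `Distributed` would presumably refer to a cluster definition that lists the shards of every data center.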

Figure: ClickHouse HA architecture
Figure: Cross-IDC ClickHouse deployment

Key Configuration Parameters

max_threads = 32 # maximum number of threads used to process a single query

max_memory_usage = 10000000000 # bytes; 10 GB (≈9.3 GiB) per query

max_execution_time = 30 # seconds

skip_unavailable_shards = 1 # continue query if a shard is down
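The same limits can also be attached to an individual query through a SETTINGS clause. The sketch below renders such a clause from the values listed above and converts the memory limit to GiB. The helper names and the table `logs.app_log` are made up for illustration.

```python
# Values copied from the parameters listed above.
QUERY_SETTINGS = {
    "max_threads": 32,
    "max_memory_usage": 10_000_000_000,  # bytes
    "max_execution_time": 30,            # seconds
    "skip_unavailable_shards": 1,
}


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


def settings_clause(settings: dict) -> str:
    """Render a ClickHouse SETTINGS clause for a single query."""
    pairs = ", ".join(f"{name} = {value}" for name, value in settings.items())
    return f"SETTINGS {pairs}"


def with_settings(sql: str, settings: dict) -> str:
    """Append the SETTINGS clause to a statement, dropping a trailing semicolon."""
    return f"{sql.rstrip().rstrip(';')} {settings_clause(settings)}"


print(round(bytes_to_gib(QUERY_SETTINGS["max_memory_usage"]), 2))  # 9.31
print(with_settings("SELECT count() FROM logs.app_log", QUERY_SETTINGS))
```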

Data Ingestion Practices

Data is consumed into ClickHouse via gohangout. Recommendations include:

Round‑robin writes across all nodes for even distribution.

Write in large batches at low frequency (e.g., every 100 k records or every 30 s) to keep the part count down and avoid the “Too many parts” error.

Write to local tables, not distributed tables, to limit network traffic and merge overhead.

Use daily partitions; avoid finer-grained timestamp-based partition keys, which create an excessive number of parts.

Choose primary keys and indexes carefully, and handle out-of-order data, to maintain write performance.
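The first two recommendations (round-robin writes and large, infrequent batches) can be sketched as a small buffering class. This is not gohangout's implementation: the class, the `send` callback, and the node names are hypothetical, and a real writer would also need retries and error handling.

```python
import itertools
import time
from typing import Callable, List, Optional, Sequence


class RoundRobinBatcher:
    """Buffer records and flush them in large batches, rotating target nodes."""

    def __init__(self, nodes: Sequence[str],
                 send: Callable[[str, List[dict]], None],
                 max_records: int = 100_000,
                 max_age_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = itertools.cycle(nodes)   # round-robin over all nodes
        self._send = send                      # writes one batch to one node's local table
        self._max_records = max_records
        self._max_age_s = max_age_s
        self._clock = clock
        self._buffer: List[dict] = []
        self._first_at: Optional[float] = None

    def add(self, record: dict) -> None:
        if not self._buffer:
            self._first_at = self._clock()     # age is measured from the oldest record
        self._buffer.append(record)
        if len(self._buffer) >= self._max_records:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes once the oldest buffered record is too old."""
        if self._buffer and self._clock() - self._first_at >= self._max_age_s:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._first_at = None
        self._send(next(self._nodes), batch)


sent = []
batcher = RoundRobinBatcher(["ch-1", "ch-2"],
                            lambda node, batch: sent.append((node, len(batch))),
                            max_records=3)
for i in range(7):
    batcher.add({"i": i})
batcher.flush()
print(sent)  # [('ch-1', 3), ('ch-2', 3), ('ch-1', 1)]
```

Each flush produces one insert on one node, so the number of new parts grows with the number of flushes rather than with the number of records.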

Dashboard Migration and Visualization

Kibana 3 was extended to query ClickHouse directly, preserving familiar visualizations (terms, histogram, percentiles, ranges, table). A custom migration tool rewrites Kibana dashboard configurations to point to ClickHouse.

Figure: Kibana-ClickHouse dashboard

Performance Results

In production (≈600 TB of raw logs, ≈100 TB after compression), ClickHouse showed:

Memory usage far lower than ES because most data stays on disk.

Disk space savings of 60‑80 % (e.g., Netflow logs occupy 32 % of ES size).

Query speed improvements of 4.4×‑38× depending on workload.
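As a quick arithmetic check on the disk figures above, occupying 32% of the ES footprint corresponds to a 68% saving, which falls inside the quoted 60-80% range. The helper below is purely illustrative.

```python
def savings_pct(size_ratio: float) -> float:
    """Disk saving in percent when the new store uses size_ratio of the old size."""
    return round((1.0 - size_ratio) * 100, 1)


netflow_saving = savings_pct(0.32)   # Netflow logs: 32% of the ES size
print(netflow_saving)                # 68.0
print(60 <= netflow_saving <= 80)    # True
```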

Figure: Performance comparison charts

Operational Practices

Typical operational tasks include:

Onboarding new log sources and performance tuning.

Daily partition cleanup for expired logs.

Monitoring via ClickHouse‑exporter, VictoriaMetrics, and Grafana.

Data migration using distributed tables; when needed, clickhouse-copier or manual replication.
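The daily partition cleanup could be scripted along the following lines. This sketch assumes tables are partitioned by `toYYYYMMDD(...)`, so partition values look like `20240301`; the table name and the retention period are hypothetical. The list of existing partitions would typically come from `system.parts`.

```python
from datetime import date, timedelta
from typing import Iterable, List


def expired_partitions(partitions: Iterable[str], today: date,
                       retention_days: int) -> List[str]:
    """Return daily partitions (YYYYMMDD strings) older than the retention window."""
    cutoff = (today - timedelta(days=retention_days)).strftime("%Y%m%d")
    # Fixed-width YYYYMMDD strings sort chronologically, so string comparison is safe.
    return sorted(p for p in partitions if p < cutoff)


def drop_statements(db: str, table: str, partitions: Iterable[str]) -> List[str]:
    """Build one ALTER TABLE ... DROP PARTITION statement per expired partition."""
    return [f"ALTER TABLE {db}.{table} DROP PARTITION {p}" for p in partitions]


existing = ["20240301", "20240302", "20240303", "20240309"]
old = expired_partitions(existing, today=date(2024, 3, 10), retention_days=7)
for statement in drop_statements("logs", "app_log_local", old):
    print(statement)
# ALTER TABLE logs.app_log_local DROP PARTITION 20240301
# ALTER TABLE logs.app_log_local DROP PARTITION 20240302
```

Because the statements target the local table, they would need to be run on every node that holds a replica or shard.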

Common Issues and Mitigations

Slow queries: terminate them with KILL QUERY, then apply the tuning measures described above.

“Too many parts”: caused by too many small batches, writes to distributed tables, or too few merge threads; resolve by increasing the batch size, writing to local tables, and raising the merge thread count.

Startup failures: may stem from a corrupted file system or malformed table data; fix by repairing the file system or moving the problematic parts to the detached directory.
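For the first two issues, the diagnostic statements can be generated as shown below. The sketch relies on the `system.processes` and `system.parts` system tables and the `KILL QUERY` statement; the function names, thresholds, and table names are illustrative, and identifiers are assumed to be plain trusted names.

```python
def slow_query_sql(min_elapsed_s: float) -> str:
    """List currently running queries that have exceeded a time threshold."""
    return (
        "SELECT query_id, user, elapsed, memory_usage, query "
        "FROM system.processes "
        f"WHERE elapsed > {float(min_elapsed_s)} "
        "ORDER BY elapsed DESC"
    )


def kill_query_sql(query_id: str) -> str:
    """Build a KILL QUERY statement, escaping backslashes and quotes in the ID."""
    escaped = query_id.replace("\\", "\\\\").replace("'", "\\'")
    return f"KILL QUERY WHERE query_id = '{escaped}'"


def parts_per_partition_sql(db: str, table: str) -> str:
    """Count active parts per partition to spot a 'Too many parts' build-up."""
    return (
        "SELECT partition, count() AS parts "
        "FROM system.parts "
        f"WHERE active AND database = '{db}' AND table = '{table}' "
        "GROUP BY partition ORDER BY parts DESC"
    )


print(slow_query_sql(30))
print(kill_query_sql("example-query-id"))
print(parts_per_partition_sql("logs", "app_log_local"))
```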

Conclusion

Replacing Elasticsearch with ClickHouse reduced server resources, cut operational costs, and dramatically accelerated log query performance, especially during incident investigations. While ClickHouse is not a universal replacement for all ES use‑cases, its columnar architecture proves highly effective for large‑scale log analytics, and further exploration can extend its value to additional domains.
