How ClickHouse Outperformed Elasticsearch for Ctrip’s Log Analytics: Speed, Cost & Ops
Facing growing log volumes and high Elasticsearch costs, Ctrip migrated its log platform, which ingests 200 TB of logs per day, to ClickHouse. The migration delivered up to 38× faster queries, 60 % lower server resource usage, and simpler operations, built on columnar storage, sharding, and custom dashboards. This summary covers deployment, tuning, and common pitfalls.
Background and Motivation
Elasticsearch (ES), a Lucene‑based distributed full‑text search engine, was used by Ctrip to process over 200 TB of daily logs across more than 500 servers. As log volume grew, the company faced rising hardware costs, high write latency, slow or failed queries, and increasing operational overhead.
Why ClickHouse?
ClickHouse is a high‑performance column‑oriented distributed DBMS. Tests showed several advantages:
Write throughput : 50‑200 MB/s per server (over 600 k records/s), >5× ES write speed, reducing write rejections and latency.
Query speed : 2‑30 GB/s from page cache, 5‑30× faster than ES in tests.
Cost efficiency : Compressed data occupies 1/3 to 1/30 of the space ES needs, with lower disk I/O and less memory and CPU usage, for an estimated 50 % reduction in server cost.
Stability : Better load balancing via shards, built‑in query limits to avoid OOM, and native hot‑cold partitioning.
SQL syntax : Simpler than ES DSL, lower learning curve.
ClickHouse High‑Availability Deployment
The cluster uses multiple shards with 2 replicas each, coordinated by ZooKeeper. Two cluster sizes (6‑node and 20‑node) support different log scales. For cross‑IDC deployment, a full ClickHouse cluster runs in each data center, and a distributed table spans all IDC clusters, enabling seamless cross‑region queries.
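The shard, replica, and distributed-table layout described above can be sketched as the DDL below. This is an illustration, not Ctrip's actual schema: the cluster, database, table, and column names and the ZooKeeper path are assumptions, and the statements are only built as strings here.

```python
# Sketch only: all names below are hypothetical, not taken from the article.
CLUSTER = "log_cluster"
DATABASE = "logs"
TABLE = "app_log"


def local_table_ddl(cluster, database, table):
    """DDL for the replicated table that stores data on every shard."""
    # {shard} and {replica} are ClickHouse macros resolved on each server,
    # so the braces are doubled to survive the f-string.
    zk_path = f"/clickhouse/tables/{{shard}}/{database}/{table}_local"
    return (
        f"CREATE TABLE {database}.{table}_local ON CLUSTER {cluster}\n"
        "(\n"
        "    timestamp DateTime,\n"
        "    host String,\n"
        "    level String,\n"
        "    message String\n"
        ")\n"
        f"ENGINE = ReplicatedMergeTree('{zk_path}', '{{replica}}')\n"
        "PARTITION BY toYYYYMMDD(timestamp)\n"
        "ORDER BY (host, timestamp)"
    )


def distributed_table_ddl(cluster, database, table):
    """DDL for the table that fans queries out to every shard."""
    return (
        f"CREATE TABLE {database}.{table}_all ON CLUSTER {cluster}\n"
        f"AS {database}.{table}_local\n"
        f"ENGINE = Distributed({cluster}, {database}, {table}_local, rand())"
    )


if __name__ == "__main__":
    print(local_table_ddl(CLUSTER, DATABASE, TABLE))
    print()
    print(distributed_table_ddl(CLUSTER, DATABASE, TABLE))
```

Queries would go to the `_all` table, while the `_local` tables hold the data and replicate through ZooKeeper.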
Key Configuration Parameters
max_threads = 32 # maximum number of threads used to process a single query
max_memory_usage = 10000000000 # 10 GB (~9.3 GiB) per query
max_execution_time = 30 # seconds
skip_unavailable_shards = 1 # continue query if a shard is down
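These limits would normally live in a user profile, but ClickHouse also accepts them per query through a `SETTINGS` clause. The sketch below, with a hypothetical query and table name, shows how the four values could be attached to a statement and checks the byte-to-GiB conversion behind the memory limit.

```python
# Values mirror the list above; the example query and table are hypothetical.
QUERY_LIMITS = {
    "max_threads": 32,
    "max_memory_usage": 10_000_000_000,
    "max_execution_time": 30,
    "skip_unavailable_shards": 1,
}


def with_settings(sql, settings):
    """Append a ClickHouse SETTINGS clause to a single SELECT statement."""
    clause = ", ".join(f"{name} = {value}" for name, value in settings.items())
    return f"{sql.rstrip().rstrip(';')}\nSETTINGS {clause}"


def bytes_to_gib(n_bytes):
    """Convert a byte count to binary gigabytes (GiB)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    print(with_settings("SELECT count() FROM logs.app_log_all;", QUERY_LIMITS))
    print(round(bytes_to_gib(QUERY_LIMITS["max_memory_usage"]), 2), "GiB")
```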
Data Ingestion Practices
Data is consumed into ClickHouse via gohangout. Recommendations include:
Round‑robin writes across all nodes for even distribution.
Batch writes at low frequency (e.g., 100 k records or every 30 s) to reduce part count and avoid “Too many parts”.
Write to local tables, not distributed tables, to limit network traffic and merge overhead.
Use daily partitions; avoid timestamp‑based partitions that cause excessive parts.
Properly configure primary keys, indexes, and handle out‑of‑order data to maintain write performance.
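The batching and round‑robin recommendations above can be illustrated with a small buffer that flushes on row count or age and rotates over the nodes. This is a simplified sketch, not gohangout's implementation: the `send` callable stands in for the real insert into a node's local table, and the age check only runs when a new row arrives (a real shipper would also flush on a timer).

```python
import itertools
import time


class BatchWriter:
    """Buffer rows and flush them in large batches, one node at a time."""

    def __init__(self, nodes, send, max_rows=100_000, max_age_s=30.0,
                 clock=time.monotonic):
        self._nodes = itertools.cycle(nodes)  # round-robin over all nodes
        self._send = send                     # hypothetical sink: send(node, rows)
        self._max_rows = max_rows
        self._max_age_s = max_age_s
        self._clock = clock
        self._rows = []
        self._first_row_at = None

    def add(self, row):
        """Buffer one row; flush if the batch is large or old enough."""
        if not self._rows:
            self._first_row_at = self._clock()
        self._rows.append(row)
        too_big = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._first_row_at >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        """Send the buffered rows to the next node and reset the buffer."""
        if not self._rows:
            return
        self._send(next(self._nodes), self._rows)
        self._rows = []
        self._first_row_at = None


if __name__ == "__main__":
    writer = BatchWriter(
        ["ch-node-1", "ch-node-2"],
        lambda node, rows: print(f"{node}: {len(rows)} rows"),
        max_rows=3,
    )
    for i in range(7):
        writer.add({"id": i})
    writer.flush()  # send whatever is left
```

Fewer, larger inserts mean fewer parts for the background merges to handle, which is what keeps the "Too many parts" error away.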
Dashboard Migration and Visualization
Kibana 3 was extended to query ClickHouse directly, preserving familiar visualizations (terms, histogram, percentiles, ranges, table). A custom migration tool rewrites Kibana dashboard configurations to point to ClickHouse.
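To give a feel for what such a dashboard backend has to do, the sketch below maps two of the listed panel types to ClickHouse SQL. The function names and parameters are hypothetical, and this is not the code of Ctrip's Kibana extension; it only shows the general shape of the translation.

```python
def terms_sql(table, field, size=10):
    """Top-N values of a field, the equivalent of a 'terms' panel."""
    return (
        f"SELECT {field} AS term, count() AS hits\n"
        f"FROM {table}\n"
        "GROUP BY term\n"
        "ORDER BY hits DESC\n"
        f"LIMIT {size}"
    )


def histogram_sql(table, time_field, interval_seconds):
    """Event counts per time bucket, the equivalent of a 'histogram' panel."""
    return (
        f"SELECT toStartOfInterval({time_field}, "
        f"INTERVAL {interval_seconds} SECOND) AS bucket, count() AS hits\n"
        f"FROM {table}\n"
        "GROUP BY bucket\n"
        "ORDER BY bucket"
    )


if __name__ == "__main__":
    # Table and column names are placeholders.
    print(terms_sql("logs.app_log_all", "host", size=5))
    print()
    print(histogram_sql("logs.app_log_all", "timestamp", 60))
```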
Performance Results
In production (≈100 TB of logs after compression, ≈600 TB before compression), ClickHouse showed:
Memory usage far lower than ES because most data stays on disk.
Disk space savings of 60‑80 % (e.g., Netflow logs occupy 32 % of ES size).
Query speed improvements of 4.4×‑38× depending on workload.
Operational Practices
Typical operational tasks include:
Onboarding new log sources and performance tuning.
Daily partition cleanup for expired logs.
Monitoring via ClickHouse‑exporter, VictoriaMetrics, and Grafana.
Data migration using distributed tables; when needed, clickhouse-copier or manual replication.
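The daily partition cleanup can be sketched as follows, assuming tables partitioned by `toYYYYMMDD(...)` so that partition IDs look like `20240101`. The table name and retention period are assumptions; in practice the list of existing partitions would come from the `system.parts` table.

```python
from datetime import date, timedelta


def expired_partitions(existing, today, retention_days):
    """Return the YYYYMMDD partition IDs older than the retention window."""
    cutoff = (today - timedelta(days=retention_days)).strftime("%Y%m%d")
    # Fixed-width YYYYMMDD strings sort chronologically.
    return sorted(p for p in existing if p < cutoff)


def drop_statements(table, partitions):
    """Build one ALTER TABLE ... DROP PARTITION statement per partition."""
    return [f"ALTER TABLE {table} DROP PARTITION {p}" for p in partitions]


if __name__ == "__main__":
    # Hypothetical partition list and table name.
    existing = ["20240101", "20240105", "20240110"]
    old = expired_partitions(existing, date(2024, 1, 10), retention_days=7)
    for statement in drop_statements("logs.app_log_local", old):
        print(statement)
```

Dropping a whole partition removes its parts in one step, which is why daily partitions make expiry cheap.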
Common Issues and Mitigations
Slow queries : terminate them with KILL QUERY, then prevent recurrence with the query limits listed under Key Configuration Parameters.
“Too many parts” : caused by excessive small batches, writes to distributed tables, or insufficient merge threads; resolve by adjusting batch size, writing to local tables, and increasing merge thread count.
Startup failures : may stem from corrupted file systems or malformed table data; fix by repairing the FS or moving problematic parts to the detached directory.
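For the first two issues, the queries below are one possible way to investigate. They read ClickHouse's `system.processes` and `system.parts` tables; the thresholds, database, and table names are placeholders, and the statements are only built as strings here.

```python
def slow_queries_sql(min_elapsed_s):
    """List running queries that have exceeded a time threshold."""
    return (
        "SELECT query_id, user, elapsed, query\n"
        "FROM system.processes\n"
        f"WHERE elapsed > {min_elapsed_s}\n"
        "ORDER BY elapsed DESC"
    )


def kill_query_sql(query_id):
    """Build a KILL QUERY statement for one query ID."""
    escaped = query_id.replace("\\", "\\\\").replace("'", "\\'")
    return f"KILL QUERY WHERE query_id = '{escaped}'"


def parts_per_partition_sql(database, table):
    """Count active parts per partition to spot a 'Too many parts' risk."""
    return (
        "SELECT partition, count() AS active_parts\n"
        "FROM system.parts\n"
        f"WHERE active AND database = '{database}' AND table = '{table}'\n"
        "GROUP BY partition\n"
        "ORDER BY active_parts DESC"
    )


if __name__ == "__main__":
    print(slow_queries_sql(30))
    print()
    print(kill_query_sql("hypothetical-query-id"))
    print()
    print(parts_per_partition_sql("logs", "app_log_local"))
```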
Conclusion
Replacing Elasticsearch with ClickHouse reduced server resources, cut operational costs, and dramatically accelerated log query performance, especially during incident investigations. While ClickHouse is not a universal replacement for all ES use‑cases, its columnar architecture proves highly effective for large‑scale log analytics, and further exploration can extend its value to additional domains.
dbaplus Community
