Migrating Log Processing from Elasticsearch to ClickHouse: Architecture, Deployment, Optimization, and Benefits
This article details Ctrip's migration of large‑scale log processing from Elasticsearch to ClickHouse, explaining why ClickHouse was chosen, the high‑availability deployment architecture, data ingestion strategies, dashboard integration, performance gains, operational practices, and overall cost and reliability improvements.
Elasticsearch, a distributed full‑text search engine, has been used by Ctrip to handle over 200 TB of daily logs across more than 500 servers, but growing costs, latency, and operational complexity prompted a search for alternatives.
Why ClickHouse? ClickHouse is a high‑performance columnar distributed DBMS offering significantly higher write throughput (50‑200 MB/s per server, >600 k records/s), 5‑30× faster queries, lower storage (1/3‑1/30 of ES), reduced memory and CPU usage, and better stability through sharding and partitioning.
The log format already fits ClickHouse tables, and most queries are aggregations that align with column‑store strengths, making ClickHouse a suitable replacement for the majority of Ctrip's log workloads.
High‑availability deployment uses multiple shards with two replicas each, coordinated via ZooKeeper, allowing a shard to lose a node without data loss. Two cluster sizes (6‑node and 20‑node) are deployed to accommodate different log volumes, and cross‑IDC clusters are built using distributed tables.
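As a sketch of this layout, each node might host a ReplicatedMergeTree table (one replica per shard path in ZooKeeper), with a Distributed table on top for cross‑shard queries. The database, table, cluster, and column names here are illustrative, not Ctrip's actual schema:

```sql
-- Local replicated table on every node; {shard} and {replica} are filled in
-- from each server's macros configuration.
CREATE TABLE logs.app_log_local
(
    timestamp DateTime,
    host      String,
    level     String,
    message   String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/app_log', '{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp, host);

-- Distributed table used for querying across all shards; ingestion writes to
-- the local tables directly rather than through this table.
CREATE TABLE logs.app_log AS logs.app_log_local
ENGINE = Distributed(log_cluster, logs, app_log_local, rand());
```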
Key configuration parameters include:

```yaml
max_threads: 32                 # max query threads per user
max_memory_usage: 10000000000   # ~9.31 GB per query
max_execution_time: 30          # seconds
skip_unavailable_shards: 1      # continue the query if a shard is down
```

Data ingestion is performed with gohangout, using round‑robin writes across nodes, large low‑frequency batches, direct writes to local tables (avoiding distributed tables for inserts), partitioning by day, and careful primary‑key and index design to prevent slowdowns.
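The write pattern above can be sketched in a few lines. This is not gohangout's actual code, only an illustration of the idea: accumulate records into large batches and send each flush to the next node's local table in round‑robin order. The `send` callable stands in for a real insert client and is hypothetical.

```python
import itertools

class RoundRobinBatcher:
    """Sketch of batched, round-robin ingestion into ClickHouse local tables.

    Large batches at low frequency keep the number of data parts small;
    rotating target nodes spreads write load without a Distributed table.
    """

    def __init__(self, nodes, batch_size, send):
        self._nodes = itertools.cycle(nodes)  # round-robin over shard nodes
        self._batch_size = batch_size         # large batches => fewer inserts
        self._send = send                     # send(node, rows): hypothetical insert call
        self._buffer = []

    def add(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self._send(next(self._nodes), self._buffer)
            self._buffer = []

# Usage sketch: record which node received each batch.
sent = []
batcher = RoundRobinBatcher(["ch1", "ch2", "ch3"], batch_size=2,
                            send=lambda node, rows: sent.append((node, len(rows))))
for i in range(7):
    batcher.add({"msg": i})
batcher.flush()  # flush the final partial batch
```

With seven rows and a batch size of two, the writes land on ch1, ch2, ch3, and back to ch1 for the final partial batch.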
For visualization, Kibana 3 was extended to support ClickHouse, reproducing common chart types (terms, histogram, percentiles, ranges, table) with comparable user experience but dramatically faster query performance.
Query optimizations include splitting large table‑panel queries into two steps (first estimate the row count, then fetch the detailed rows), using approximate calculations, materialized views/columns, and limiting result sets; together these achieve up to 60× faster response times and cut the data processed to as little as 1/120 of the original.
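The two‑step idea can be sketched as a small planner. This is an illustration under stated assumptions, not Ctrip's implementation: step one runs a cheap count() to estimate the result size; step two fetches detail rows, sampling the table when the estimate is large so the detailed query scans only a fraction of the data. The function name, page size, and scan budget are all hypothetical.

```python
def plan_table_query(estimated_rows, page_size=100, max_scan_rows=10_000_000):
    """Given a row-count estimate from step 1, decide how step 2 should run:
    full scan for small results, sampled scan for large ones."""
    if estimated_rows <= max_scan_rows:
        # Small result set: fetch detail rows directly, capped at one page.
        return {"sample": 1.0, "limit": page_size}
    # Large result set: sample just enough of the table to stay within budget.
    sample = max_scan_rows / estimated_rows
    return {"sample": round(sample, 6), "limit": page_size}
```

For example, with 1.2 billion estimated rows and a 10 million row scan budget, the planner samples 1/120 of the table, which matches the reduction figure quoted above.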
A migration tool was built to adjust Kibana dashboard configurations for ClickHouse compatibility.
Operational results show that a single ClickHouse cluster handling ~100 TB of logs (≈600 TB uncompressed) uses far less memory and disk space than Elasticsearch—up to 60 % disk savings and 4.4‑38× query speed improvements—while cutting server resource needs by roughly half.
Basic ClickHouse operations are simpler than with Elasticsearch, covering new log ingestion, performance tuning, partition cleanup, monitoring via ClickHouse‑exporter + VictoriaMetrics + Grafana, and data migration using distributed tables or clickhouse‑copier.
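Partition cleanup, for instance, is a one‑line operation when tables are partitioned by day, and retention can also be automated with a table TTL. The table name below is illustrative:

```sql
-- Drop one expired daily partition explicitly...
ALTER TABLE logs.app_log_local DROP PARTITION 20240101;

-- ...or let ClickHouse expire old rows automatically via a table TTL.
ALTER TABLE logs.app_log_local MODIFY TTL timestamp + INTERVAL 30 DAY;
```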
Common issues such as slow queries, “Too many parts” errors, and startup failures are addressed with configuration adjustments, batch sizing, avoiding distributed‑table writes, and filesystem or table repair procedures.
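Diagnosing "Too many parts" typically starts from the `system.parts` table: a steadily growing count of active parts per partition under frequent small inserts is the usual precursor. A query along these lines (table names will vary per deployment) surfaces the worst offenders:

```sql
-- Count active data parts per partition to spot tables at risk of
-- "Too many parts" errors from overly frequent small-batch inserts.
SELECT table, partition, count() AS parts
FROM system.parts
WHERE active
GROUP BY table, partition
ORDER BY parts DESC
LIMIT 10;
```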
In summary, migrating logs from Elasticsearch to ClickHouse reduces infrastructure costs, lowers operational overhead, and dramatically improves query latency, enhancing user experience during incident investigations, while acknowledging that Elasticsearch remains indispensable for certain use cases.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.