
Building a Scalable ELK Log Analytics Platform at Ctrip: Lessons Learned

This article recounts how Ctrip’s operations team replaced costly commercial tools with an open‑source ELK stack, detailing requirements, architecture, optimizations, and the impressive scale achieved for real‑time log analysis across thousands of servers.

ITFLY8 Architecture Home

Origin

Logs may look like simple text, but for site‑operations engineers they are a treasure trove. Typical tasks that involve logs include system health monitoring, fault root‑cause analysis, bottleneck diagnosis, and security issue tracking.

Skilled Linux system administrators (SAs) can quickly combine commands such as grep and awk to extract useful information, while developers often build custom storage and analysis tools on top of MySQL, MongoDB, or HBase.

However, at Internet scale, log sources are increasingly scattered across many machines and produce data at ever higher rates, which makes these traditional tools ineffective. Demand for new tools gave rise to commercial products such as Splunk.

Since Ctrip's website operations center was established in 2013, a centralized log-analysis platform has been on the agenda. As China's largest online travel agency (OTA), Ctrip generates dozens of log types amounting to several terabytes per day. Commercial software such as Splunk would have cost nearly ten million RMB per year, which prompted the search for an alternative.

Initial Attempts

The frontline operations team expected a log‑analysis tool to:

Support multiple data sources.

Offer flexible yet simple parsing.

Provide keyword search, browsing, and combined‑condition queries.

Allow numeric aggregation over arbitrary time windows, e.g., average response time or most‑frequent error URLs.
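To make the parsing and aggregation requirements concrete, here is a minimal in-memory sketch of the kind of question the team wanted answered. The log line layout, the field names, and the helper functions are hypothetical and chosen only for illustration; they are not Ctrip's actual log format or tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical access-log layout: "<ISO timestamp> <status> <url> <response ms>"
LINE_RE = re.compile(r"^(\S+) (\d{3}) (\S+) (\d+)$")

SAMPLE_LINES = [
    "2015-06-01T10:00:01 200 /hotel/search 120",
    "2015-06-01T10:00:02 500 /flight/book 950",
    "2015-06-01T10:00:03 200 /hotel/search 80",
    "2015-06-01T10:00:04 500 /flight/book 1010",
    "2015-06-01T10:00:05 404 /missing 15",
    "2015-06-01T10:05:00 200 /hotel/search 300",
]

def parse(line):
    """Turn one raw line into a structured event, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts, status, url, ms = m.groups()
    return {"ts": datetime.fromisoformat(ts), "status": int(status),
            "url": url, "ms": int(ms)}

def in_window(events, start, end):
    """Keep events whose timestamp falls in the half-open window [start, end)."""
    return [e for e in events if start <= e["ts"] < end]

def average_response_ms(events):
    """Average response time, or None for an empty window."""
    return sum(e["ms"] for e in events) / len(events) if events else None

def top_error_urls(events, n=3):
    """Most frequent URLs among events with a 4xx or 5xx status."""
    counts = Counter(e["url"] for e in events if e["status"] >= 400)
    return counts.most_common(n)

events = [e for e in map(parse, SAMPLE_LINES) if e is not None]
window = in_window(events, datetime(2015, 6, 1, 10, 0), datetime(2015, 6, 1, 10, 1))
print(average_response_ms(window))
print(top_error_urls(window))
```

This works for a handful of lines on one machine. The difficulty described in this section is answering the same questions interactively over terabytes of logs spread across thousands of servers.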

Existing MySQL/HBase‑based tools stored logs well but were poor at retrieval; complex queries were extremely slow or impossible, leading to a bad user experience.

After exploring options, the team discovered the ELK stack—Elasticsearch, Logstash, and Kibana. Elasticsearch provides a distributed search engine, Logstash handles flexible log collection, filtering, and forwarding, and Kibana offers a powerful front‑end visualization panel.

Typical ELK architecture:

Logstash acts like a Swiss-army knife: it ingests logs through a variety of input plugins, applies filters, and outputs the results to Elasticsearch for indexing. Redis sits in between as a message queue and improves fault tolerance by buffering logs while Elasticsearch is unavailable.
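The buffering role of the queue can be sketched with a toy model. This is an illustrative stand-in only: a Python deque plays the part of the message queue, and the class and attribute names are hypothetical rather than part of Logstash, Redis, or Elasticsearch.

```python
from collections import deque

class BufferedPipeline:
    """Toy shipper -> queue -> indexer pipeline; the deque stands in for the message queue."""

    def __init__(self):
        self.queue = deque()
        self.indexed = []
        self.indexer_up = True

    def ship(self, event):
        # Shippers only ever write to the queue, so an indexer outage does not block them.
        self.queue.append(event)

    def drain(self):
        # Forward buffered events in arrival order while the indexer is reachable.
        moved = 0
        while self.indexer_up and self.queue:
            self.indexed.append(self.queue.popleft())
            moved += 1
        return moved

pipeline = BufferedPipeline()
pipeline.ship("event-1")
pipeline.drain()

pipeline.indexer_up = False      # simulate an Elasticsearch outage
pipeline.ship("event-2")
pipeline.ship("event-3")
stalled = pipeline.drain()       # nothing is forwarded, nothing is dropped

pipeline.indexer_up = True       # the cluster is back
recovered = pipeline.drain()     # the backlog is delivered in order
```

The point of the design is decoupling: log producers keep writing at their own pace, and the backlog is replayed once the indexing side recovers. A real queue is bounded by memory or disk, which this sketch does not model.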

The team quickly deployed a five‑node test cluster, feeding wireless and VDI logs, and visualized them in Kibana.

Example OpenStack log analysis dashboard:

Elasticsearch, built on Lucene, indexes all log fields with a compact inverted index, delivering high‑speed search. Its aggregation module enables fast distributed calculations, allowing Kibana users to combine filters, adjust time windows, and obtain results in seconds—eliminating the need for Hadoop or Spark for ad‑hoc analysis.
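As a rough illustration of what such a Kibana interaction translates to, the sketch below builds an Elasticsearch search body that combines a field filter with a time window and asks for two aggregations. The field names (`status`, `response_ms`, `url`) and the helper `build_query` are hypothetical. The `bool`/`filter` form shown here is the syntax of later Elasticsearch releases, not necessarily what the 1.x cluster in this article used.

```python
import json

def build_query(status_code, minutes, top_n=5):
    """Build a search body: filter by status and time window, then aggregate."""
    return {
        "size": 0,  # return aggregation results only, no individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"status": status_code}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m", "lt": "now"}}},
                ]
            }
        },
        "aggs": {
            # Average of a numeric field over the matching documents.
            "avg_response_ms": {"avg": {"field": "response_ms"}},
            # Most frequent values of a field, here the URLs producing this status.
            "top_urls": {"terms": {"field": "url", "size": top_n}},
        },
    }

body = build_query(status_code=500, minutes=15)
print(json.dumps(body, indent=2))
```

Sent to an index's `_search` endpoint, a body like this is evaluated on each shard and the partial results are merged, which is what allows the time window or filters to be changed and re-run interactively.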

Continuous Learning, Practice and Optimization

Open-source software still carries a cost when its core principles are not well understood. As log volume grew and the cluster expanded, stability and performance problems emerged, which drove extensive research and a series of optimizations:

Developed an ES monitoring plugin for Ganglia.

Created an open‑source authentication gateway ESProxy to integrate Kibana with corporate SSO and provide index‑level ACLs.

Built Hangout, an open‑source Logstash replacement, boosting throughput five‑fold.

Implemented a Logstash Forwarder-based agent to address the high resource consumption of log agents on Windows.

Added navigation panels, comparative ranking panels, and time‑shift features to improve Kibana usability.

Replaced Redis with Kafka for a more horizontally scalable, fault‑tolerant transport pipeline.

Adopted Doc Values in ES 1.x early to accelerate large‑scale aggregations and improve stability.

Implemented hot‑cold data separation and automatic migration to preserve write latency under heavy query loads.

Separated client and data nodes to enhance cluster stability.
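The article does not describe how the hot-cold migration was implemented. One common approach, sketched below under stated assumptions, relies on Elasticsearch's shard allocation filtering: data nodes are tagged with a custom attribute, and an index is moved by changing its `index.routing.allocation.require.<attribute>` setting. The attribute name `box_type`, the three-day threshold, and the helper functions are assumptions for illustration; the daily `logstash-YYYY.MM.DD` index naming is Logstash's default.

```python
from datetime import date, datetime

HOT_DAYS = 3  # hypothetical: how many days an index stays on hot nodes

def index_day(index_name):
    """Extract the date from a daily index name such as 'logstash-2015.06.01'."""
    return datetime.strptime(index_name.rsplit("-", 1)[1], "%Y.%m.%d").date()

def plan_migration(index_names, today, hot_days=HOT_DAYS):
    """Return {index: settings body} for indices old enough to leave the hot tier."""
    plan = {}
    for name in sorted(index_names):
        if (today - index_day(name)).days >= hot_days:
            # Allocation filtering: require nodes tagged with box_type=cold.
            plan[name] = {"index.routing.allocation.require.box_type": "cold"}
    return plan

indices = ["logstash-2015.06.01", "logstash-2015.06.04", "logstash-2015.06.05"]
plan = plan_migration(indices, today=date(2015, 6, 5))
for name, settings in plan.items():
    print(name, settings)
```

Each settings body would be applied through the index settings API on a schedule. Elasticsearch then relocates the shards to matching nodes in the background, so recent indices keep the fast hardware for writes while older ones absorb heavy queries elsewhere.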

Evolution of the production cluster architecture:

Results and Future Outlook

After more than a year of effort, the ES cluster grew to 40 data nodes, ingesting over 50 log types, processing 16 billion log entries per day (≈5 TB), with stable operation and rapid response. It now serves as a core application for Ctrip’s operations, supporting ops, system development, security, and application teams.
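For a sense of scale, the figures above can be turned into average rates. This is back-of-the-envelope arithmetic only: it assumes decimal terabytes, a perfectly even load over the day and across data nodes, and ignores replicas and peak-hour traffic.

```python
ENTRIES_PER_DAY = 16_000_000_000   # from the figures above
BYTES_PER_DAY = 5 * 10**12         # about 5 TB, assuming decimal terabytes
DATA_NODES = 40
SECONDS_PER_DAY = 24 * 60 * 60

entries_per_second = ENTRIES_PER_DAY / SECONDS_PER_DAY
avg_entry_bytes = BYTES_PER_DAY / ENTRIES_PER_DAY
entries_per_node_per_second = entries_per_second / DATA_NODES

print(f"{entries_per_second:,.0f} entries/s cluster-wide")
print(f"{avg_entry_bytes:.1f} bytes per entry on average")
print(f"{entries_per_node_per_second:,.0f} entries/s per data node")
```

On these assumptions the cluster averages roughly 185,000 entries per second, or about 4,600 per data node, with entries averaging a little over 300 bytes. Real peak rates would be higher than these daily averages.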

Beyond log analysis, Elasticsearch's distributed search capabilities are being used for site-wide search and recommendation services. The team's next internal project, the distributed monitoring platform HickWall, will also be built on Elasticsearch, with the aim of better horizontal scalability and faster numeric aggregation than traditional time-series stores.
