
Evolution of Ctrip's Hickwall Monitoring and Alerting Platform: Architecture, InfluxDB Cluster, Data Aggregation, and Stream Alerting

This article details the architectural evolution of Ctrip's Hickwall monitoring and alerting platform, describing the transition from an Elasticsearch‑based first generation to an InfluxDB‑driven second generation, the design of the Incluster storage layer, data aggregation strategies, and the implementation of high‑performance stream‑based alerting.

Ctrip Technology

Author Bio: Chen Han, an R&D engineer at Ctrip's website operations center, has been developing the Hickwall monitoring and alerting platform, gaining deep insight into monitoring systems and distributed architectures.

Monitoring and alerting form the first line of defense for website availability; Ctrip's next‑generation platform Hickwall significantly improves storage efficiency, query speed, and alarm reliability for large‑scale internet services.

1. Architecture Evolution Overview

The article first compares the first‑generation architecture, which relied on Elasticsearch as the storage engine, with the current design. The original pipeline routed data from a Proxy through formatting, rate‑limiting, Kafka, Downsample aggregation, and finally into Elasticsearch, with Redis caching for recent data. This setup suffered from excessive component count, data backlog in Kafka, and a long data chain that increased the risk of missed or false alarms.

2. InfluxDB Cluster Design

To address Elasticsearch's shortcomings (large disk usage, high I/O, complex indexing, and slow reads and writes), the team adopted InfluxDB, a leading time‑series database, for its efficient range queries, automatic data expiration, and lower operational cost. Early InfluxDB versions were unstable and lacked a built‑in clustering solution, so the team built a custom clustering layer called Incluster.

Incluster adds a metadata layer that tracks data distribution and query routing without modifying InfluxDB’s code. It uses Raft to keep metadata consistent and consistent hashing for data placement, offering three distribution strategies (Series, Measurement, Measurement+Tag). The system can recover a failed node within about half an hour by replaying underlying TSM files.
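The placement idea can be illustrated with a minimal Python sketch. It is not Incluster's code: the node names, the virtual-node count, the MD5-based hash, and the choice of `host` as the tag used by the Measurement+Tag strategy are all assumptions made for illustration. It only shows how a shard key derived from one of the three strategies could be mapped onto a consistent-hash ring.

```python
import bisect
import hashlib


def shard_key(strategy, measurement, tags):
    """Build the string that gets hashed, depending on the distribution strategy."""
    if strategy == "measurement":
        return measurement
    if strategy == "measurement+tag":
        # Assumption: shard on one designated tag; "host" is a hypothetical choice.
        return f"{measurement},host={tags.get('host', '')}"
    if strategy == "series":
        # A series is the measurement plus its full tag set, sorted for stability.
        tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
        return f"{measurement},{tag_part}"
    raise ValueError(f"unknown strategy: {strategy}")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative, not Incluster's)."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["influx-a", "influx-b", "influx-c"])  # hypothetical node names
point_tags = {"host": "web-01", "idc": "sh"}
target = ring.node_for(shard_key("series", "cpu.usage", point_tags))
```

With a ring like this, removing one node only relocates the keys that node owned; keys on the surviving nodes stay where they were, which is the property that makes consistent hashing attractive for data placement.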

Incluster also provides transparent InfluxQL support and a Graphite‑like query language for advanced visualizations, simplifying migration to future time‑series stores.

3. Data Aggregation Exploration

While InfluxDB excels at storage and simple queries, its Continuous Queries (CQs) consume excessive memory and cannot aggregate across nodes. The team therefore moved aggregation outside InfluxDB, using ClickHouse to pre‑aggregate high‑cardinality metrics (e.g., per‑endpoint success rates) that InfluxDB cannot handle efficiently.
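To make the idea of pre‑aggregation concrete, here is a small Python sketch. It does not use ClickHouse and is not the team's pipeline; the event shape `(timestamp, endpoint, succeeded)`, the 60-second window, and the endpoint names are assumptions. It shows raw request events being rolled up into one success rate per endpoint per window, so that only the compact result needs to be stored as a time series.

```python
from collections import defaultdict


def preaggregate(events, window_seconds=60):
    """Roll raw request events up into per-window, per-endpoint success rates.

    events: iterable of (unix_timestamp, endpoint, succeeded) tuples.
    Returns a dict mapping (window_start, endpoint) -> success rate in [0, 1].
    """
    counts = defaultdict(lambda: [0, 0])  # [successes, total]
    for ts, endpoint, succeeded in events:
        # Align the timestamp to the start of its window.
        window_start = int(ts) - int(ts) % window_seconds
        bucket = counts[(window_start, endpoint)]
        bucket[0] += 1 if succeeded else 0
        bucket[1] += 1
    return {key: ok / total for key, (ok, total) in counts.items()}


# Hypothetical sample events.
events = [
    (1700000000, "/api/order", True),
    (1700000010, "/api/order", False),
    (1700000030, "/api/pay", True),
    (1700000065, "/api/order", True),
]
rates = preaggregate(events)
```

The first three events fall into the same one-minute window and the fourth into the next, so four raw rows collapse into three aggregated points.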

4. Stream Alert Implementation

Traditional pull‑based alerting puts pressure on storage; Hickwall instead processes alerts directly from the data stream. By matching measurements and applying a Bloom filter on tag values, the system quickly determines which Trigger node needs each data point, reducing both time and space complexity.
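The routing step might look roughly like the following Python sketch. The article does not describe Hickwall's actual data structures, so everything here is an assumption: the `Router` and `BloomFilter` classes, the filter size, the SHA-256-based hashing, and the `key=value` encoding of tag values are hypothetical. The sketch first matches on measurement, then tests tag values against a per-node Bloom filter.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class Router:
    """Maps each data point to the trigger nodes that may need it."""

    def __init__(self):
        # measurement -> {trigger node name -> BloomFilter of subscribed tag values}
        self._subs = {}

    def subscribe(self, node, measurement, tag_value):
        filters = self._subs.setdefault(measurement, {})
        filters.setdefault(node, BloomFilter()).add(tag_value)

    def route(self, measurement, tags):
        # Step 1: exact match on measurement. Step 2: Bloom check on tag values.
        filters = self._subs.get(measurement, {})
        return [
            node for node, bloom in filters.items()
            if any(f"{k}={v}" in bloom for k, v in tags.items())
        ]


router = Router()
router.subscribe("trigger-1", "cpu.usage", "app=booking")  # hypothetical names
router.subscribe("trigger-2", "cpu.usage", "app=search")
nodes = router.route("cpu.usage", {"app": "booking", "host": "web-01"})
```

Because a Bloom filter can return false positives, a node selected this way would still need an exact check before evaluating a rule; the filter only keeps most irrelevant points from being forwarded, at a fixed memory cost per node.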

The alert engine is built on Akka, leveraging its lightweight actor model for dynamic creation and deletion of alert contexts as services scale up or down. RocksDB, embedded via JNI, caches alert data to minimize JVM heap usage and GC overhead.
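The lifecycle of alert contexts can be approximated without Akka in a few lines of Python. This is only an analogy, not the platform's implementation: plain objects stand in for actors, and the idle-timeout rule for deleting a context is an assumption, since the article does not say what triggers deletion. Contexts are created lazily when the first data point for a key arrives and are removed once the key goes quiet, for example after a service instance is scaled down.

```python
import time


class AlertContext:
    """State for one alert rule on one series (a stand-in for an actor)."""

    def __init__(self, key):
        self.key = key
        self.values = []
        self.last_seen = 0.0

    def receive(self, value, now):
        self.values.append(value)
        self.last_seen = now


class ContextRegistry:
    """Creates contexts on demand and reaps those that have gone idle."""

    def __init__(self, idle_timeout=300.0):
        self.idle_timeout = idle_timeout  # assumed deletion policy
        self._contexts = {}

    def dispatch(self, key, value, now=None):
        now = time.time() if now is None else now
        ctx = self._contexts.get(key)
        if ctx is None:
            # First data point for this key: create its context.
            ctx = self._contexts[key] = AlertContext(key)
        ctx.receive(value, now)
        return ctx

    def reap(self, now=None):
        now = time.time() if now is None else now
        stale = [
            key for key, ctx in self._contexts.items()
            if now - ctx.last_seen > self.idle_timeout
        ]
        for key in stale:
            del self._contexts[key]
        return stale

    def __len__(self):
        return len(self._contexts)


registry = ContextRegistry(idle_timeout=300.0)
registry.dispatch(("cpu_high", "web-01"), 91.0, now=0.0)
registry.dispatch(("cpu_high", "web-02"), 40.0, now=0.0)
registry.dispatch(("cpu_high", "web-01"), 95.0, now=400.0)
removed = registry.reap(now=400.0)
```

Here `web-02` stops reporting and its context is reaped, while `web-01` keeps its accumulated state. In an actor system each context would also own a mailbox and run independently of the others, which this single-threaded sketch leaves out.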

Users write alert logic using a JavaScript‑style DSL: an Init DSL handles data subscription and preprocessing (groupBy, filter, summarize, etc.), while a Run DSL defines the actual alarm conditions. The platform also offers syntax checking and historical back‑testing to aid DSL authoring.
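The split between the two stages can be sketched in Python. The real DSL is JavaScript‑style and its syntax is not shown in the article, so the function names `init_stage` and `run_stage`, the point format, the `env` and `app` tags, and the 0.9 threshold are all hypothetical. The sketch only mirrors the division of labor: the Init step filters, groups, and summarizes subscribed data, and the Run step applies the alarm condition to the result.

```python
from collections import defaultdict
from statistics import mean


def init_stage(points):
    """Init step: filter, group by a tag, and summarize each group."""
    # filter: keep production traffic only (assumed tag name "env")
    kept = [p for p in points if p["tags"].get("env") == "prod"]
    # groupBy: collect values per application (assumed tag name "app")
    groups = defaultdict(list)
    for p in kept:
        groups[p["tags"]["app"]].append(p["value"])
    # summarize: reduce each group to its average
    return {app: mean(values) for app, values in groups.items()}


def run_stage(summary, threshold=0.9):
    """Run step: return the groups whose summary breaches the threshold."""
    return sorted(app for app, avg in summary.items() if avg > threshold)


# Hypothetical data points, e.g. CPU utilization as a fraction.
points = [
    {"tags": {"app": "booking", "env": "prod"}, "value": 0.95},
    {"tags": {"app": "booking", "env": "prod"}, "value": 0.97},
    {"tags": {"app": "search", "env": "prod"}, "value": 0.40},
    {"tags": {"app": "search", "env": "test"}, "value": 0.99},
]
alerts = run_stage(init_stage(points))
```

Keeping preprocessing separate from the condition means the same Run logic can be replayed against historical data, which is in the spirit of the back‑testing feature mentioned above.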

In summary, Hickwall’s evolution—from a component‑heavy Elasticsearch pipeline to a streamlined InfluxDB‑centric architecture with robust clustering, external aggregation, and high‑performance stream‑based alerting—demonstrates a practical approach to building scalable, reliable monitoring systems for large internet enterprises.
