Intelligent Monitoring System for Ctrip Hotels: Design, Implementation, and Lessons Learned

This article describes the design and implementation of Ctrip's hotel intelligent monitoring platform: its architecture, its key components (Smart, Mdata, Artemis, and Clog monitoring), the challenges posed by massive log volumes, and the resulting improvements in real-time alerting and testing efficiency.

Ctrip Technology

Sep 12, 2018

Ctrip's hotel business generates massive amounts of telemetry data. At that scale, traditional monitoring cannot pinpoint issues quickly, and testers spend excessive time analyzing anomalies.

To address these pain points, a suite of business‑oriented monitoring tools was built, including the Smart intelligent monitoring platform, the Mdata performance‑point platform, and the Artemis API automation monitoring system, all focused on real‑time analysis of Clog and Elasticsearch (ES) data.

The monitoring landscape includes over 2,000 applications, requiring both proactive detection (simulated user behavior) and passive data collection via embedded instrumentation. Challenges include huge data volume (>200 billion logs, >100 TB daily), multi‑dimensional metrics, and lack of fine‑grained rule configuration.

The key objectives were to make monitoring business-centric, extensible, near-real-time, broad in coverage, and centrally managed. These were pursued through rule-based configuration, distributed execution agents, and integration with the CAT, NOC, release, and SLB systems.

Clog monitoring evolved from version 1.0 (application-level log thresholds) to version 2.0 (department-level intelligent issue detection), adding automated rule generation, email alerts, and CP4 bug creation. It processes over 300,000 tasks daily at a minimum granularity of 20 seconds.
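To make the application-level log threshold idea concrete, here is a minimal sketch that groups log timestamps into 20-second windows and flags windows whose count exceeds a threshold. It is an illustration only, not Clog's actual implementation: the function names, the sample timestamps, and the threshold value are all assumptions.

```python
from collections import Counter

WINDOW_SECONDS = 20  # minimum granularity mentioned in the article


def bucket_counts(timestamps, window=WINDOW_SECONDS):
    """Count log events per fixed window; keys are window start times (epoch seconds)."""
    counts = Counter()
    for ts in timestamps:
        # Round each timestamp down to the start of its window.
        counts[int(ts) // window * window] += 1
    return dict(counts)


def windows_over_threshold(timestamps, threshold, window=WINDOW_SECONDS):
    """Return the sorted window starts whose event count exceeds the threshold."""
    counts = bucket_counts(timestamps, window)
    return sorted(start for start, n in counts.items() if n > threshold)


# Hypothetical error-log timestamps (epoch seconds) for one application.
error_timestamps = [100, 101, 105, 119, 121, 140, 141, 142, 143, 159]
alerts = windows_over_threshold(error_timestamps, threshold=3)
print(alerts)  # [100, 140]
```

Fixed windows keep the per-event work to a single integer division, which matters at high log volumes. A production system would more likely aggregate in the log store than in application code.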

Mdata provides simple JSON-based ES rule configuration, supporting percentage and absolute thresholds, period-over-period ("ring-ratio") alerts, and fast data collection (<10 minutes). Artemis extends this with custom multi-index aggregation, stricter thresholds, and sub-2-minute alert latency for critical business metrics.
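The sketch below shows one way a JSON rule could combine an absolute threshold, a percentage threshold, and a ring-ratio (period-over-period) check. The rule schema, field names, and numbers are hypothetical and are not Mdata's real format. The metric values are passed in directly, where a real system would query them from ES.

```python
import json

# Hypothetical rule; every field name here is an assumption for illustration.
RULE_JSON = """
{
  "name": "order_submit_failures",
  "metric": "failure_count",
  "absolute_threshold": 500,
  "percentage_threshold": 5.0,
  "period_over_period_threshold": 50.0
}
"""


def evaluate_rule(rule, current, total, previous):
    """Return the list of checks that fired.

    current:  metric value in the current window
    total:    total events in the current window (denominator for the percentage)
    previous: metric value in the preceding window
    """
    reasons = []
    if current > rule["absolute_threshold"]:
        reasons.append("absolute")
    if total > 0 and 100.0 * current / total > rule["percentage_threshold"]:
        reasons.append("percentage")
    if previous > 0:
        # Percentage change relative to the preceding window.
        change = 100.0 * (current - previous) / previous
        if change > rule["period_over_period_threshold"]:
            reasons.append("period_over_period")
    return reasons


rule = json.loads(RULE_JSON)
fired = evaluate_rule(rule, current=300, total=4000, previous=150)
print(fired)  # ['percentage', 'period_over_period']
```

In this example the failure count stays under the absolute limit of 500, but it is 7.5% of traffic and has doubled since the previous window, so two of the three checks fire. Combining checks this way lets a rule catch both sudden spikes and high but steady failure rates.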

The platform now runs more than 30 proactive automated checks and tracks dozens of system-level metrics (e.g., machine status, job health, DB consistency, CAT response times). It covers all core hotel applications and has generated thousands of actionable alerts and bugs.

Future directions include leveraging machine‑learning models to further reduce noise, improve alert precision, and achieve smarter, faster monitoring across the increasingly complex hotel ecosystem.
