How Ctrip Built a Scalable Observability Platform and AIOps Engine for Millions of Metrics and Logs
This article details Ctrip's end‑to‑end observability platform—covering metrics, logging, and tracing—its architecture, data governance, AIOps capabilities, and practical case studies, while addressing challenges like data volume, alert noise, and metric explosion in a massive micro‑service environment.
Ctrip operates a massive travel‑service platform with nearly 20,000 business applications, about 400,000 instances, 50 billion metrics per minute, and 2 PB of new log data daily. Their observability data is organized around three pillars: Metrics, Logging, and Trace.
1. Ctrip Observability Platform Overview
Metrics : system performance (CPU, memory, disk) and business‑level indicators (request count, error count, response time, custom business metrics such as order‑channel traffic).
Logging : system logs (/var/log/messages, security logs), application logs (error, order, user‑query logs), and third‑party logs (open‑source product outputs, load‑balancer logs). Logs are enriched with traceid to link with traces.
Trace : generated by embedding traceid in logs, enabling end‑to‑end call‑chain visualization across micro‑services.
These three data types interrelate: trace links logs, logs can be turned into metrics (e.g., error‑rate metrics), and metrics can be used to locate root causes.
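The logs-to-metrics relationship above can be sketched as a small aggregation: roll structured log records up into a per-minute error-rate metric. The record fields (`timestamp_min`, `level`) are illustrative assumptions, not Ctrip's actual log schema.

```python
from collections import Counter

def logs_to_error_rate(log_records):
    """Aggregate structured log records into a per-minute error-rate metric.

    Each record is assumed to be a dict with 'timestamp_min' (the minute
    bucket) and 'level' fields; the field names are illustrative only.
    """
    totals, errors = Counter(), Counter()
    for rec in log_records:
        minute = rec["timestamp_min"]
        totals[minute] += 1
        if rec["level"] == "ERROR":
            errors[minute] += 1
    # Error rate per minute bucket: errors / total log lines
    return {m: errors[m] / totals[m] for m in totals}

logs = [
    {"timestamp_min": "12:00", "level": "INFO"},
    {"timestamp_min": "12:00", "level": "ERROR"},
    {"timestamp_min": "12:01", "level": "INFO"},
]
print(logs_to_error_rate(logs))  # {'12:00': 0.5, '12:01': 0.0}
```

A metric derived this way can then feed the same alerting pipeline as natively emitted metrics.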
Observability Data Benefits
Monitoring & Alerting : alarms on business‑level metrics and OS anomalies.
Fault Handling : unified view across the three pillars enables rapid fault isolation and supports self‑healing.
Root‑Cause Analysis : combine trace and metrics to pinpoint impact scope.
Key Challenges
Massive micro‑service architecture – ever‑growing service count and daily changing call graphs produce huge observability data.
Cloud‑native HPA creates rapid instance scaling, generating massive new time‑series that stress TSDBs.
1‑5‑10 goal: detect faults within 1 min, locate them within 5 min, and remediate within 10 min.
Platform stability under high data volume and numerous alerts.
Data timeliness – avoid false alerts from delayed or missing data.
Query efficiency – millisecond‑level metric queries, sub‑second log queries within an hour, and log retention >7 days.
2. Platform Architecture
The platform uses a unified Grafana‑based UI, integrating configuration dashboards, charts, log and trace queries. Under the hood:
Metrics Query Layer : abstracts multiple time‑series databases (clustered by BU or logical division) into a single data source, providing raw data management and governance.
Logging Query Layer : built on ClickHouse, supports hot‑cold storage tiers and archiving.
Trace System : extended from CAT, adds metric‑log correlation, supports OpenTelemetry, and offers global reporting.
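A minimal sketch of the metrics query layer's routing idea: present many per-BU TSDB clusters as one data source by routing each query to the right backend. The cluster names and the metric-name prefix convention here are assumptions for illustration, not Ctrip's actual scheme.

```python
class MetricsQueryProxy:
    """Route a metric query to the appropriate per-BU TSDB cluster.

    Cluster names and the "<bu>.<metric>" naming convention are
    illustrative assumptions.
    """

    def __init__(self, bu_clusters, default_cluster):
        self.bu_clusters = bu_clusters      # BU name -> cluster endpoint
        self.default_cluster = default_cluster

    def route(self, metric_name):
        # Assumed convention: metric names are prefixed with the BU name.
        bu = metric_name.split(".", 1)[0]
        return self.bu_clusters.get(bu, self.default_cluster)

proxy = MetricsQueryProxy(
    {"flight": "vm-flight", "hotel": "vm-hotel"},
    default_cluster="vm-shared",
)
print(proxy.route("hotel.order.count"))  # vm-hotel
print(proxy.route("infra.cpu.usage"))    # vm-shared
```

Callers see a single data source; sharding by BU stays an internal detail of the proxy.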
3. AIOps Platform Practices
The AIOps platform focuses on three domains: monitoring & alerting, capacity management, and change management.
Smart Alerting : dynamic thresholds derived from data curves reduce false positives.
Fault Auto‑Healing : predefined scenarios enable automatic root‑cause detection and remediation.
Capacity Management : predicts traffic spikes (e.g., holidays) and pre‑configures HPA to handle load.
Change Management : during deployments, the system captures metrics (request count, error rate, latency) and automatically brakes releases when anomalies appear.
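The dynamic-threshold idea behind smart alerting can be sketched with a simple statistical rule: derive bounds from the metric's recent history instead of a fixed value. This mean-plus-k-sigma rule is a minimal stand-in; production systems typically also model seasonality, which is omitted here.

```python
import statistics

def dynamic_threshold(history, sensitivity=3.0):
    """Derive alert bounds from a metric's recent history (sketch).

    Uses mean +/- sensitivity * stdev; seasonality handling is omitted.
    """
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean - sensitivity * stdev, mean + sensitivity * stdev

def is_anomalous(value, history, sensitivity=3.0):
    lo, hi = dynamic_threshold(history, sensitivity)
    return value < lo or value > hi

# Stable request-count history -> a spike stands out, normal values do not.
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_anomalous(100, history))  # False
print(is_anomalous(150, history))  # True
```

The `sensitivity` knob maps naturally onto the user-set alert sensitivity mentioned in the case studies below.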
4. Observability Data Governance
Log Architecture & Governance
Logs flow from client‑side instrumentation and system agents to a Kafka gateway, then to a log cluster offering metadata management and query gateways (API & UI).
Growth drivers include business expansion (≈50 new log scenarios per month, 2 PB daily growth), audit requirements, and developer habits of verbose logging.
Governance measures:
Unified query layer, storage engine, and raw‑log management to enforce best practices.
Query governance: intelligent SQL rewriting, QPS limits, and restrictions on heavy scans.
Log best‑practice: standardized schemas, approval workflow, retention policies, and quota control at the source.
Cold‑hot storage tiering with local disk + object storage, tenant‑level expansion policies.
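The query-governance QPS limit above can be sketched as a per-tenant token bucket: each tenant gets a steady refill rate plus a small burst allowance, and queries beyond that are rejected. The class and parameters are illustrative assumptions, not Ctrip's actual implementation.

```python
import time

class QueryRateLimiter:
    """Per-tenant token-bucket QPS limit for log queries (sketch)."""

    def __init__(self, qps, burst):
        self.rate = qps            # tokens replenished per second
        self.capacity = burst      # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        # Replenish tokens based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = QueryRateLimiter(qps=2, burst=2)
print([limiter.allow() for _ in range(3)])  # burst of 2 allowed, third rejected
```

Combined with SQL rewriting and scan restrictions, this keeps one tenant's heavy queries from starving the shared ClickHouse clusters.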
Alert Governance
Alert volume grew ~30 % year‑over‑year, risking alert fatigue. Measures include a unified alert center, standardized severity levels, response‑time SLAs, escalation mechanisms, and noise‑reduction via aggregation, auto‑suppression, and convergence.
Metric Inflation Governance
HPA‑driven instance churn inflates metric cardinality, stressing TSDBs. Solutions:
Pre‑aggregation to collapse high‑cardinality dimensions (IP, instance ID).
Automatic detection of abnormal high‑cardinality metrics, disabling them and notifying owners.
Filtering to promote application‑level dimensions over instance‑level ones.
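The pre-aggregation step above can be sketched as dropping high-cardinality labels before writing to the TSDB, summing values into the coarser application-level series. The label names (`ip`, `instance_id`, `app`) follow the text; the data shape is an illustrative assumption.

```python
from collections import defaultdict

def preaggregate(samples, drop_labels=("ip", "instance_id")):
    """Collapse high-cardinality labels before TSDB ingestion (sketch).

    `samples` are (labels_dict, value) pairs; label names are illustrative.
    """
    out = defaultdict(float)
    for labels, value in samples:
        # Keep only low-cardinality labels; sort for a stable series key.
        kept = tuple(sorted((k, v) for k, v in labels.items()
                            if k not in drop_labels))
        out[kept] += value
    return dict(out)

samples = [
    ({"app": "order", "instance_id": "i-1"}, 5.0),
    ({"app": "order", "instance_id": "i-2"}, 7.0),
    ({"app": "search", "instance_id": "i-3"}, 2.0),
]
print(preaggregate(samples))  # two series remain instead of three
```

Under HPA churn, the number of series then tracks the number of applications rather than the number of short-lived instances.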
5. Architecture Upgrades for AIOps
Unified Time‑Series DB & Query Entry Point : uses open‑source VictoriaMetrics with BU‑level clusters, wrapped by a unified query proxy. Raw metric names and fields are also stored in ClickHouse for metadata management.
Log Query Layer Design : multiple ClickHouse clusters merged via a proxy; SQL is parsed, rewritten, and split per cluster, then results are merged.
Unified Agent Collection : a single agent service gathers system, kernel, and custom logs, sending all observability data through a common pipeline.
Data Governance & Value Realization : unified query/storage, quality governance at write and read paths, classification of high‑value vs. low‑value data, archiving low‑value data to cut costs.
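The log query layer's scatter-gather can be sketched as fanning one query out to every ClickHouse cluster in parallel and merging the partial results. The per-cluster SQL rewriting mentioned above is omitted; `run_query` stands in for a real client, and the row shape is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(clusters, run_query, sql, limit):
    """Fan a log query out to all clusters and merge the results (sketch).

    `run_query(cluster, sql)` stands in for the per-cluster client; real
    per-cluster SQL rewriting is omitted here.
    """
    with ThreadPoolExecutor(max_workers=len(clusters)) as pool:
        partials = list(pool.map(lambda c: run_query(c, sql), clusters))
    merged = [row for part in partials for row in part]
    merged.sort(key=lambda row: row["ts"], reverse=True)  # newest first
    return merged[:limit]

def fake_run(cluster, sql):
    # Stand-in client: each cluster returns rows carrying a timestamp.
    return [{"ts": 10, "cluster": cluster}, {"ts": 5, "cluster": cluster}]

rows = scatter_gather(["ch-a", "ch-b"], fake_run, "SELECT ...", limit=3)
print([r["ts"] for r in rows])  # [10, 10, 5]
```

Sorting and truncating after the merge keeps the proxy's contract identical to querying a single cluster.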
6. Case Studies & Future Outlook
Automated Disk‑Failure Handling : AI detects faulty disks, calls APIs to detach them, and reintegrates after repair, achieving fully automated remediation.
Intelligent Alerting : users set sensitivity and type; AI learns thresholds from history and triggers alerts on anomalies such as success‑rate drops.
Fault Localization : combines metrics and trace data; high‑error applications with many downstream dependencies are identified as root causes.
AIOps Assistant & Large‑Model Integration : an interactive chatbot provides rule explanations, suggests remediation, summarizes incident discussions, and offers improvement recommendations.
Future plans include expanding fault prediction, deeper root‑cause analysis, and broader AI‑assisted decision making to further boost operational efficiency.
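The fault-localization heuristic described above (high-error applications with many downstream dependencies are likely root causes) can be sketched as a scoring function. The formula itself is an illustrative assumption, not Ctrip's actual algorithm.

```python
def locate_root_cause(apps):
    """Rank candidate root causes by errors weighted by fan-out (sketch).

    Each app dict carries an error count and its downstream-dependency
    count; the scoring formula is an illustrative assumption.
    """
    def score(app):
        # More errors and more downstream dependencies -> more suspicious.
        return app["errors"] * (1 + app["downstream_deps"])
    return sorted(apps, key=score, reverse=True)

apps = [
    {"name": "gateway", "errors": 20, "downstream_deps": 1},
    {"name": "order-svc", "errors": 50, "downstream_deps": 8},
    {"name": "cache", "errors": 5, "downstream_deps": 0},
]
print(locate_root_cause(apps)[0]["name"])  # order-svc
```

In practice the dependency counts would come from the trace system's call graph rather than static configuration.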
Overall Impact : Ctrip’s observability platform now covers >99 % of machines, supports massive data volumes, and powers a robust AIOps engine that meets the 1‑5‑10 operational goals.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
