
How Ctrip Built a Scalable Observability Platform and AIOps Engine for Millions of Metrics and Logs

This article details Ctrip's end‑to‑end observability platform—covering metrics, logging, and tracing—its architecture, data governance, AIOps capabilities, and practical case studies, while addressing challenges like data volume, alert noise, and metric explosion in a massive micro‑service environment.

dbaplus Community

Ctrip operates a massive travel-service platform with nearly 20,000 business applications, about 400,000 instances, 50 billion metrics per minute, and 2 PB of new log data daily. Its observability data is organized around three pillars: Metrics, Logging, and Trace.

1. Ctrip Observability Platform Overview

Metrics: system performance (CPU, memory, disk) and business-level indicators (request count, error count, response time, and custom business metrics such as order-channel traffic).

Logging: system logs (/var/log/messages, security logs), application logs (error, order, and user-query logs), and third-party logs (open-source product output, load-balancer logs). Logs are enriched with a traceid to link them with traces.

Trace: generated by embedding a traceid in logs, enabling end-to-end call-chain visualization across micro-services.

These three data types interrelate: traces link related logs together, logs can be distilled into metrics (e.g., error-rate metrics), and metrics help pinpoint root causes.
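To make the logs-to-metrics path concrete, here is a minimal sketch, assuming a hypothetical log format in which each line carries an application name and a traceid; the pattern and field names are illustrative, not Ctrip's actual schema.

```python
import re
from collections import Counter

# Hypothetical log format: "TIMESTAMP LEVEL app=NAME traceid=ID message".
LOG_PATTERN = re.compile(
    r"(?P<level>ERROR|WARN|INFO)\s+app=(?P<app>\S+)\s+traceid=(?P<traceid>\S+)"
)

def logs_to_error_counts(lines):
    """Count ERROR lines per application, yielding error-count metric samples."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and match.group("level") == "ERROR":
            counts[match.group("app")] += 1
    return dict(counts)  # each (app, count) pair is one metric data point

sample = [
    "2024-05-01T10:00:00 ERROR app=order-service traceid=abc123 payment failed",
    "2024-05-01T10:00:01 INFO app=order-service traceid=abc124 order placed",
    "2024-05-01T10:00:02 ERROR app=search-service traceid=abc125 timeout",
]
print(logs_to_error_counts(sample))  # {'order-service': 1, 'search-service': 1}
```

Because the traceid survives in the original log lines, any anomaly in the derived metric can still be traced back to the exact requests that caused it.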

Observability Data Benefits

Monitoring & Alerting: alerting on business-level metrics and OS-level anomalies.

Fault Handling: a unified view across the three pillars enables rapid fault isolation and supports self-healing.

Root-Cause Analysis: combining trace and metric data pinpoints the scope of impact.

Key Challenges

Massive micro-service architecture: an ever-growing service count and daily-changing call graphs produce enormous volumes of observability data.

Cloud-native elasticity: HPA (horizontal pod autoscaling) drives rapid instance scaling, generating massive numbers of new time series that stress the TSDBs.

The 1-5-10 goal: detect an incident within 1 minute, locate it within 5 minutes, and remediate it within 10 minutes.

Platform stability under high data volume and large alert counts.

Data timeliness: delayed or missing data must not trigger false alerts.

Query efficiency: millisecond-level metric queries, sub-second queries over the most recent hour of logs, and log retention beyond 7 days.

2. Platform Architecture

The platform uses a unified Grafana-based UI that integrates dashboard configuration, charts, and log and trace queries. Under the hood:

Metrics Query Layer: abstracts multiple time-series databases (clustered by BU, i.e., business unit, or other logical division) into a single data source, and provides raw-data management and governance (see the sketch after this list).

Logging Query Layer: built on ClickHouse; supports hot/cold storage tiers and archiving.

Trace System: extended from CAT; adds metric-log correlation, supports OpenTelemetry, and offers global reporting.
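As a rough illustration of how such a query layer can route requests, here is a minimal sketch. The BU cluster URLs and routing rule are assumptions; the Prometheus-compatible /api/v1/query endpoint is the one VictoriaMetrics (used in section 5) exposes.

```python
import requests  # third-party HTTP client (pip install requests)

# Hypothetical BU-scoped time-series clusters presented as one data source.
BU_CLUSTERS = {
    "flight": "http://vm-flight:8428",
    "hotel": "http://vm-hotel:8428",
}

def query_metrics(promql: str, bu: str):
    """Route a PromQL query to the time-series cluster owning the given BU."""
    base = BU_CLUSTERS.get(bu)
    if base is None:
        raise ValueError(f"unknown BU: {bu}")
    resp = requests.get(f"{base}/api/v1/query", params={"query": promql}, timeout=5)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Callers see one logical data source, regardless of which cluster answers:
# query_metrics('sum(rate(http_requests_total[5m]))', bu="hotel")
```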

3. AIOps Platform Practices

The AIOps platform focuses on three domains: monitoring & alerting, capacity management, and change management.

Smart Alerting: dynamic thresholds derived from historical data curves reduce false positives (see the sketch after this list).

Fault Auto-Healing: predefined scenarios enable automatic root-cause detection and remediation.

Capacity Management: predicts traffic spikes (e.g., holidays) and pre-configures HPA to absorb the load.

Change Management: during deployments, the system tracks key metrics (request count, error rate, latency) and automatically halts the release when anomalies appear.
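A minimal sketch of the dynamic-threshold idea, assuming a simple rolling mean-and-deviation model; the production system learns from historical curves with far more sophistication, so treat this only as the shape of the computation.

```python
import statistics

def dynamic_threshold(history, window=60, k=3.0):
    """Return (lower, upper) bounds learned from the last `window` samples."""
    recent = history[-window:]
    mean = statistics.fmean(recent)
    stdev = statistics.pstdev(recent)
    return mean - k * stdev, mean + k * stdev

def is_anomalous(history, value, window=60, k=3.0):
    """Flag a new sample that falls outside the learned band."""
    lower, upper = dynamic_threshold(history, window, k)
    return value < lower or value > upper

# Example: a request-rate series hovering around 100 req/s.
series = [100 + (i % 5) for i in range(120)]
print(is_anomalous(series, 103))  # False: within the learned band
print(is_anomalous(series, 250))  # True: far outside the band
```

Unlike a fixed threshold, the band follows the metric's own curve, which is what cuts the false positives on metrics with strong daily or seasonal patterns.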

4. Observability Data Governance

Log Architecture & Governance

Logs flow from client-side instrumentation and system agents to a Kafka gateway, and from there into a log cluster that provides metadata management and query gateways (API and UI).
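For a sense of the agent-to-Kafka leg of this pipeline, here is a minimal sketch using the kafka-python client; the broker address, topic name, and record schema are illustrative assumptions.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical gateway address and topic; records are JSON-encoded.
producer = KafkaProducer(
    bootstrap_servers="kafka-gateway:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def ship_log(app: str, traceid: str, message: str):
    """Send one enriched log record to the central log topic."""
    producer.send("observability-logs", {
        "app": app,
        "traceid": traceid,  # carried along so logs can be joined with traces
        "message": message,
    })

ship_log("order-service", "abc123", "payment failed: gateway timeout")
producer.flush()  # block until in-flight records are delivered
```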

Growth drivers include business expansion (≈50 new log scenarios per month, 2 PB daily growth), audit requirements, and developer habits of verbose logging.

Governance measures:

A unified query layer, storage engine, and raw-log management to enforce best practices.

Query governance: intelligent SQL rewriting, QPS limits, and restrictions on heavy scans (see the sketch after this list).

Log best practices: standardized schemas, an approval workflow, retention policies, and quota control at the source.

Hot/cold storage tiering with local disk plus object storage, and tenant-level expansion policies.
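Two of these measures, QPS limiting and heavy-scan restriction, can be sketched minimally as follows. The limits, column name, and crude string-level rewrite are illustrative assumptions; a real rewriter would operate on the parsed SQL, not on raw strings.

```python
import time

class QpsLimiter:
    """Token bucket allowing at most `qps` queries per second."""
    def __init__(self, qps: int):
        self.qps = qps
        self.tokens = float(qps)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket size.
        self.tokens = min(float(self.qps), self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def guard_query(sql: str) -> str:
    """Naively force a time filter onto queries that mention no timestamp."""
    if "timestamp" not in sql.lower():
        return sql + " WHERE timestamp >= now() - INTERVAL 1 HOUR"
    return sql

limiter = QpsLimiter(qps=5)
if limiter.allow():
    print(guard_query("SELECT count() FROM logs"))
```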

Alert Governance

Alert volume grew roughly 30% year over year, risking alert fatigue. Countermeasures include a unified alert center, standardized severity levels, response-time SLAs, escalation mechanisms, and noise reduction via aggregation, auto-suppression, and convergence.
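As an illustration of noise reduction by aggregation and suppression, here is a minimal sketch that fires at most one notification per (application, rule) pair per window and counts the rest toward a later summary; the grouping key and window length are assumptions.

```python
import time
from collections import defaultdict

class AlertAggregator:
    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.last_fired = {}             # (app, rule) -> last notification time
        self.suppressed = defaultdict(int)

    def process(self, alert: dict) -> bool:
        """Return True if the alert should notify, False if suppressed."""
        key = (alert["app"], alert["rule"])
        now = time.time()
        if now - self.last_fired.get(key, 0.0) >= self.window:
            self.last_fired[key] = now
            return True
        self.suppressed[key] += 1        # folded into the next summary
        return False

agg = AlertAggregator(window_seconds=300)
print(agg.process({"app": "order-service", "rule": "error-rate"}))  # True
print(agg.process({"app": "order-service", "rule": "error-rate"}))  # False
```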

Metric Inflation Governance

HPA‑driven instance churn inflates metric cardinality, stressing TSDBs. Solutions:

Pre-aggregation to collapse high-cardinality dimensions such as IP and instance ID (sketched after this list).

Automatic detection of abnormally high-cardinality metrics, disabling them and notifying their owners.

Filtering that promotes application-level dimensions over instance-level ones.
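The pre-aggregation idea can be sketched minimally as follows, assuming samples arrive as label-set/value pairs; the label names are illustrative.

```python
from collections import defaultdict

# Instance-level samples are summed into application-level series before
# storage, dropping the high-cardinality labels that HPA churn inflates.
HIGH_CARDINALITY_LABELS = {"ip", "instance_id"}

def pre_aggregate(samples):
    """samples: list of (labels_dict, value); returns collapsed series."""
    rollup = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted(
            (k, v) for k, v in labels.items()
            if k not in HIGH_CARDINALITY_LABELS
        ))
        rollup[kept] += value
    return dict(rollup)

samples = [
    ({"app": "order", "metric": "req_count", "ip": "10.0.0.1"}, 120.0),
    ({"app": "order", "metric": "req_count", "ip": "10.0.0.2"}, 80.0),
]
# Two instance-level series collapse into one application-level series.
print(pre_aggregate(samples))
```

Scaling from two instances to two hundred then adds no new series at all, which is exactly what shields the TSDB from HPA-driven churn.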

5. Architecture Upgrades for AIOps

Unified Time-Series DB & Query Entry Point: open-source VictoriaMetrics deployed as BU-level clusters, wrapped by a unified query proxy. Raw metric names and fields are also stored in ClickHouse for metadata management.

Log Query Layer Design: multiple ClickHouse clusters are merged behind a proxy; SQL is parsed, rewritten, and split per cluster, and the partial results are merged (see the sketch after this list).

Unified Agent Collection: a single agent service gathers system, kernel, and custom logs, sending all observability data through a common pipeline.

Data Governance & Value Realization: unified query and storage, quality governance on both write and read paths, classification of high-value vs. low-value data, and archiving of low-value data to cut costs.
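A rough sketch of the scatter-gather step in the log query layer, assuming the clickhouse-driver client and hypothetical cluster hostnames. The per-cluster SQL parsing and rewriting is omitted, and plain concatenation only works for non-aggregating queries; aggregates would need re-merging.

```python
from concurrent.futures import ThreadPoolExecutor
from clickhouse_driver import Client  # pip install clickhouse-driver

# Hypothetical cluster hostnames behind the unified log query proxy.
CLUSTER_HOSTS = ["ch-cluster-a", "ch-cluster-b"]

def scatter_gather(sql: str):
    """Fan one query out to every cluster and concatenate the row sets."""
    def run(host):
        return Client(host=host).execute(sql)
    with ThreadPoolExecutor(max_workers=len(CLUSTER_HOSTS)) as pool:
        partials = pool.map(run, CLUSTER_HOSTS)
    merged = []
    for rows in partials:
        merged.extend(rows)  # simple concatenation of per-cluster results
    return merged

# rows = scatter_gather(
#     "SELECT timestamp, app, message FROM logs WHERE level = 'ERROR' LIMIT 100"
# )
```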

6. Case Studies & Future Outlook

Automated Disk-Failure Handling: AI detects faulty disks, calls platform APIs to detach them, and reintegrates them after repair, achieving fully automated remediation (sketched below).
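The control loop behind this case can be sketched as follows; every call on `ops` is an invented placeholder, since the article does not describe Ctrip's internal interfaces, and only the shape of the detect-detach-repair-reattach loop is the point.

```python
import time

def remediate_faulty_disks(ops):
    """One pass of the detect -> detach -> repair -> reattach loop."""
    for disk in ops.detect_faulty_disks():  # anomaly model flags suspect disks
        ops.detach(disk)                    # take the disk out of service via API
    for disk in ops.list_repaired_disks():  # repairs confirmed out of band
        ops.reattach(disk)                  # reintegrate into the storage pool

def run_forever(ops, interval_seconds=60):
    """Run as a periodic control loop rather than a one-shot script."""
    while True:
        remediate_faulty_disks(ops)
        time.sleep(interval_seconds)
```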

Intelligent Alerting: users set sensitivity and type; AI learns thresholds from history and triggers alerts on anomalies such as success-rate drops.

Fault Localization: combines metric and trace data; applications with high error rates and many downstream dependencies are ranked as likely root causes (illustrated below).
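A minimal sketch of this heuristic, assuming per-application error rates and a downstream-dependency map extracted from traces; the scoring formula is an illustrative assumption, not Ctrip's actual algorithm.

```python
def rank_root_causes(error_rates, call_graph, top_n=3):
    """error_rates: app -> error rate; call_graph: app -> set of downstreams."""
    scores = {
        # Weight the error rate by how many services sit downstream.
        app: rate * (1 + len(call_graph.get(app, ())))
        for app, rate in error_rates.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

error_rates = {"gateway": 0.02, "order": 0.30, "payment": 0.28}
call_graph = {"gateway": {"order"}, "order": {"payment", "inventory", "coupon"}}
print(rank_root_causes(error_rates, call_graph))
# 'order' ranks first: high error rate plus the most downstream dependencies
```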

AIOps Assistant & Large-Model Integration: an interactive chatbot provides rule explanations, suggests remediation, summarizes incident discussions, and offers improvement recommendations.

Future plans include expanding fault prediction, deeper root‑cause analysis, and broader AI‑assisted decision making to further boost operational efficiency.

Overall Impact: Ctrip's observability platform now covers more than 99% of machines, supports massive data volumes, and powers a robust AIOps engine that meets the 1-5-10 operational goals.
