
Real‑Time Low‑Latency Log Monitoring and Storage at Ctrip: Architecture, Clog System, CAT Tracing, and TSDB

This article details Ctrip's large‑scale, real‑time log monitoring solution, covering the overall monitoring architecture, the Clog log system, the CAT tracing platform, and the TSDB metric store, and explains design choices such as write‑heavy indexing, segment‑based storage, and migration to ClickHouse for high‑cardinality data.

DataFunTalk

Overview – Ctrip Monitoring Architecture

Ctrip processes billions of log entries daily through a unified monitoring system. Data is collected by Agents and SDKs supporting protocols such as CAT, Clog, Prometheus, and OpenTelemetry, sent to a Collector for aggregation and pre-processing, and then routed to separate Kafka topics for isolation and decoupling.
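The routing step can be sketched in a few lines. This is a hypothetical illustration of the Collector's topic-selection logic, not Ctrip's actual code; the topic names and the `type` field are assumptions.

```python
def route_topic(record: dict) -> str:
    """Pick a Kafka topic based on the record's data model.

    Separate topics per data type give isolation and decoupling:
    a slow consumer of traces cannot back-pressure metric ingestion.
    Topic names here are illustrative.
    """
    topics = {
        "log": "monitoring.logs",
        "metric": "monitoring.metrics",
        "trace": "monitoring.traces",
    }
    return topics.get(record.get("type", ""), "monitoring.unknown")
```

Downstream consumers then subscribe only to the topics relevant to their pipeline (ETL, TSDB writer, trace indexer).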

Logs are classified into three data models—Log, Metric, and Trace—the three pillars of monitoring. Logs provide fine-grained event details, Metrics are aggregatable counters, and Traces capture cross-process call chains. After ingestion, data flows through ETL/Analyzer pipelines and is persisted in offline warehouses (Hive) for big-data analysis, in TSDB for time-series aggregation, or in Elasticsearch plus HDFS for search.
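The three data models can be summarized as simple record types. This is a minimal sketch to make the distinction concrete; the field names are illustrative assumptions, not Ctrip's schemas.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEvent:
    """Fine-grained event detail: free-form message plus a timestamp."""
    timestamp: float
    message: str


@dataclass
class MetricPoint:
    """Aggregatable counter/gauge: a named value with dimensional tags."""
    name: str
    value: float
    tags: dict


@dataclass
class TraceSpan:
    """One hop in a cross-process call chain, linked by trace/span ids."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span
    operation: str
```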

Clog – Ctrip Log System

Clog is a real-time, disk-free log system handling roughly 2 PB of data per day on a 40-node cluster. The workload is write-heavy and read-light: logs are buffered in in-memory blocks (Mem Blocks) and flushed sequentially to HDD, achieving high throughput while keeping latency low. Indexing adopts sparse, Kafka-style indexes to reduce space, and query conditions are stored in Elasticsearch inverted indexes for flexible UI searches.
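A Kafka-style sparse index can be sketched as follows: only every Nth record gets an index entry (offset → file position), so a lookup binary-searches the sparse index for the nearest preceding entry and then scans forward in the data file. The class and its interval are illustrative assumptions, not Clog's actual implementation.

```python
import bisect


class SparseIndex:
    """Kafka-style sparse index: index only every `interval`-th record."""

    def __init__(self, interval: int = 4):
        self.interval = interval
        self.offsets = []    # indexed record offsets, kept sorted
        self.positions = []  # corresponding byte positions in the segment

    def maybe_add(self, offset: int, position: int) -> None:
        # Indexing only a fraction of records keeps the index small,
        # trading a short forward scan at read time for index space.
        if offset % self.interval == 0:
            self.offsets.append(offset)
            self.positions.append(position)

    def floor_position(self, target: int) -> int:
        """Byte position of the greatest indexed offset <= target;
        the reader scans sequentially from there."""
        i = bisect.bisect_right(self.offsets, target) - 1
        return self.positions[i] if i >= 0 else 0
```

This matches the write-heavy, read-light profile: appends are cheap (at most one index entry per record), and the rare reads pay a bounded sequential scan.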

CAT – Tracing Platform

CAT processes roughly 2 PB of trace data on a 140-node cluster and shares Clog's write-heavy, read-light profile. Traces are stored in HDFS. In V5, dense indexes map each message to a 6-byte offset. V6 drops the per-IP file naming scheme to curb HDFS small-file explosion, and introduces segment-based indexes (4096-entry segments with headers) plus a secondary index to locate blocks efficiently.
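The 6-byte offset and the segment arithmetic can be illustrated briefly. The byte layout and helper names below are assumptions for illustration; CAT's exact on-disk format is not specified in the talk summary.

```python
def encode_offset(offset: int) -> bytes:
    """Pack a file offset into 6 bytes (48 bits addresses up to 256 TB)."""
    assert 0 <= offset < 1 << 48
    return offset.to_bytes(6, "big")


def decode_offset(raw: bytes) -> int:
    """Inverse of encode_offset."""
    return int.from_bytes(raw, "big")


# V6 groups index entries into fixed-size segments of 4096 entries,
# so a small secondary index needs only one entry per segment to
# locate the right block before a dense lookup within it.
SEGMENT_SIZE = 4096


def segment_of(index: int) -> tuple:
    """Map a message index to (segment number, slot within segment)."""
    return divmod(index, SEGMENT_SIZE)
```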

TSDB – Metric Store

Before adopting Prometheus, Ctrip built an internal TSDB inspired by OpenTSDB, storing both data and indexes in HBase with tags encoded into row keys. Later migrations moved metrics to VictoriaMetrics (a 200-node cluster) and ClickHouse (a 600-node cluster) to handle high-cardinality data. ClickHouse's columnar design excels for business logs but can degrade when the number of dimensions grows large.
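The OpenTSDB-inspired row-key scheme can be sketched as: metric name, a coarse time bucket, and the sorted tag pairs concatenated into the key, so all points of one series sort together in HBase. The separators and the hourly bucket width below are illustrative assumptions, not the exact encoding.

```python
def row_key(metric: str, timestamp: int, tags: dict) -> bytes:
    """Build an OpenTSDB-style HBase row key (illustrative encoding).

    Sorting the tags makes the key canonical: the same series always
    produces the same key regardless of tag insertion order, and range
    scans over one metric's time window stay contiguous.
    """
    bucket = timestamp - (timestamp % 3600)  # hourly time bucket
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{metric}|{bucket}|{tag_part}".encode()
```

The downside this design shares with any tag-in-key scheme is high cardinality: every distinct tag combination is a new row-key prefix, which is part of why Ctrip later moved high-cardinality metrics to VictoriaMetrics and ClickHouse.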

Q&A

From the Q&A session: the inverted index is built on Elasticsearch, and only the first 128 bytes of a log message are tokenized for search. ETL is performed by a proprietary framework that handles formatting and pre-aggregation. ClickHouse stores business logs with fixed columns, offering fast queries but limited scalability when dimensions increase.
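The 128-byte tokenization limit implies a byte-level (not character-level) cut before indexing. A minimal sketch, assuming UTF-8 messages and a hypothetical helper name:

```python
TOKENIZE_LIMIT = 128  # only this many bytes of each message are tokenized


def indexable_prefix(message: str) -> str:
    """Return the UTF-8-safe prefix that would be fed to the tokenizer.

    Cutting on a byte boundary can split a multi-byte character, so the
    dangling partial character is dropped rather than corrupting the text.
    """
    raw = message.encode("utf-8")[:TOKENIZE_LIMIT]
    return raw.decode("utf-8", errors="ignore")
```

The practical consequence for users: search terms that only appear past the first 128 bytes of a message will not match in the UI, even though the full message is still retrievable from storage.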

Overall, the talk reveals how Ctrip balances massive data volume, low latency, and efficient query capabilities through careful architecture, storage format choices, and progressive system upgrades.

Tags: distributed systems, Big Data, real-time processing, indexing, time-series database, traceability, log monitoring
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
