How Ctrip Handles Billions of Logs Daily: Real‑Time Monitoring, Clog, CAT & TSDB
This article details Ctrip’s large‑scale log monitoring architecture. It covers the overall architecture, the Clog log system, the CAT tracing platform, and the internal TSDB solution, and explains how billions of logs are processed in real time with low latency, high reliability, and efficient querying.
Overview (Ctrip Monitoring Architecture)
The overall design is a generic large‑scale log monitoring architecture, used by Ctrip and common across the industry. Data is collected by agents and SDKs (e.g., CAT, Clog, Prometheus, OpenTelemetry). Clients send their data to a Collector for aggregation and preprocessing, and the Collector routes it to different Kafka topics for isolation and decoupling.
Monitoring data is categorized into Log, Metric, and Trace, the three pillars of monitoring. Logs provide fine‑grained event details, Metrics are aggregatable counters, and Traces capture cross‑process call chains. Each type follows its own processing pipeline (ETL, Analyzer) and is persisted in offline warehouses (e.g., Hive), time‑series databases (TSDB), or search systems (Elasticsearch + HDFS). Visualization uses Grafana and a custom UI.
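The routing step performed by the Collector can be illustrated with a minimal sketch. The topic names and the record shape (a dict with a "kind" field) are assumptions made for illustration; the article does not describe Ctrip's actual topics or wire format, and a real Collector would hand each batch to a Kafka producer rather than return it.

```python
from collections import defaultdict

# Hypothetical topic names, one per data type, so each pipeline is isolated.
TOPIC_BY_KIND = {
    "log": "monitor.log",
    "metric": "monitor.metric",
    "trace": "monitor.trace",
}


def route(records):
    """Group incoming records into per-topic batches by their 'kind' field."""
    batches = defaultdict(list)
    for record in records:
        # Unknown kinds go to a separate topic instead of polluting a pipeline.
        topic = TOPIC_BY_KIND.get(record.get("kind"), "monitor.unknown")
        batches[topic].append(record)
    return dict(batches)
```

Keeping one topic per data type means a slow Trace consumer cannot hold back Metric processing, which is the isolation and decoupling described above.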
Clog – Ctrip’s Log System
Clog is a real‑time log system that avoids writing logs to local disks. It processes about 2 PB of logs per day on ~40 physical machines (32 CPU, 128 GB). The UI supports full‑text search, tag‑based filtering, and title queries.
Key design principles: the workload is “write‑heavy, read‑light”, so the system optimizes for writes. Logs are accumulated in in‑memory buffers (Mem Block) and flushed sequentially to HDD, and the index is sparse (as in Kafka), which reduces index size at the cost of slightly higher query latency. Conditional composite queries are supported via Elasticsearch inverted indexes built from log metadata (app ID, time, size, tags, etc.).
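The combination of buffered sequential writes and a sparse index can be sketched as follows. This is an illustration of the general technique, not Clog's implementation: the class name, the length-prefixed record format, and the choice of one index entry per flushed block are all assumptions, and an in-memory BytesIO stands in for the file on HDD.

```python
import bisect
import io
import struct


class SparseLogStore:
    """Append-only store: buffered writes, one index entry per flushed block."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.disk = io.BytesIO()    # stand-in for a sequentially written file
        self.buffer = bytearray()   # the in-memory block
        self.buffer_first_seq = 0   # sequence number of the first buffered record
        self.next_seq = 0
        self.index_seqs = []        # first sequence number of each flushed block
        self.index_offsets = []     # file offset where that block starts

    def append(self, payload: bytes) -> int:
        seq = self.next_seq
        self.next_seq += 1
        # Records are length-prefixed so a reader can walk a block.
        self.buffer += struct.pack(">I", len(payload)) + payload
        if len(self.buffer) >= self.block_size:
            self.flush()
        return seq

    def flush(self):
        if not self.buffer:
            return
        self.disk.seek(0, io.SEEK_END)
        self.index_seqs.append(self.buffer_first_seq)
        self.index_offsets.append(self.disk.tell())
        self.disk.write(self.buffer)  # one sequential write per block
        self.buffer.clear()
        self.buffer_first_seq = self.next_seq

    def read(self, seq: int) -> bytes:
        self.flush()
        if not 0 <= seq < self.next_seq:
            raise KeyError(seq)
        # The sparse index only locates the block; the record is found by scanning.
        i = bisect.bisect_right(self.index_seqs, seq) - 1
        self.disk.seek(self.index_offsets[i])
        current = self.index_seqs[i]
        while True:
            (length,) = struct.unpack(">I", self.disk.read(4))
            payload = self.disk.read(length)
            if current == seq:
                return payload
            current += 1
```

The index grows with the number of blocks rather than the number of records, and a read pays for that with a short scan inside one block. That is the trade-off described above: a smaller index in exchange for slightly slower queries.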
CAT – Ctrip’s Tracing Platform
CAT provides distributed tracing for billions of events daily on 140 physical machines (32 CPU, 128 GB). It shares the “write‑heavy, read‑light” characteristic.
Features include trace‑ID point queries and storage on HDFS. The original V5 version stored data and indexes in a dense format (6‑byte index per record). To mitigate HDFS file‑count explosion, V6 introduced a segmented index structure: each index file contains 4096‑segment groups with a header mapping IP to segment, with an 8‑byte index entry per record, which also enables a secondary index.
Write flow: Data is written to a Block; when the Block reaches 64 KB it is flushed to disk, and the corresponding index entry (BlockAdd, BlockOffset) is written.
Read flow: Given a CAT message index, the system computes the offset in the index file, retrieves BlockAdd/BlockOffset, and fetches the log from HDFS via the Indexer.
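The write and read flows above can be sketched with a fixed-width index, where the position of an entry in the index file is simply the message index multiplied by the entry size. The 6-byte layout used here (a 4-byte block address plus a 2-byte offset inside the block) is an assumption chosen to match the 6-byte dense index mentioned for V5; the class and field names are hypothetical, BytesIO objects stand in for files on HDFS, and block compression is left out.

```python
import io
import struct

BLOCK_LIMIT = 64 * 1024               # flush threshold from the write flow
INDEX_ENTRY = struct.Struct(">IH")    # assumed: block address (4 B) + offset in block (2 B)
LENGTH = struct.Struct(">I")          # length prefix for each message in a block


class BlockStore:
    def __init__(self):
        self.data_file = io.BytesIO()
        self.index_file = io.BytesIO()
        self.block = bytearray()
        self.pending_offsets = []     # offsets of messages in the current block
        self.next_index = 0

    def write(self, message: bytes) -> int:
        # The start offset is always below 64 KB because the block is
        # flushed as soon as it reaches the limit, so it fits in 2 bytes.
        self.pending_offsets.append(len(self.block))
        self.block += LENGTH.pack(len(message)) + message
        message_index = self.next_index
        self.next_index += 1
        if len(self.block) >= BLOCK_LIMIT:
            self.flush()
        return message_index

    def flush(self):
        if not self.block:
            return
        block_address = self.data_file.seek(0, io.SEEK_END)
        self.data_file.write(self.block)
        self.index_file.seek(0, io.SEEK_END)
        for offset in self.pending_offsets:
            self.index_file.write(INDEX_ENTRY.pack(block_address, offset))
        self.block.clear()
        self.pending_offsets.clear()

    def read(self, message_index: int) -> bytes:
        self.flush()
        if not 0 <= message_index < self.next_index:
            raise IndexError(message_index)
        # Dense index: entry position is computed, not searched.
        self.index_file.seek(message_index * INDEX_ENTRY.size)
        entry = self.index_file.read(INDEX_ENTRY.size)
        block_address, offset = INDEX_ENTRY.unpack(entry)
        self.data_file.seek(block_address + offset)
        (length,) = LENGTH.unpack(self.data_file.read(LENGTH.size))
        return self.data_file.read(length)
```

Because every entry has the same width, a trace-ID point query needs one seek into the index and one seek into the data, with no scanning.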
File naming: Filenames originally encoded appId, appIp, and catIp, which caused massive file proliferation; V6 removed the IP dimension.
TSDB – Time‑Series Database
Before Prometheus, Ctrip built an internal TSDB inspired by OpenTSDB, storing metrics and indexes in HBase with tags encoded in the rowkey. Later the stack migrated to VictoriaMetrics (200 machines, 40 CPU, 256 GB, 4 TB SSD) and ClickHouse (600 machines, 32 CPU, 128 GB) for high‑cardinality data. VictoriaMetrics performance degrades when tag cardinality is large due to in‑memory mapping limits; ClickHouse is used to handle such cases.
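To make "tags encoded in the rowkey" concrete, the sketch below builds a rowkey in the style OpenTSDB uses by default: a 3-byte metric UID, a 4-byte timestamp rounded down to the hour, then a 3-byte UID pair for each tag in sorted order. Ctrip's internal layout is not given in the article, so this shows the OpenTSDB-style scheme only; UidTable and build_rowkey are hypothetical names, and sorting here is by tag name for simplicity.

```python
import struct


class UidTable:
    """Maps strings to compact fixed-width IDs, assigned on first use."""

    def __init__(self, width=3):
        self.width = width
        self.ids = {}

    def get(self, name: str) -> bytes:
        uid = self.ids.setdefault(name, len(self.ids) + 1)
        return uid.to_bytes(self.width, "big")


def build_rowkey(metric, timestamp, tags, metrics, tag_keys, tag_values):
    """Return metric UID + hour-aligned timestamp + sorted tag UID pairs."""
    base_hour = timestamp - timestamp % 3600
    key = metrics.get(metric) + struct.pack(">I", base_hour)
    # Sorting makes the key independent of the order tags were supplied in.
    for name in sorted(tags):
        key += tag_keys.get(name) + tag_values.get(tags[name])
    return key
```

Every distinct tag combination produces a distinct rowkey, so the number of rows grows with tag cardinality. The same pressure is what the paragraph above describes for high‑cardinality data, which is where ClickHouse is used instead.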
Q&A
Q: What tokenizer is used for log segmentation? A: Inverted indexes are built with Elasticsearch; logs are tokenized up to the first 128 bytes for indexing.
Q: Which framework powers the log ETL? A: A custom in‑house framework handles formatting and pre‑aggregation.
Q: What are the advantages of using ClickHouse for log storage? A: ClickHouse stores business logs with columnar schemas, offering fast queries for structured dimensions, though performance drops as the number of dimensions grows.
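The first answer says only the first 128 bytes of a log are tokenized for indexing. A minimal sketch of that truncation step is shown below. The tokenizer itself is an assumption (lowercasing and splitting on non-word characters); the answer does not name the Elasticsearch analyzer actually used.

```python
import re

INDEX_PREFIX_BYTES = 128  # from the Q&A: only the first 128 bytes are tokenized


def tokenize_for_index(raw: bytes) -> list:
    """Tokenize the leading bytes of a log line; the rest is stored but not indexed."""
    prefix = raw[:INDEX_PREFIX_BYTES]
    # errors="ignore" drops a multi-byte character cut in half by the truncation.
    text = prefix.decode("utf-8", errors="ignore")
    return re.findall(r"\w+", text.lower())
```

Capping the indexed prefix keeps the inverted index small for a write-heavy workload, at the cost that terms appearing later in a long log line cannot be matched through the index.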
21CTO
