Design and Architecture of the CAT Real-Time Monitoring System
The CAT real‑time monitoring system, open‑sourced in 2014 for Java applications, combines a lightweight ThreadLocal‑based client SDK, Netty‑driven asynchronous transport, and a highly scalable backend that processes ~100 TB of logs daily across 70 machines, using custom binary serialization, in‑memory modeling, segmented storage with 48‑bit indexing, and hourly aggregation to provide near‑full‑volume fault detection, localization, and performance analysis.
CAT (Central Application Tracking) is a near‑real‑time, near‑full‑volume monitoring system focused on Java applications. It was open‑sourced in 2014 and is used by Meituan‑Dianping and many other Internet companies.
Background : The system originated in 2011 when Meituan‑Dianping migrated from .NET to Java and needed a unified monitoring solution to replace fragmented tools such as Zabbix and Hawk.
Overall Design : CAT aims for fast fault detection, rapid fault localization, and performance optimization. Non‑functional requirements include real‑time processing, full‑volume data collection, high availability, fault tolerance, high throughput, scalability, and an intentional trade‑off allowing occasional message loss (four‑nines reliability).
The architecture consists of three modules: CAT‑client (SDK for applications and middleware), CAT‑consumer (real‑time analysis), and CAT‑home (web UI). Consumer and home can run in the same JVM, reducing hierarchy depth and improving stability.
Client Design : The client keeps one monitoring context per thread in a ThreadLocal, so recording a data point needs no locking and adds very little overhead to business code. Completed messages are placed on an in‑memory queue and sent asynchronously by a dedicated sender thread, which keeps network I/O off the business threads. The API defines four core monitoring objects: Transaction, Event, Heartbeat, and Metric.
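The per‑thread context and asynchronous hand‑off described above can be sketched as follows. This is a minimal illustration, not the CAT client's code: `MonitorContext`, `AsyncSender`, and the string message format are hypothetical names invented for the example, and the "send" step simply appends to a list where a real client would write to the network.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical per-thread monitoring context (not a CAT class). */
final class MonitorContext {
    // Each business thread gets its own context, so record() needs no locks.
    private static final ThreadLocal<MonitorContext> CURRENT =
            ThreadLocal.withInitial(MonitorContext::new);

    private final List<String> entries = new ArrayList<>();

    static MonitorContext current() { return CURRENT.get(); }

    void record(String type, String name) { entries.add(type + ":" + name); }

    /** Hands the collected entries to the sender and resets this thread's context. */
    boolean flush(AsyncSender sender) {
        String message = String.join("|", entries);
        entries.clear();
        return sender.offer(message);
    }
}

/** Hypothetical bounded queue drained by one dedicated sender thread. */
final class AsyncSender {
    private final BlockingQueue<String> queue;
    private final List<String> sent = Collections.synchronizedList(new ArrayList<>());
    private final Thread worker;
    private volatile boolean running = true;

    AsyncSender(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
        worker = new Thread(this::drain, "monitor-sender");
        worker.setDaemon(true);
    }

    void start() { worker.start(); }

    /** Never blocks the caller: returns false (message dropped) when the queue is full. */
    boolean offer(String message) { return queue.offer(message); }

    private void drain() {
        try {
            while (running || !queue.isEmpty()) {
                String message = queue.poll(50, TimeUnit.MILLISECONDS);
                if (message != null) {
                    sent.add(message); // stand-in for the network write
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stops accepting the "running" state and waits for queued messages to drain. */
    void stop() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    List<String> sent() { return new ArrayList<>(sent); }
}
```

Dropping a message when the queue is full, rather than blocking the caller, is one way to honor the stated trade‑off that occasional message loss is acceptable.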
Serialization and Transport : CAT uses a custom binary serialization protocol for efficiency and Netty‑based NIO for network transmission.
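To show why a hand‑written binary layout can be compact, here is a small sketch of a fixed‑header, length‑prefixed record encoded with `java.nio.ByteBuffer`. The field set and byte layout (`kind`, `timestampMillis`, `durationMicros`, `name`) are assumptions made for the example; they are not CAT's actual wire format, which the text does not specify.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical codec; the layout below is illustrative only. */
final class BinaryCodec {
    static final class Record {
        final byte kind;            // e.g. 1 = transaction, 2 = event (assumed codes)
        final long timestampMillis;
        final int durationMicros;
        final String name;

        Record(byte kind, long timestampMillis, int durationMicros, String name) {
            this.kind = kind;
            this.timestampMillis = timestampMillis;
            this.durationMicros = durationMicros;
            this.name = name;
        }
    }

    /** Layout: 1-byte kind, 8-byte timestamp, 4-byte duration, 4-byte name length, UTF-8 name. */
    static byte[] encode(Record r) {
        byte[] name = r.name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(1 + 8 + 4 + 4 + name.length);
        buf.put(r.kind);
        buf.putLong(r.timestampMillis);
        buf.putInt(r.durationMicros);
        buf.putInt(name.length);
        buf.put(name);
        return buf.array();
    }

    /** Reads the fields back in the same order they were written. */
    static Record decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte kind = buf.get();
        long timestamp = buf.getLong();
        int duration = buf.getInt();
        byte[] name = new byte[buf.getInt()];
        buf.get(name);
        return new Record(kind, timestamp, duration, new String(name, StandardCharsets.UTF_8));
    }
}
```

In this layout a record costs 17 bytes plus the name, with no field names or delimiters on the wire, which is the kind of saving a custom binary format trades against readability.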
Instrumentation : Logging points focus on problem‑centric events (exceptions, latency spikes, TPS anomalies, etc.) across HTTP/REST, RPC, MQ, jobs, caches, and data access layers.
Server Design : The backend processes ~100 TB of log data daily (roughly one trillion messages). It runs on a cluster of ~35 machines for computation and another ~35 for storage, handling peak inbound traffic of ~110 MB/s. The server is fully asynchronous: Netty receives messages and places them on an in‑memory queue, and worker threads consume them. Messages are first written to local disk, then uploaded asynchronously to HDFS.
Real‑Time Analysis : Reports are generated per hour, stored in memory, and merged into daily, weekly, and monthly aggregates. Models support count, timing, and relational calculations, including percentiles (95th, 99.9th) and standard deviation.
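One way to keep an hourly model small enough to live in memory while still supporting counts, percentiles, and standard deviation is to store running sums plus a duration histogram, so that merging two reports is plain addition. The sketch below illustrates that idea under stated assumptions: `DurationStats`, the 1 ms bucket width, and the 10‑second cap are hypothetical choices for the example, not CAT's actual report model.

```java
/** Hypothetical mergeable duration model (not a CAT class). */
final class DurationStats {
    private static final int MAX_MS = 10_000;            // assumed cap: slower calls share the last bucket
    private final long[] histogram = new long[MAX_MS + 1]; // one bucket per millisecond
    private long count;
    private double sum;
    private double sumOfSquares;

    void add(int durationMs) {
        int bucket = Math.max(0, Math.min(durationMs, MAX_MS));
        histogram[bucket]++;
        count++;
        sum += durationMs;
        sumOfSquares += (double) durationMs * durationMs;
    }

    /** Element-wise addition, so hourly models can be rolled up into daily or weekly ones. */
    void merge(DurationStats other) {
        for (int i = 0; i <= MAX_MS; i++) {
            histogram[i] += other.histogram[i];
        }
        count += other.count;
        sum += other.sum;
        sumOfSquares += other.sumOfSquares;
    }

    long count() { return count; }

    double mean() { return count == 0 ? 0 : sum / count; }

    /** Population standard deviation derived from the running sums. */
    double stdDev() {
        if (count == 0) return 0;
        double mean = mean();
        return Math.sqrt(Math.max(0, sumOfSquares / count - mean * mean));
    }

    /** Nearest-rank percentile read from the histogram, e.g. percentile(95) or percentile(99.9). */
    int percentile(double p) {
        if (count == 0) return 0;
        long rank = Math.max(1, (long) Math.ceil(p * count / 100.0));
        long seen = 0;
        for (int ms = 0; ms <= MAX_MS; ms++) {
            seen += histogram[ms];
            if (seen >= rank) return ms;
        }
        return MAX_MS;
    }
}
```

Because every field is additive, the same structure serves both the live hourly report and the merged daily, weekly, and monthly aggregates; the cost is that percentiles are only as precise as the bucket width.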
Storage Design : Raw logview data (~100 TB/day) is stored in compressed segment files, with a 48‑bit index entry per Message‑ID that enables fast random reads. Each entry encodes the offset of the compressed segment within the data file (32 bits) and the offset of the message within that segment (16 bits).
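The 32 + 16 bit split can be made concrete with a few lines of bit arithmetic. The sketch below packs the two offsets into a 48‑bit value and stores it as a fixed 6‑byte entry, so entry number `slot` sits at byte `slot * 6` of the index. The class and method names are hypothetical, and using a message's sequence number as the slot is an assumption for illustration; the text does not describe how CAT maps a Message‑ID to an index position.

```java
/** Hypothetical 48-bit index entry: 32-bit segment offset + 16-bit offset inside the segment. */
final class MessageIndex {
    static final int ENTRY_BYTES = 6; // 48 bits

    static long pack(long segmentOffset, int offsetInSegment) {
        if (segmentOffset < 0 || segmentOffset > 0xFFFF_FFFFL) {
            throw new IllegalArgumentException("segment offset must fit in 32 bits");
        }
        if (offsetInSegment < 0 || offsetInSegment > 0xFFFF) {
            throw new IllegalArgumentException("offset in segment must fit in 16 bits");
        }
        return (segmentOffset << 16) | offsetInSegment;
    }

    static long segmentOffset(long packed) { return (packed >>> 16) & 0xFFFF_FFFFL; }

    static int offsetInSegment(long packed) { return (int) (packed & 0xFFFF); }

    /** Writes the entry big-endian into its fixed-width slot. */
    static void write(byte[] index, int slot, long packed) {
        int base = slot * ENTRY_BYTES;
        for (int i = 0; i < ENTRY_BYTES; i++) {
            index[base + i] = (byte) (packed >>> (8 * (ENTRY_BYTES - 1 - i)));
        }
    }

    /** Reads the entry back; a lookup is one multiplication and six byte reads. */
    static long read(byte[] index, int slot) {
        int base = slot * ENTRY_BYTES;
        long packed = 0;
        for (int i = 0; i < ENTRY_BYTES; i++) {
            packed = (packed << 8) | (index[base + i] & 0xFF);
        }
        return packed;
    }
}
```

A 16‑bit inner offset limits each uncompressed segment to 64 KB, and a 32‑bit outer offset limits each data file to 4 GB; fixed‑width entries are what make the lookup a constant‑time seek rather than a search.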
Summary : CAT achieves distributed real‑time monitoring through decentralization, the read‑only nature of log data, in‑memory modeling, global Message‑IDs, and a component‑based service architecture.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
