How CAT Enables Scalable Real‑Time Monitoring for Distributed Systems
This article introduces CAT, an open‑source Java‑based distributed real‑time monitoring platform, detailing its design goals, architecture, message processing pipeline, logging instrumentation, API, real‑time analysis, report modeling, storage challenges, and key takeaways for building highly available, scalable monitoring solutions.
In late 2011, the author joined Dazhong Dianping and started developing CAT, a distributed real‑time monitoring system inspired by eBay's CAL.
Modern internet services adopt agile development and service‑oriented architectures, leading to frequent failures and complex fault isolation. Existing log tools lack integration across services, and massive dispersed logs make root‑cause analysis difficult, motivating CAT.
CAT Overview
CAT (Central Application Tracking) is a pure‑Java, open‑source distributed real‑time monitoring platform hosted on GitHub, authored by Wu Qimin and You Yong.
Current Adoption
Released under the Apache License, CAT is used by over 100 Chinese internet companies, including Dazhong Dianping, Ctrip, Liepin, Lu Jin Suo, and Zhaogang, and has earned more than 1,000 GitHub stars as of March 2016.
Design Goals
Scalable: supports distributed, cross‑IDC deployment and horizontal scaling.
Highly available: monitoring continues even if applications fail.
Real‑time processing: information loses value quickly during incidents.
Full‑volume data: rare events must be captured.
High throughput: requires strong processing capacity.
Fault tolerance: CAT failures should not affect business services.
Best‑effort reliability: allows message loss, achieving ~99.9% reliability.
Architecture
The architecture emphasizes simplicity, decentralization, and collaborative components. It consists of two layers and relies only on external storage such as HDFS and MySQL.
Clients embed CAT API calls; messages are sent via persistent TCP connections to backend servers, which deserialize and enqueue them for asynchronous processing. Consumers handle their own queues independently; overloaded queues cause message drops.
Reports are generated hourly and stored in MySQL, while raw logs are compressed and saved to HDFS. The UI console retrieves reports for real‑time or historical display.
Message Processing Stages
Messages pass through five stages: collection, transmission, analysis, storage, and presentation.
Collection : Application code instruments logs into a “message tree”. If the queue is full, messages are discarded.
Transmission : Clients maintain long‑lived TCP connections; messages are serialized, sent, deserialized, and placed into consumer queues.
Analysis : Real‑time schedulers dispatch messages to consumer‑specific queues; report generators consume these queues to update hourly reports. A special raw‑log dump consumer stores messages to the file system.
Storage : Reports reside in MySQL; raw logs are compressed and stored in HDFS for up to a month, while reports are kept for three months or longer.
Presentation : The UI aggregates real‑time results from consumers or fetches historical data from the database; XML output is also supported for external tools.
Logging and Instrumentation
Logging is a critical monitoring step; CAT’s instrumentation focuses on problem‑centric events such as exceptions, latency spikes, or abnormal TPS. Supported client languages include Java and .NET, with plans for C/C++, Node.js, Go, PHP, and Python.
Domain Modeling
The model covers most daily instrumentation needs, providing personalized analysis and visualization for various event types.
Message Tree
The message tree records the exact execution sequence of a user request across multiple machines, preserving nesting, ordering, and parallelism, which aids developers in tracing complex issues.
CAT API
The API is designed to be lightweight with future optimization potential. Most instrumentation occurs at the infrastructure layer; applications rarely need direct calls. Alternative approaches such as Java agents are also supported.
Real‑Time Analysis
CAT computes results in memory for the current hour, delivering “real‑time” reports on each request. Historical reports are immutable and thus not time‑critical.
Incremental calculations cover counting, timing, and relational processing, including arithmetic counts (count, sum, avg, max/min, TPS, std) and set‑based metrics (95th/99.9th percentile, DAU).
Report Modeling
Reports use a fixed‑dimension tree structure to balance flexibility and overhead. Each report runs in its own thread without locks, keeping the model non‑thread‑safe but simple and low‑cost.
Message Storage Challenges
Message volume can reach 300‑400 billion events per day (≈50 TB), with peak rates of 500‑600 k messages per second (≈1 GB/s). Scaling to ten machines still requires processing 50‑60 k messages per second per node.
Key Takeaways
Decentralized data partitioning.
One‑hour windows for in‑memory real‑time reports; historical aggregation.
Fully asynchronous, single‑threaded, lock‑free design.
Global message IDs with localized production and centralized storage.
Component‑based, service‑oriented architecture promoting tool interoperability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITFLY8 Architecture Home
ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
