Operations 17 min read

How CAT Powers Real‑Time Distributed Monitoring at Scale

This article introduces CAT, a Java‑based open‑source distributed real‑time monitoring system, covering its origins, design goals, architecture, message processing pipeline, instrumentation model, and how it achieves high availability, scalability, and low‑latency analytics for large‑scale internet services.

21CTO

Mar 20, 2016

How CAT Powers Real‑Time Distributed Monitoring at Scale

CAT Overview

CAT (Central Application Tracking) is a pure‑Java, open‑source distributed real‑time monitoring system originally developed at Dianping in 2011, inspired by eBay's CAL. Its source code is hosted on GitHub and was created by Wu Qimin and You Yong.

Current Status

Released under the Apache License, CAT is used by over 100 Chinese internet companies, including Dianping, Ctrip, Liepin, LuJinSuo, and Zhaogang. By March 2016 it had earned more than 1,000 GitHub stars.

Design Goals

Scalable: supports distributed, cross‑IDC deployment and horizontal scaling.

High availability: monitoring continues even if some applications fail.

Real‑time processing: information loses value quickly during incident handling.

Full data coverage: captures low‑probability events that may become critical.

High throughput: requires strong processing capacity to reconstruct incidents.

Fault tolerance: CAT failures should not affect business services.

Best‑effort reliability: allows message loss, aiming for four‑nine reliability.

Architecture

The design emphasizes simplicity, decentralization, and component collaboration. It uses a two‑layer structure, relying only on external storage such as HDFS and MySQL, and adopts a fully componentized implementation. Because a single machine cannot handle the massive message volume, CAT shards traffic and balances load across nodes.

Applications embed CAT API for instrumentation; backend threads use persistent TCP connections to stream messages to servers. CAT provides fail‑over to alternate servers when needed and currently uses a native serialization protocol, with plans to support additional protocols such as Thrift.

Messages are deserialized, placed into queues, and dispatched by a real‑time scheduler to independent consumer queues. Consumers typically perform incremental, stream‑based calculations and store hourly reports in a central database.

Daily reports are merged from 24 hourly reports; weekly reports are merged from seven daily reports, and so on.

The CAT console (UI) retrieves reports from storage for display. Real‑time reports are served via HTTP aggregation; historical reports are read directly from the database.

Raw messages are first stored locally, then uploaded to HDFS, while reports are stored as key/value pairs in MySQL.

Message Processing Pipeline

Processing consists of five stages: collection, transmission, analysis, storage, and presentation.

Collection : Applications instrument logs, which are sent as a message tree to a transmission queue. If the queue is full, messages are dropped.

Transmission : CAT clients maintain long‑lived TCP connections to CAT consumers, serializing message trees onto the network; consumers deserialize and enqueue them.

Analysis : A real‑time scheduler distributes messages to consumer queues. Report analyzers consume their own queues to update report models. A special raw‑log dump analyzer stores messages to the local file system without generating reports.

Storage : Reports reside in MySQL; compressed raw logs are persisted in HDFS for long‑term retention (reports typically kept >3 months, raw logs ~1 month).

Presentation : The UI layer aggregates real‑time results into HTML for browsers; historical reports are fetched directly from the database. XML output is also available for external tools.

Log Instrumentation

Instrumentation is crucial; quality determines monitoring effectiveness. CAT focuses on problem‑centric logging, treating any deviation from expectations (exceptions, latency spikes, TPS anomalies, etc.) as a problem.

Typical cross‑boundary scenarios include HTTP/REST, RPC/SOA, MQ, jobs, caches, DAL, search engines, third‑party gateways, and business metrics such as PV, login counts, order volume, and revenue.

Supported client languages are Java and .NET, with plans for C/C++, NodeJS, Go, PHP, and Python.

Domain Modeling

The model covers most daily instrumentation needs, offering customized analysis and visualization for various business types.

Message Tree

The message tree records the exact execution sequence of a user request across multiple machines, preserving nesting, ordering, and parallelism, which is invaluable for troubleshooting complex issues.

Following CAT’s instrumentation guidelines automatically stitches message trees from different machines into a hierarchical view, with an alternative human‑friendly visualization for quick problem identification.

CAT API

The API is designed to be lightweight, with future optimization opportunities. Most instrumentation occurs at the infrastructure layer, requiring minimal changes to application code. Some products use java‑agent style instrumentation; the quality of instrumentation depends on existing code standards and coverage.

Real‑Time Analysis

CAT does not simply compute results for whatever the user requests; instead, it tailors real‑time calculations based on log characteristics and problem scenarios. Reports are partitioned hourly; each hour’s report is computed in memory, giving the impression of real‑time results. Historical reports are immutable, so real‑time latency is irrelevant.

In‑memory incremental calculations cover three categories: counting, timing, and relational processing. Counting includes arithmetic (count, sum, avg, max/min, tps, std) and set‑based metrics (95th percentile, 99.9th percentile, DAU). Relational processing involves graph operations.

Report Modeling

Each report may have multiple dimensions; for example, a transaction report has five dimensions (application, machine, major category, minor category, distribution). To avoid excessive overhead, CAT adopts a fixed‑dimension model, organizing dimensions into a depth‑5 tree and always traversing from the root.

Each report runs in its own thread, eliminating lock contention; report models are intentionally non‑thread‑safe for simplicity and low overhead.

Report code is generated by a custom Maven plugin and heavily uses the Visitor pattern.

Message Storage

Message storage is the most challenging part of CAT. Large‑scale services like Dianping and Ctrip generate 300‑400 billion messages daily (≈50 TB), about 1 GB per second. With a 10‑node cluster, each node processes 5‑6 万 messages per second (≈100 MB). CAT employs back‑pressure and discarding mechanisms to protect application performance.

Summary

Decentralized data partitioning.

Log‑read‑only model with hourly in‑memory real‑time reports and aggregated historical reports.

Fully asynchronous, single‑threaded, lock‑free design.

Global message IDs, localized production, centralized storage.

Component‑based, service‑oriented philosophy promoting tool interoperability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Java architecture real-time analytics Observability CAT Distributed Monitoring

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.