How Meituan‑Dianping Scales Real‑Time Monitoring for Trillions of Events with CAT
This article explains how Meituan‑Dianping built the CAT platform to provide both user‑side and server‑side real‑time monitoring at trillion‑event scale, detailing its metrics, architecture evolution, storage strategies, and open‑source contributions.
1. Introduction to CAT
CAT is a real‑time monitoring platform used by Meituan‑Dianping to handle both user‑side and server‑side monitoring across all its apps, providing multi‑dimensional data analysis and alerting.
User‑side monitoring tracks experience metrics such as app launch speed and smoothness, while server‑side monitoring collects performance, exception, system, and business metrics from middleware frameworks (MVC, RPC, databases, caches, message queues, configuration systems).
Key challenges include understanding true user experience, locating root causes of server anomalies, and enabling operations teams to make scaling or degradation decisions based on QPS and response time.
User‑Side Monitoring
User dashboard for nationwide SLA analysis
User access monitoring with dimensions like response code, app source, network type, platform, and version
User resource monitoring (similar to access monitoring)
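The dimensional slicing described above can be sketched as a simple aggregation: group raw access logs by one dimension and compute per-bucket request counts and success ratios. This is an illustrative sketch, not CAT's implementation; the field names (`network`, `platform`, `code`) are assumptions.

```python
from collections import defaultdict

def aggregate_access_logs(logs, dimension):
    """Group user access logs by one dimension (e.g. 'network', 'platform',
    'version') and compute request count and success ratio per bucket."""
    stats = defaultdict(lambda: {"count": 0, "success": 0})
    for log in logs:
        bucket = stats[log[dimension]]
        bucket["count"] += 1
        if log["code"] == 200:
            bucket["success"] += 1
    return {
        key: {"count": b["count"], "sla": b["success"] / b["count"]}
        for key, b in stats.items()
    }

logs = [
    {"network": "4G", "platform": "iOS", "code": 200},
    {"network": "4G", "platform": "Android", "code": 500},
    {"network": "WiFi", "platform": "iOS", "code": 200},
]
by_network = aggregate_access_logs(logs, "network")  # '4G' bucket: 2 requests, SLA 0.5
```

The same function serves every dimension listed above by changing only the `dimension` argument, which is why a single aggregation pipeline can back all of the user dashboards.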
Server‑Side Monitoring
Performance, exception, system, business, and trace metrics
Transaction monitoring shows call counts, QPS, error rates, response time statistics, and supports multi‑level dimensions (time, project, machine, type, name)
Problem reports aggregate anomalies that would otherwise require combing through logs, surfacing exception names and stack traces for faster troubleshooting
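The transaction roll-up described above can be sketched as follows: raw transaction records are grouped by (type, name) and reduced to call count, QPS, error rate, and response-time statistics. This is a minimal illustration under assumed field names (`duration_ms`, `ok`), not CAT's actual report model.

```python
from collections import defaultdict

def summarize_transactions(records, window_seconds=60):
    """Roll up raw transaction records into per-(type, name) statistics:
    call count, QPS over the window, error rate, and avg/max response time."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["type"], r["name"])].append(r)
    report = {}
    for key, rs in groups.items():
        durations = [r["duration_ms"] for r in rs]
        errors = sum(1 for r in rs if not r["ok"])
        report[key] = {
            "count": len(rs),
            "qps": len(rs) / window_seconds,
            "error_rate": errors / len(rs),
            "avg_ms": sum(durations) / len(durations),
            "max_ms": max(durations),
        }
    return report

records = [
    {"type": "URL", "name": "/order", "duration_ms": 30, "ok": True},
    {"type": "URL", "name": "/order", "duration_ms": 90, "ok": False},
    {"type": "SQL", "name": "selectUser", "duration_ms": 5, "ok": True},
]
report = summarize_transactions(records)
```

Adding the remaining dimensions (time, project, machine) is just a matter of widening the grouping key.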
2. Architecture Evolution
Processing Capacity
Vertical optimization: performance tuning, message sampling and aggregation
Horizontal scaling: distributed expansion
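The message-aggregation optimization above can be sketched as client-side pre-aggregation: messages sharing the same (type, name, status) are collapsed into one message carrying a count before sending. A hypothetical sketch; the field names are assumptions, not CAT's wire format.

```python
from collections import Counter

def preaggregate(messages):
    """Client-side aggregation: collapse messages that share (type, name,
    status) into a single message carrying a count, cutting send volume."""
    counts = Counter((m["type"], m["name"], m["status"]) for m in messages)
    return [
        {"type": t, "name": n, "status": s, "count": c}
        for (t, n, s), c in counts.items()
    ]

messages = [
    {"type": "SQL", "name": "selectUser", "status": "0"},
    {"type": "SQL", "name": "selectUser", "status": "0"},
    {"type": "URL", "name": "/order", "status": "0"},
]
compact = preaggregate(messages)  # 3 messages collapse to 2
```

Because server-side reports only need counts per key, this trades no accuracy for a large reduction in messages on the wire.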
The early-2015 architecture required manually configuring CAT server routing, and configuration changes took effect only after restarts, driving up maintenance costs and limiting horizontal scaling.
Later improvements removed local configuration dependencies, introduced dynamic load‑balancing and real‑time routing, enabling graceful horizontal expansion.
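One common way to realize the dynamic routing described above is a consistent-hash ring: each application's messages hash to a server, and adding a server remaps only a fraction of applications, so the cluster can expand without client restarts. A minimal sketch of that general technique, not CAT's actual router.

```python
import hashlib
from bisect import bisect

class HashRing:
    """Minimal consistent-hash ring: each app routes its messages to one
    CAT server; adding a server remaps only a fraction of apps, enabling
    graceful expansion without restarting clients."""

    def __init__(self, servers, vnodes=100):
        # Virtual nodes smooth out the load distribution across servers.
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self._keys = [k for k, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, app_name):
        # Walk clockwise to the first virtual node at or after the hash.
        idx = bisect(self._keys, self._hash(app_name)) % len(self.ring)
        return self.ring[idx][1]

servers = ["cat-01", "cat-02", "cat-03"]
ring = HashRing(servers)
target = ring.route("order-service")
```

Routing is deterministic, so every instance of the same application always reports to the same server, which keeps per-application aggregation local to one node.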
Communication Capacity
Initial deployment sent monitoring data from Beijing to Shanghai, causing cross‑region bandwidth pressure.
Subsequent expansion placed CAT servers in Beijing, keeping data ingestion local while still writing aggregated statistics and raw messages to centralized storage (database and HDFS) on an hourly basis.
To reduce cross‑region traffic spikes, storage was moved to local data centers, improving latency and stability.
Storage Capacity
All monitoring messages are processed by CAT’s consumption engine and stored as hourly reports in a sharded database; for queries spanning multiple hours, an Elasticsearch layer was added, and metric data is also streamed to Kafka for secondary aggregation.
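The split described above implies a simple query-routing rule: a query that stays inside a single hourly bucket can be answered from the sharded hourly-report database, while anything spanning multiple hours goes to the Elasticsearch layer. A sketch of that rule under assumed store names:

```python
from datetime import datetime

def choose_store(start: datetime, end: datetime) -> str:
    """Route a report query: serve single-hour queries from the sharded
    hourly-report database; send multi-hour queries to Elasticsearch."""
    start_bucket = start.replace(minute=0, second=0, microsecond=0)
    end_bucket = end.replace(minute=0, second=0, microsecond=0)
    return "hourly_db" if start_bucket == end_bucket else "elasticsearch"
```

Usage: `choose_store(datetime(2018, 10, 1, 9, 5), datetime(2018, 10, 1, 9, 55))` picks the hourly database, while a 9:00-to-12:00 query falls through to Elasticsearch.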
Computing architecture 1.0 kept all data in memory with hourly persistence, offering fast access at the cost of high memory usage and weak high availability. Architecture 2.0 separates reports from metrics: metric data is serialized into a common model, written to Kafka, and batch-aggregated before being stored in Elasticsearch.
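The batch aggregation step in the 2.0 design can be sketched as merging metric points per (metric, minute) before writing, so one document replaces many raw points. An illustrative sketch with assumed field names (`metric`, `ts`, `value`), not CAT's actual consumer.

```python
from collections import defaultdict

def batch_aggregate(points):
    """Secondary aggregation in the 2.0 style: merge metric points consumed
    from Kafka per (metric, minute) before writing to Elasticsearch."""
    merged = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})
    for p in points:
        key = (p["metric"], p["ts"] // 60)  # one bucket per minute
        b = merged[key]
        b["count"] += 1
        b["sum"] += p["value"]
        b["max"] = max(b["max"], p["value"])
    return [
        {"metric": m, "minute": minute, **b}
        for (m, minute), b in merged.items()
    ]

points = [
    {"metric": "order.qps", "ts": 0, "value": 10.0},
    {"metric": "order.qps", "ts": 30, "value": 20.0},
    {"metric": "order.qps", "ts": 70, "value": 5.0},
]
docs = batch_aggregate(points)  # two 1-minute buckets from three points
```

Keeping count, sum, and max per bucket is enough to recompute averages and peaks later without retaining the raw points.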
Alerting Service
The original pull-based, single-node alert service could not scale horizontally, suffered high false-positive rates, and lacked flexible alerting strategies.
The evolved design uses Kafka for unified messaging, adds metadata for completeness, and supports multi‑condition alert rules, enabling downstream services to react with auto‑scaling or circuit‑breaking.
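A multi-condition rule of the kind described above can be sketched as a conjunction of threshold checks: the alert fires only when every condition holds, which suppresses false positives from a single noisy metric. The rule schema here is an assumption for illustration.

```python
def should_alert(metrics, rule):
    """Evaluate a multi-condition alert rule: each condition compares one
    metric against a threshold, and the alert fires only if all hold."""
    ops = {
        ">": lambda a, b: a > b,
        "<": lambda a, b: a < b,
        ">=": lambda a, b: a >= b,
    }
    return all(
        ops[c["op"]](metrics[c["metric"]], c["threshold"])
        for c in rule["conditions"]
    )

# Hypothetical rule: alert only if the error rate is high AND traffic is
# significant, so a few failures on a near-idle service stay quiet.
rule = {
    "conditions": [
        {"metric": "error_rate", "op": ">", "threshold": 0.05},
        {"metric": "qps", "op": ">=", "threshold": 100},
    ]
}
```

A downstream consumer of the alert topic could then trigger auto-scaling or circuit-breaking, as the text notes.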
3. Open Source Community
CAT 2.0 was released in October 2018, featuring a slimmer Java client, support for additional languages (C/C++, Python, Go), and sampling with compensation so that aggregate statistics still approximate full-volume data.
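The idea behind sampling with compensation can be sketched simply: keep each event with probability `rate`, then weight kept events by `1/rate` so aggregate counts still estimate the full-volume totals. A generic sketch of the technique, not CAT 2.0's implementation.

```python
import random

def sample_and_compensate(events, rate, seed=42):
    """Keep each event with probability `rate`; scale the kept count by
    1/rate so the result estimates the full-volume total."""
    rng = random.Random(seed)  # fixed seed only for reproducibility here
    kept = [e for e in events if rng.random() < rate]
    estimated_total = len(kept) / rate
    return kept, estimated_total

events = list(range(10_000))
kept, estimated_total = sample_and_compensate(events, rate=0.1)
```

With a 10% rate, roughly a tenth of the events are transmitted, yet the compensated estimate stays close to the true count of 10,000 with statistical error shrinking as volume grows.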
Significant storage optimizations were made; the source code is available at https://github.com/dianping/cat for further exploration.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career as we grow together.