How Meituan‑Dianping Scales Real‑Time Monitoring for Trillions of Events with CAT
This article explains how Meituan‑Dianping built the CAT platform to provide both user‑side and server‑side real‑time monitoring at trillion‑event scale, detailing its metrics, architecture evolution, storage strategies, and open‑source contributions.
1. Introduction to CAT
CAT is a real‑time monitoring platform used by Meituan‑Dianping to handle both user‑side and server‑side monitoring across all its apps, providing multi‑dimensional data analysis and alerting.
User‑side monitoring tracks experience metrics such as app launch speed and smoothness, while server‑side monitoring collects performance, exception, system, and business metrics from middleware frameworks (MVC, RPC, databases, caches, message queues, configuration systems).
Key challenges include understanding true user experience, locating root causes of server anomalies, and enabling operations teams to make scaling or degradation decisions based on QPS and response time.
User‑Side Monitoring
User dashboard for nationwide SLA analysis
User access monitoring with dimensions like response code, app source, network type, platform, and version
User resource monitoring (similar to access monitoring)
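The dimensional slicing described above can be sketched as a simple aggregation: group raw access logs by one dimension and compute per-bucket request counts and success ratios. This is an illustrative sketch, not CAT's implementation; the field names (`network`, `platform`, `code`) are assumptions.

```python
from collections import defaultdict

def aggregate_access_logs(logs, dimension):
    """Group user access logs by one dimension (e.g. 'network', 'platform',
    'version') and compute request count and success ratio per bucket."""
    stats = defaultdict(lambda: {"count": 0, "success": 0})
    for log in logs:
        bucket = stats[log[dimension]]
        bucket["count"] += 1
        if log["code"] == 200:
            bucket["success"] += 1
    return {
        key: {"count": b["count"], "sla": b["success"] / b["count"]}
        for key, b in stats.items()
    }

logs = [
    {"network": "4G", "platform": "iOS", "code": 200},
    {"network": "4G", "platform": "Android", "code": 500},
    {"network": "WiFi", "platform": "iOS", "code": 200},
]
by_network = aggregate_access_logs(logs, "network")  # '4G' bucket: 2 requests, SLA 0.5
```

The same function serves every dimension listed above by changing only the `dimension` argument, which is why a single aggregation pipeline can back all of the user dashboards.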
Server‑Side Monitoring
Performance, exception, system, business, and trace metrics
Transaction monitoring shows call counts, QPS, error rates, response time statistics, and supports multi‑level dimensions (time, project, machine, type, name)
Problem reports aggregate anomalies that would otherwise require combing through logs, surfacing exception names and stack traces for faster troubleshooting
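The transaction roll-up described above can be sketched as follows: raw transaction records are grouped by (type, name) and reduced to call count, QPS, error rate, and response-time statistics. This is a minimal illustration under assumed field names (`duration_ms`, `ok`), not CAT's actual report model.

```python
from collections import defaultdict

def summarize_transactions(records, window_seconds=60):
    """Roll up raw transaction records into per-(type, name) statistics:
    call count, QPS over the window, error rate, and avg/max response time."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["type"], r["name"])].append(r)
    report = {}
    for key, rs in groups.items():
        durations = [r["duration_ms"] for r in rs]
        errors = sum(1 for r in rs if not r["ok"])
        report[key] = {
            "count": len(rs),
            "qps": len(rs) / window_seconds,
            "error_rate": errors / len(rs),
            "avg_ms": sum(durations) / len(durations),
            "max_ms": max(durations),
        }
    return report

records = [
    {"type": "URL", "name": "/order", "duration_ms": 30, "ok": True},
    {"type": "URL", "name": "/order", "duration_ms": 90, "ok": False},
    {"type": "SQL", "name": "selectUser", "duration_ms": 5, "ok": True},
]
report = summarize_transactions(records)
```

Adding the remaining dimensions (time, project, machine) is just a matter of widening the grouping key.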
2. Architecture Evolution
Processing Capacity
Vertical optimization: performance tuning, message sampling and aggregation
Horizontal scaling: distributed expansion
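The message-aggregation optimization above can be sketched as client-side pre-aggregation: messages sharing the same (type, name, status) are collapsed into one message carrying a count before sending. A hypothetical sketch; the field names are assumptions, not CAT's wire format.

```python
from collections import Counter

def preaggregate(messages):
    """Client-side aggregation: collapse messages that share (type, name,
    status) into a single message carrying a count, cutting send volume."""
    counts = Counter((m["type"], m["name"], m["status"]) for m in messages)
    return [
        {"type": t, "name": n, "status": s, "count": c}
        for (t, n, s), c in counts.items()
    ]

messages = [
    {"type": "SQL", "name": "selectUser", "status": "0"},
    {"type": "SQL", "name": "selectUser", "status": "0"},
    {"type": "URL", "name": "/order", "status": "0"},
]
compact = preaggregate(messages)  # 3 messages collapse to 2
```

Because server-side reports only need counts per key, this trades no accuracy for a large reduction in messages on the wire.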
The early-2015 architecture required manually configuring CAT server routing, and configuration changes took effect only after restarts, driving up maintenance costs and limiting horizontal scaling.
Later improvements removed local configuration dependencies, introduced dynamic load‑balancing and real‑time routing, enabling graceful horizontal expansion.
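One common way to realize the dynamic routing described above is a consistent-hash ring: each application's messages hash to a server, and adding a server remaps only a fraction of applications, so the cluster can expand without client restarts. A minimal sketch of that general technique, not CAT's actual router.

```python
import hashlib
from bisect import bisect

class HashRing:
    """Minimal consistent-hash ring: each app routes its messages to one
    CAT server; adding a server remaps only a fraction of apps, enabling
    graceful expansion without restarting clients."""

    def __init__(self, servers, vnodes=100):
        # Virtual nodes smooth out the load distribution across servers.
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self._keys = [k for k, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, app_name):
        # Walk clockwise to the first virtual node at or after the hash.
        idx = bisect(self._keys, self._hash(app_name)) % len(self.ring)
        return self.ring[idx][1]

servers = ["cat-01", "cat-02", "cat-03"]
ring = HashRing(servers)
target = ring.route("order-service")
```

Routing is deterministic, so every instance of the same application always reports to the same server, which keeps per-application aggregation local to one node.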
Communication Capacity
Initial deployment sent monitoring data from Beijing to Shanghai, causing cross‑region bandwidth pressure.
Subsequent expansion placed CAT servers in Beijing, keeping data ingestion local while still writing aggregated statistics and raw messages to centralized storage (database and HDFS) on an hourly basis.
To reduce cross‑region traffic spikes, storage was moved to local data centers, improving latency and stability.
Storage Capacity
All monitoring messages are processed by CAT’s consumption engine and stored as hourly reports in a sharded database; for queries spanning multiple hours, an Elasticsearch layer was added, and metric data is also streamed to Kafka for secondary aggregation.
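The split described above implies a simple query-routing rule: a query that stays inside a single hourly bucket can be answered from the sharded hourly-report database, while anything spanning multiple hours goes to the Elasticsearch layer. A sketch of that rule under assumed store names:

```python
from datetime import datetime

def choose_store(start: datetime, end: datetime) -> str:
    """Route a report query: serve single-hour queries from the sharded
    hourly-report database; send multi-hour queries to Elasticsearch."""
    start_bucket = start.replace(minute=0, second=0, microsecond=0)
    end_bucket = end.replace(minute=0, second=0, microsecond=0)
    return "hourly_db" if start_bucket == end_bucket else "elasticsearch"
```

Usage: `choose_store(datetime(2018, 10, 1, 9, 5), datetime(2018, 10, 1, 9, 55))` picks the hourly database, while a 9:00-to-12:00 query falls through to Elasticsearch.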
Computing architecture 1.0 kept all data in memory with hourly persistence, offering fast access at the cost of high memory usage and weak high availability. Architecture 2.0 separates reports from metrics: metric data is serialized into a common model, written to Kafka, and batch-aggregated before being stored in Elasticsearch.
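The batch aggregation step in the 2.0 design can be sketched as merging metric points per (metric, minute) before writing, so one document replaces many raw points. An illustrative sketch with assumed field names (`metric`, `ts`, `value`), not CAT's actual consumer.

```python
from collections import defaultdict

def batch_aggregate(points):
    """Secondary aggregation in the 2.0 style: merge metric points consumed
    from Kafka per (metric, minute) before writing to Elasticsearch."""
    merged = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})
    for p in points:
        key = (p["metric"], p["ts"] // 60)  # one bucket per minute
        b = merged[key]
        b["count"] += 1
        b["sum"] += p["value"]
        b["max"] = max(b["max"], p["value"])
    return [
        {"metric": m, "minute": minute, **b}
        for (m, minute), b in merged.items()
    ]

points = [
    {"metric": "order.qps", "ts": 0, "value": 10.0},
    {"metric": "order.qps", "ts": 30, "value": 20.0},
    {"metric": "order.qps", "ts": 70, "value": 5.0},
]
docs = batch_aggregate(points)  # two 1-minute buckets from three points
```

Keeping count, sum, and max per bucket is enough to recompute averages and peaks later without retaining the raw points.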
Alerting Service
The original pull-based, single-node alert service could not scale horizontally, suffered high false-positive rates, and lacked flexible alerting strategies.
The evolved design uses Kafka for unified messaging, adds metadata for completeness, and supports multi‑condition alert rules, enabling downstream services to react with auto‑scaling or circuit‑breaking.
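A multi-condition rule of the kind described above can be sketched as a conjunction of threshold checks: the alert fires only when every condition holds, which suppresses false positives from a single noisy metric. The rule schema here is an assumption for illustration.

```python
def should_alert(metrics, rule):
    """Evaluate a multi-condition alert rule: each condition compares one
    metric against a threshold, and the alert fires only if all hold."""
    ops = {
        ">": lambda a, b: a > b,
        "<": lambda a, b: a < b,
        ">=": lambda a, b: a >= b,
    }
    return all(
        ops[c["op"]](metrics[c["metric"]], c["threshold"])
        for c in rule["conditions"]
    )

# Hypothetical rule: alert only if the error rate is high AND traffic is
# significant, so a few failures on a near-idle service stay quiet.
rule = {
    "conditions": [
        {"metric": "error_rate", "op": ">", "threshold": 0.05},
        {"metric": "qps", "op": ">=", "threshold": 100},
    ]
}
```

A downstream consumer of the alert topic could then trigger auto-scaling or circuit-breaking, as the text notes.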
3. Open Source Community
CAT 2.0 was released in October 2018, featuring a slimmer Java client, support for additional languages (C/C++, Python, Go), and sampling with compensation so that aggregate statistics still approximate full-volume data.
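The idea behind sampling with compensation can be sketched simply: keep each event with probability `rate`, then weight kept events by `1/rate` so aggregate counts still estimate the full-volume totals. A generic sketch of the technique, not CAT 2.0's implementation.

```python
import random

def sample_and_compensate(events, rate, seed=42):
    """Keep each event with probability `rate`; scale the kept count by
    1/rate so the result estimates the full-volume total."""
    rng = random.Random(seed)  # fixed seed only for reproducibility here
    kept = [e for e in events if rng.random() < rate]
    estimated_total = len(kept) / rate
    return kept, estimated_total

events = list(range(10_000))
kept, estimated_total = sample_and_compensate(events, rate=0.1)
```

With a 10% rate, roughly a tenth of the events are transmitted, yet the compensated estimate stays close to the true count of 10,000 with statistical error shrinking as volume grows.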
Significant storage optimizations were made; the source code is available at https://github.com/dianping/cat for further exploration.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career as we grow together.