How iQIYI Built a Scalable CAT‑Based Monitoring Platform for 100+ Microservices
This case study outlines iQIYI's LEDAO middle‑platform monitoring challenges, evaluates open‑source solutions, details the selection and customization of CAT, and presents deployment, integration, health‑check, and alerting enhancements that now support over 100 microservices across multiple regions.
System monitoring is essential to service reliability: without it, failures go unnoticed until they cause serious losses. iQIYI's LEDAO middle‑platform, which handles video, audio, subtitles, and images, grew to more than 100 microservices, and the existing monitoring could no longer keep up. The team urgently needed rapid anomaly detection, observable deployments, comprehensive performance metrics, stable health checks, container resource monitoring, and clear business‑level dashboards.
Monitoring Requirements
Machine monitoring: CPU, memory, disk, and network metrics (e.g., cpu.busy, mem.memfree, net.if.in.bytes).
System monitoring: service status, QPS, interface performance, success rates, error logs, and cluster‑wide traffic.
Business monitoring: domain‑specific KPIs such as production counts, success rates, and latency, used to support decision‑making.
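As a rough illustration of the system-level metrics listed above, the sketch below derives QPS, success rate, and average latency from a window of request records. The `Request` type and `summarize` function are hypothetical names introduced for this example; they are not part of CAT or the LEDAO client.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed call to a service interface (hypothetical record shape)."""
    timestamp: float   # seconds since the start of the window
    ok: bool           # True if the call succeeded
    latency_ms: float  # end-to-end latency in milliseconds


def summarize(requests, window_seconds):
    """Aggregate a window of requests into QPS, success rate, and mean latency."""
    total = len(requests)
    if total == 0:
        # No traffic: rate-style metrics are undefined rather than zero.
        return {"qps": 0.0, "success_rate": None, "avg_latency_ms": None}
    succeeded = sum(1 for r in requests if r.ok)
    return {
        "qps": total / window_seconds,
        "success_rate": succeeded / total,
        "avg_latency_ms": sum(r.latency_ms for r in requests) / total,
    }
```

A real monitoring backend would compute these per interface and per time bucket; the sketch only shows the arithmetic behind the numbers.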
Technology Evaluation
The team surveyed internal tools and three popular open‑source monitoring systems:
OpenFalcon – easy to integrate, low intrusion, but focuses mainly on machine metrics.
Prometheus – powerful query language (PromQL) and pull‑based collection, but it is typically paired with Grafana for dashboards and has a steeper learning curve.
CAT (by Meituan) – comprehensive real‑time monitoring and alerting, covering most scenarios, though integration cost is higher.
Because the team needed a complete, stable platform with a mature UI and robust reporting and alerting, it chose CAT.
LEDAO‑CAT Deployment
Initial deployment used the open‑source GitHub repository for a minimal setup on a few VMs. After successful trials, the system was upgraded repeatedly to fit LEDAO's scale. Two clusters (overseas and Mainland China) now run CAT, serving 100+ microservices, handling >10k TPS and processing ~1.5 TB of data daily.
Services integrate via the ledao-cat-client package, which configures monitoring with minimal code changes.
Key Enhancements
1. Flexible Integration: Instead of relying only on a static client.xml file, the client now supports three configuration methods (traditional XML, QAE environment variables, and properties files), which reduces operational overhead and supports containerized deployments.
2. Extended Instrumentation: Developed a proxy package offering multiple injection styles: AOP, declarative annotations (service, method, controller, DAO), and bulk property configuration.
3. Health‑Check Module: Added a module that periodically probes services from different data centers and raises an alert after repeated failures. This closes a previous blind spot: a crashed client simply stopped reporting, so its failure went undetected.
4. Alerting Integration: Connected CAT alerts to iQIYI's internal notification channels (email, instant chat, SMS), improving alert visibility and response speed.
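To make the first enhancement more concrete, here is a minimal sketch of resolving the CAT server list from the three kinds of sources. The article does not describe the actual lookup order or key names, so the precedence (environment, then properties, then XML), the `CAT_SERVERS` variable, and the `cat.servers` key are all assumptions made for illustration; the XML shape is likewise assumed to follow a `<server ip="..." port="..."/>` layout.

```python
import xml.etree.ElementTree as ET


def servers_from_env(environ, key="CAT_SERVERS"):
    """Parse a comma-separated host:port list from an environment mapping."""
    raw = environ.get(key, "")
    return [item.strip() for item in raw.split(",") if item.strip()]


def servers_from_properties(text, key="cat.servers"):
    """Parse the same list from simple key=value properties text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() == key:
            return [item.strip() for item in value.split(",") if item.strip()]
    return []


def servers_from_xml(xml_text):
    """Collect ip:port pairs from every <server> element in the XML text."""
    root = ET.fromstring(xml_text)
    return [f'{s.get("ip")}:{s.get("port")}' for s in root.iter("server")]


def resolve_servers(environ, properties_text=None, xml_text=None):
    """Return (source, servers) from the first source that yields any servers."""
    found = servers_from_env(environ)
    if found:
        return "env", found
    if properties_text:
        found = servers_from_properties(properties_text)
        if found:
            return "properties", found
    if xml_text:
        found = servers_from_xml(xml_text)
        if found:
            return "xml", found
    raise LookupError("no CAT server configuration found")
```

Putting environment variables first suits containerized deployments, where the same image runs in several regions and only the injected environment differs.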
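The declarative style in the second enhancement can be approximated in Python with a decorator, which plays a role similar to an annotation processed by AOP. This is only an analogy: the `monitored` decorator and the in-memory `RECORDS` list are hypothetical stand-ins, and a real client would report each record to the CAT server rather than keep it in a list.

```python
import functools
import time

# Stand-in for a reporter; a real client would send these to the monitoring server.
RECORDS = []


def monitored(kind):
    """Wrap a function so each call records its name, status, and duration."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "success"
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Record the failure type, then let the caller see the exception.
                status = type(exc).__name__
                raise
            finally:
                RECORDS.append({
                    "type": kind,
                    "name": func.__name__,
                    "status": status,
                    "duration_ms": (time.perf_counter() - start) * 1000.0,
                })
        return wrapper
    return decorator


@monitored("service")
def transcode(video_id):
    """Example business method; the body is a placeholder."""
    if not video_id:
        raise ValueError("video_id is required")
    return f"transcoded:{video_id}"
```

The business method stays free of monitoring code, which is the sense in which this kind of instrumentation is close to non-intrusive.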
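The health-check behavior in the third enhancement (probe periodically, alert on repeated failures) could look roughly like the sketch below. The threshold of three consecutive failures, the `HealthChecker` class, and the callback signatures are assumptions for illustration; the article does not state how many failures trigger an alert or how probes are scheduled.

```python
class HealthChecker:
    """Track consecutive probe failures per (service, site) and alert at a threshold."""

    def __init__(self, probe, alert, failure_threshold=3):
        self.probe = probe            # callable(service, site) -> bool
        self.alert = alert            # callable(service, site, failure_count)
        self.failure_threshold = failure_threshold
        self.failures = {}            # (service, site) -> consecutive failure count

    def check(self, service, site):
        """Run one probe; a scheduler would call this at a fixed interval."""
        key = (service, site)
        if self.probe(service, site):
            # Any success resets the streak, so isolated blips never alert.
            self.failures[key] = 0
            return
        count = self.failures.get(key, 0) + 1
        self.failures[key] = count
        # Alert exactly once, when the streak first reaches the threshold.
        if count == self.failure_threshold:
            self.alert(service, site, count)
```

Because the probe runs outside the monitored service, it still detects a service whose own client has crashed and gone silent. Injecting `probe` and `alert` as callables keeps the counting logic independent of the transport (for example an HTTP request) and of the notification channel.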
Results
New services can be onboarded in under 5 minutes.
Instrumentation is nearly non‑intrusive thanks to the proxy package.
Comprehensive coverage of hardware, health, anomalies, performance, and business metrics across the microservice ecosystem.
Robust alerting enables real‑time detection and rapid remediation of issues.
The LEDAO‑CAT platform now underpins the stability of iQIYI's rapidly expanding services, with ongoing work on distributed transactions, richer business dashboards, and deeper integration with Nacos.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
