
How iQIYI Built a Scalable CAT‑Based Monitoring Platform for 100+ Microservices

This case study outlines iQIYI's LEDAO middle‑platform monitoring challenges, evaluates open‑source solutions, details the selection and customization of CAT, and presents deployment, integration, health‑check, and alerting enhancements that now support over 100 microservices across multiple regions.

dbaplus Community

System monitoring is essential to keeping a project healthy: without it, failures go unnoticed and can cause severe losses. iQIYI's LEDAO middle platform, which handles video, audio, subtitles, and images, grew to over 100 microservices, and the existing monitoring could no longer keep up. The team urgently needed rapid anomaly detection, observable deployments, comprehensive performance metrics, stable health checks, container resource monitoring, and clear business‑level dashboards.

Monitoring Requirements

Machine monitoring: CPU, memory, disk, and network metrics (e.g., cpu.busy, mem.memfree, net.if.in.bytes).

System monitoring: service status, QPS, interface performance, success rates, error logs, and cluster‑wide traffic.

Business monitoring: domain‑specific KPIs such as production counts, success rates, and latency, which support decision‑making.

Technology Evaluation

The team surveyed internal tools and three popular open‑source monitoring systems:

Open-Falcon – easy to integrate and minimally intrusive, but focused mainly on machine metrics.

Prometheus – powerful query language (PromQL) and pull‑based collection, but it relies on Grafana for rich dashboards and has a steeper learning curve.

CAT (by Meituan) – comprehensive real‑time monitoring and alerting, covering most scenarios, though integration cost is higher.

Because the team needed a complete, stable system with a mature UI and robust reporting and alerting, it chose CAT.

LEDAO‑CAT Deployment

The initial deployment was a minimal setup on a few VMs, built from the open‑source GitHub repository. After successful trials, the system was upgraded repeatedly to fit LEDAO's scale. Two clusters, one overseas and one in Mainland China, now run CAT, serving 100+ microservices, handling more than 10k TPS, and processing about 1.5 TB of data daily.

Services integrate via the ledao-cat-client package, which configures monitoring with minimal code changes.

Key Enhancements

1. Flexible Integration: The team replaced static client.xml files with three configurable methods (traditional XML, QAE environment variables, and properties files), reducing operational overhead and supporting containerized deployments.

2. Extended Instrumentation: The team developed a proxy package offering multiple injection styles: AOP, declarative annotations (service, method, controller, DAO), and bulk property configuration.

3. Health‑Check Module: A new module periodically probes services from different data centers, and repeated failures trigger alerts. This closes the previous blind spot, in which a crashed client simply stopped reporting.

4. Alerting Integration: CAT alerts were merged with iQIYI's internal notification channels (email, instant chat, SMS), greatly improving alert visibility and response speed.
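The three configuration methods in item 1 can be pictured as a lookup chain. The sketch below is illustrative only: the source does not show the ledao-cat-client API, so the class name, the key names CAT_SERVERS and cat.servers, and the precedence order (environment variable, then properties file, then XML) are all assumptions.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical resolver: returns the CAT server list from the first source that defines it. */
final class CatServerResolver {
    // Key names are illustrative assumptions, not the real ledao-cat-client keys.
    static final String ENV_KEY = "CAT_SERVERS";
    static final String PROP_KEY = "cat.servers";

    private final Map<String, String> env; // e.g. System.getenv() inside a container
    private final Properties props;        // e.g. loaded from a properties file
    private final String xmlServers;       // value already parsed from client.xml, or null

    CatServerResolver(Map<String, String> env, Properties props, String xmlServers) {
        this.env = env;
        this.props = props;
        this.xmlServers = xmlServers;
    }

    /** Checks the environment first, then properties, then the XML value (assumed order). */
    Optional<String> resolve() {
        return firstNonBlank(env.get(ENV_KEY), props.getProperty(PROP_KEY), xmlServers);
    }

    private static Optional<String> firstNonBlank(String... candidates) {
        for (String c : candidates) {
            if (c != null && !c.trim().isEmpty()) {
                return Optional.of(c.trim());
            }
        }
        return Optional.empty();
    }
}
```

Putting the environment variable first suits containers, where an image is built once and configured per deployment. A different order would work equally well as long as it is documented.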
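The annotation-driven instrumentation in item 2 can be approximated with a JDK dynamic proxy. This is a stand-in technique, not the team's actual proxy package, which the source does not show. The @Monitored annotation, the MonitoringProxy class, and the in-memory RECORDS list are hypothetical; a real client would report to CAT instead of appending to a list.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical marker annotation; the real annotation names are not given in the source. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Monitored {
    String type() default "Service";
}

/** Wraps an interface implementation so that annotated methods are timed and their outcome recorded. */
final class MonitoringProxy {
    /** Stand-in for the monitoring backend (not thread-safe; for illustration only). */
    static final List<String> RECORDS = new ArrayList<>();

    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> iface) {
        return (T) Proxy.newProxyInstance(
            iface.getClassLoader(),
            new Class<?>[] {iface},
            (proxy, method, args) -> {
                Monitored m = method.getAnnotation(Monitored.class);
                if (m == null) {
                    return invoke(target, method, args); // unannotated methods pass straight through
                }
                long start = System.nanoTime();
                String status = "OK";
                try {
                    return invoke(target, method, args);
                } catch (Throwable t) {
                    status = "FAIL";
                    throw t;
                } finally {
                    long micros = (System.nanoTime() - start) / 1_000;
                    RECORDS.add(m.type() + ":" + method.getName() + ":" + status + ":" + micros + "us");
                }
            });
    }

    private static Object invoke(Object target, Method method, Object[] args) throws Throwable {
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause(); // rethrow the original exception, not the reflection wrapper
        }
    }
}

/** Example business interface; only the annotated method is instrumented. */
interface VideoService {
    @Monitored(type = "Service")
    String transcode(String id);

    String ping();
}
```

Calling MonitoringProxy.wrap(impl, VideoService.class) returns an object that behaves like impl but records one entry per transcode call. The business code contains no monitoring calls of its own.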
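The health-check rule in item 3, where repeated failures trigger an alert, comes down to counting consecutive failed probes per service. The following is a minimal sketch under stated assumptions: the HealthChecker class, the probe and alert callbacks, and the numeric threshold are hypothetical, since the source says only "repeated failures". Probing from several data centers is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Tracks consecutive probe failures per service and alerts once a threshold is reached. */
final class HealthChecker {
    private final Predicate<String> probe;    // returns true if the service answered
    private final Consumer<String> alertSink; // receives one message per alert
    private final int failureThreshold;       // assumed parameter, not a value from the source
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();

    HealthChecker(Predicate<String> probe, Consumer<String> alertSink, int failureThreshold) {
        this.probe = probe;
        this.alertSink = alertSink;
        this.failureThreshold = failureThreshold;
    }

    /** Runs one probe for the given service; a scheduler would call this periodically. */
    void checkOnce(String service) {
        if (probe.test(service)) {
            consecutiveFailures.remove(service); // any success resets the streak
            return;
        }
        int failures = consecutiveFailures.merge(service, 1, Integer::sum);
        if (failures == failureThreshold) {      // alert once per streak, not on every failure
            alertSink.accept(service + " failed " + failures + " consecutive health checks");
        }
    }
}
```

In practice checkOnce could be driven by ScheduledExecutorService.scheduleAtFixedRate. Because the probe runs outside the monitored service, it still fires when the service's own client has crashed and stopped reporting.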

Results

New services can be onboarded in under 5 minutes.

Instrumentation is nearly non‑intrusive, because the proxy package injects monitoring without changes to business code.

Comprehensive coverage of hardware, health, anomalies, performance, and business metrics across the microservice ecosystem.

Robust alerting enables real‑time detection and rapid remediation of issues.
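The alerting integration described under Key Enhancements fans a single CAT alert out to several notification channels. The sketch below shows one way to do that so a failure in one channel does not block the others. The AlertDispatcher class and its channel callbacks are hypothetical stand-ins for iQIYI's internal email, chat, and SMS services, which the source does not describe.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical dispatcher: forwards one alert message to every registered channel. */
final class AlertDispatcher {
    // LinkedHashMap keeps channels in registration order.
    private final Map<String, Consumer<String>> channels = new LinkedHashMap<>();

    AlertDispatcher register(String name, Consumer<String> sender) {
        channels.put(name, sender);
        return this;
    }

    /** Sends the message to all channels and returns the names of those that failed. */
    List<String> dispatch(String message) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Consumer<String>> channel : channels.entrySet()) {
            try {
                channel.getValue().accept(message);
            } catch (RuntimeException e) {
                failed.add(channel.getKey()); // keep going so other channels still deliver
            }
        }
        return failed;
    }
}
```

Returning the failed channel names lets the caller retry or escalate, which matters when the alert concerns the same infrastructure one of the channels depends on.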

The LEDAO‑CAT platform now underpins the stability of iQIYI's rapidly expanding services, with ongoing work on distributed transactions, richer business dashboards, and deeper integration with Nacos.
