iQiyi's Full-Link Automated Monitoring Platform: Design and Implementation
iQiyi’s full‑link automated monitoring platform unifies tracing, metric, and log collection with deep offline and real‑time analysis. It delivers a DAG‑based call graph, near‑real‑time ingestion of tens of millions of logs, multi‑dimensional alerts, and rapid root‑cause diagnosis. The platform has cut error‑lookup time by over 50 % and now serves as a core component of the company’s microservice reference architecture.
iQiyi's technical product team developed a full-link automated monitoring platform to address the challenges of monitoring complex microservice systems, providing unified monitoring standards and basic monitoring capabilities to enhance fault localization, deep analysis, accuracy, and transparency.
The platform builds on existing basic monitoring and log collection, integrating Google Dapper's tracing ideas with enhancements such as caching, offline processing, and deep analysis modules to improve query performance and enable automatic fault diagnosis.
It comprises four core components: link collection (call chains and service topology), metric collection, log collection, and deep analysis.
Link collection captures Span records, links the Spans of one request via their shared Trace ID, and represents call relationships as a directed acyclic graph (DAG).
Metric collection unifies disparate business‑line metrics (e.g., success rate, QPS, response time (RT), and P999 latency) under a common specification to enable consistent alerting and architecture‑level bottleneck detection.
Log collection optimizes the ELK‑based pipeline: Kafka buffers incoming logs, Spark Streaming tasks consume them in parallel, raw logs are stored in HBase, and indexed data goes to Elasticsearch and Hikv. Performance tuning (CPU and SSD upgrades, batch‑size adjustments, and ProtoBuf serialization with gzip compression) achieves near‑real‑time ingestion of tens of millions of logs per second.
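The link-collection step can be illustrated with a minimal sketch. It assumes a simplified Span record with hypothetical field names (`trace_id`, `span_id`, `parent_id`, `service`); the platform's actual Span schema is not given in the source. Spans are grouped by Trace ID, parent/child pairs become caller-to-callee edges between services, and a topological sort checks that the result is in fact a DAG.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    # Hypothetical, simplified Span record; not the platform's real schema.
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of a trace
    service: str


def build_service_graph(spans):
    """Group spans by trace ID and derive caller -> callee service edges."""
    by_trace = defaultdict(dict)
    for span in spans:
        by_trace[span.trace_id][span.span_id] = span

    edges = defaultdict(set)
    for trace in by_trace.values():
        for span in trace.values():
            parent = trace.get(span.parent_id)
            if parent is not None and parent.service != span.service:
                edges[parent.service].add(span.service)
    return {caller: sorted(callees) for caller, callees in edges.items()}


def topological_order(graph):
    """Kahn's algorithm; raises ValueError if the graph has a cycle."""
    nodes = set(graph) | {c for callees in graph.values() for c in callees}
    indegree = {n: 0 for n in nodes}
    for callees in graph.values():
        for callee in callees:
            indegree[callee] += 1
    ready = sorted(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for callee in graph.get(node, []):
            indegree[callee] -= 1
            if indegree[callee] == 0:
                ready.append(callee)
    if len(order) != len(nodes):
        raise ValueError("call graph contains a cycle")
    return order


# Illustrative spans from two traces; service names are made up.
SPANS = [
    Span("t1", "a", None, "gateway"),
    Span("t1", "b", "a", "video-api"),
    Span("t1", "c", "b", "user-service"),
    Span("t2", "d", None, "gateway"),
    Span("t2", "e", "d", "user-service"),
]
GRAPH = build_service_graph(SPANS)
ORDER = topological_order(GRAPH)
```

Here `GRAPH` maps each calling service to the services it calls, and `ORDER` lists services from entry point to leaf dependency. A production collector would also carry timestamps, durations, and status codes on each Span, which this sketch leaves out.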
Deep analysis correlates client‑side errors and user feedback with link and backend logs, performing offline and real‑time aggregation to pinpoint faults quickly, supported by rule‑based diagnosis and behavioral analysis, and supplemented by ClickHouse for multi‑dimensional metric aggregation.
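As a rough illustration of how client-side errors might be correlated with backend logs and passed through rule-based diagnosis, the sketch below joins the two on a shared trace ID and applies simple substring rules. The rule set, the record fields (`trace_id`, `service`, `message`, `code`), and the sample data are all assumptions made for the example; the source does not describe the platform's actual rules or log format.

```python
from collections import Counter

# Hypothetical rules: (substring searched in a backend log message, diagnosis label).
RULES = [
    ("timeout", "downstream timeout"),
    ("connection refused", "dependency unavailable"),
    ("outofmemory", "memory exhaustion"),
]


def match_rule(message):
    """Return the label of the first rule whose pattern occurs in the message."""
    lowered = message.lower()
    for pattern, label in RULES:
        if pattern in lowered:
            return label
    return None


def diagnose(client_errors, backend_logs):
    """Join client errors with backend logs on trace_id and label each error."""
    logs_by_trace = {}
    for log in backend_logs:
        logs_by_trace.setdefault(log["trace_id"], []).append(log)

    results = []
    for err in client_errors:
        cause, service = "unknown", None
        for log in logs_by_trace.get(err["trace_id"], []):
            label = match_rule(log["message"])
            if label is not None:
                cause, service = label, log["service"]
                break
        results.append({"trace_id": err["trace_id"], "cause": cause, "service": service})
    return results


def top_causes(results):
    """Aggregate diagnoses so the most frequent (cause, service) pair comes first."""
    return Counter((r["cause"], r["service"]) for r in results).most_common()


# Made-up sample data for illustration only.
CLIENT_ERRORS = [
    {"trace_id": "t1", "code": "E504"},
    {"trace_id": "t2", "code": "E500"},
    {"trace_id": "t3", "code": "E504"},
]
BACKEND_LOGS = [
    {"trace_id": "t1", "service": "video-api", "message": "read timeout calling user-service"},
    {"trace_id": "t2", "service": "user-service", "message": "Connection refused by db-proxy"},
    {"trace_id": "t3", "service": "video-api", "message": "Timeout after 3000 ms"},
]
RESULTS = diagnose(CLIENT_ERRORS, BACKEND_LOGS)
```

Calling `top_causes(RESULTS)` ranks the suspected faults by frequency, which is the kind of aggregation that lets an operator see that most client errors trace back to one service. In practice the multi-dimensional aggregation would run in a store such as ClickHouse rather than in process memory.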
The overall architecture presents a three‑layer visualization (business line, service, call), enriches each node with machine and business metrics, uses color coding for warnings and errors, and enables minute‑level metric aggregation and alerting. Benefits include unified metrics, rich alerting, root‑cause analysis, capacity planning, automatic log analysis, and cross‑region call detection.
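Minute-level aggregation and color coding of nodes could look roughly like the following sketch. The input tuple layout, the success-rate thresholds, and the color names are assumptions chosen for illustration; the source only states that metrics are aggregated per minute and that warnings and errors are color coded.

```python
from collections import defaultdict

# Assumed thresholds on success rate; the source does not state the real ones.
WARN_BELOW = 0.99
ERROR_BELOW = 0.95


def aggregate_per_minute(calls):
    """Aggregate calls into one-minute buckets per service.

    calls: iterable of (service, unix_ts_seconds, ok, rt_ms) tuples.
    Returns {(service, minute_start_ts): {"qps", "success_rate", "avg_rt_ms"}}.
    """
    buckets = defaultdict(lambda: {"total": 0, "ok": 0, "rt_sum": 0.0})
    for service, ts, ok, rt_ms in calls:
        bucket = buckets[(service, ts // 60 * 60)]
        bucket["total"] += 1
        bucket["ok"] += 1 if ok else 0
        bucket["rt_sum"] += rt_ms
    return {
        key: {
            "qps": b["total"] / 60,
            "success_rate": b["ok"] / b["total"],
            "avg_rt_ms": b["rt_sum"] / b["total"],
        }
        for key, b in buckets.items()
    }


def node_color(success_rate):
    """Map a success rate to a display color for a topology node."""
    if success_rate < ERROR_BELOW:
        return "red"
    if success_rate < WARN_BELOW:
        return "yellow"
    return "green"


# Made-up call records: (service, timestamp in seconds, success flag, response time in ms).
CALLS = [
    ("video-api", 120, True, 10.0),
    ("video-api", 130, True, 20.0),
    ("video-api", 150, False, 30.0),
    ("video-api", 179, True, 40.0),
    ("user-service", 125, True, 5.0),
]
METRICS = aggregate_per_minute(CALLS)
COLORS = {key: node_color(m["success_rate"]) for key, m in METRICS.items()}
```

With this sample data, `video-api` has one failure in four calls during the minute starting at timestamp 120 and is therefore marked as an error node, while `user-service` stays healthy. Tail-latency metrics such as P999 would need per-bucket histograms or sketches rather than the simple sums used here.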
Since deployment, the platform has filled the mobile‑end monitoring gap, improved fault‑location efficiency, reduced error‑log lookup time by over 50 %, and become part of the company’s microservice reference architecture. Future work will add full‑link pressure testing to simulate online loads and intelligently guide resource scaling.
iQIYI Technical Product Team