How iQIYI Built an Unmanned Fault‑Handling System for 99% Reliability
This article describes iQIYI's unmanned monitoring platform: its design goals, overall architecture, and core modules (real‑time data collection, the decision engine, and the event‑processing engine). It also covers the machine‑learning model used to predict production time, the system's operational results, and its future roadmap.
Unmanned Monitoring Goal
The unmanned system aims to automate the workflow, produce trustworthy results, and remove the need for human supervision. For iQIYI's content production pipeline, this means timely fault detection, automatic remediation, risk alerts, and notifications.
Overall Architecture
The architecture first collects runtime data from business services via iQIYI's middle‑platform data center, performs real‑time intelligent analysis to detect faults and anomalies, and then hands them to a fault‑handling module for automated resolution, achieving fully unmanned recovery.
System Workflow
Production‑stage data is collected into the unmanned system via the middle‑platform data center.
The decision engine performs real‑time analysis based on SLA, thresholds, etc., generating events.
Events are dispatched to the Beacon event‑processing engine.
Beacon executes configurable workflows for fault repair, alarm notification, recovery detection, and statistics.
The training engine uses offline data to train fault‑analysis models that feed back into the decision engine.
Core Modules
Real‑time Data Collection (based on the middle‑platform data center)
OLTP components continuously collect status and progress of each functional module, supporting terabyte‑scale data management. Data are pushed to the data center, a change notification is sent to an RMQ queue, and the unmanned system pulls the data for end‑to‑end tracking.
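The push, notify, and pull pattern described above can be sketched roughly as follows. This is a minimal illustration, not iQIYI's implementation: the `DataCenter` class, the `pull_changes` function, and the record fields are hypothetical, and Python's in-process `queue.Queue` stands in for the RMQ queue.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    """Hypothetical in-memory stand-in for the middle-platform data center."""
    records: dict = field(default_factory=dict)
    notifications: queue.Queue = field(default_factory=queue.Queue)

    def push(self, task_id: str, status: dict) -> None:
        # Store the latest status, then announce the change on the queue.
        self.records[task_id] = status
        self.notifications.put(task_id)


def pull_changes(dc: DataCenter) -> list:
    """Drain pending notifications and fetch the record each one points to."""
    changed = []
    while True:
        try:
            task_id = dc.notifications.get_nowait()
        except queue.Empty:
            break
        changed.append((task_id, dc.records[task_id]))
    return changed


dc = DataCenter()
dc.push("task-1", {"stage": "transcode", "progress": 0.4})
dc.push("task-2", {"stage": "review", "progress": 1.0})
updates = pull_changes(dc)
```

Keeping the notification small (only the task id) and pulling the full record afterwards means the consumer always reads the latest state, even if several changes arrive before it catches up.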
Decision Engine
Provides error detection, timeout alerts, and configurable policies (SLA, permissions, notifications). It continuously checks service progress, identifies abnormal or timed‑out tasks, and generates corresponding events.
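A check of this kind might look like the sketch below. The `TaskStatus` fields, the `SLA_SECONDS` limits, and the event dictionary layout are all assumptions made for illustration; the article does not specify the engine's actual data model or thresholds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskStatus:
    task_id: str
    stage: str
    started_at: float              # seconds since epoch
    finished: bool = False
    error_code: Optional[str] = None


# Hypothetical per-stage SLA limits, in seconds.
SLA_SECONDS = {"transcode": 1800, "review": 600}


def evaluate(task: TaskStatus, now: float) -> Optional[dict]:
    """Return an event dict if the task failed or exceeded its SLA, else None."""
    if task.error_code is not None:
        return {"type": "error", "task_id": task.task_id, "code": task.error_code}
    limit = SLA_SECONDS.get(task.stage)
    if not task.finished and limit is not None and now - task.started_at > limit:
        return {"type": "timeout", "task_id": task.task_id, "stage": task.stage}
    return None


now = 10_000.0
tasks = [
    TaskStatus("a", "transcode", started_at=now - 2000),   # over the 1800 s limit
    TaskStatus("b", "review", started_at=now - 100),        # still within SLA
    TaskStatus("c", "transcode", started_at=now - 10, error_code="E_CODEC"),
]
events = [e for e in (evaluate(t, now) for t in tasks) if e is not None]
```

Passing `now` in explicitly, rather than reading the clock inside `evaluate`, keeps the policy check deterministic and easy to test.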
Event Processing Engine (Beacon)
Beacon receives events from the decision engine and executes highly configurable workflows. Each event type (for example, the more than 50 transcoding error categories) maps to its own processing chain, notifications, and remediation strategy. The engine supports context handling, step execution, JSON‑based step configuration, and integration with email, messaging, and business APIs.
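To make the idea of JSON‑configured workflows concrete, here is a small sketch of an event dispatcher. It does not reproduce Beacon's real configuration schema, which the article does not show: the `WORKFLOWS` layout, the action names, and the handler signatures are hypothetical.

```python
import json

# Hypothetical workflow config: event type -> ordered list of steps.
WORKFLOWS = json.loads("""
{
  "timeout": [{"action": "retry"}, {"action": "notify", "channel": "email"}],
  "error":   [{"action": "notify", "channel": "message"}]
}
""")


def retry(event, step, context):
    # Placeholder for a call to a business API that re-submits the task.
    context["log"].append(f"retry {event['task_id']}")


def notify(event, step, context):
    # Placeholder for an email or messaging integration.
    context["log"].append(f"notify {step['channel']} about {event['task_id']}")


# Registry mapping action names used in the config to Python callables.
ACTIONS = {"retry": retry, "notify": notify}


def process(event: dict) -> dict:
    """Run every configured step for the event, sharing one context dict."""
    context = {"log": []}
    for step in WORKFLOWS.get(event["type"], []):
        ACTIONS[step["action"]](event, step, context)
    return context


result = process({"type": "timeout", "task_id": "a"})
```

Because the chain lives in configuration, adding a new event type or reordering steps would only require a config change, provided the actions it names are already registered.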
Machine Learning: Production Time Estimation
Using the collected historical production data, the system trains predictive models to estimate video production duration, improving the unmanned experience. The model applies only to stable‑resource tasks.
Features include categorical attributes such as businessType, channel, cloudEncode, priority, and programType, plus the numeric attribute duration. After data cleaning, an XGBoost regression model is trained; it uses gradient‑boosted decision trees to predict production time.
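The core idea behind gradient‑boosted trees can be illustrated with a toy example. The sketch below is not XGBoost and not iQIYI's model: it is a dependency‑free simplification that fits depth‑1 trees (stumps) to the residuals of a squared‑error loss, whereas XGBoost adds regularization, second‑order gradient information, and deeper trees. The training rows and target values are made up, and only two of the categorical features named above are used.

```python
from statistics import mean

CATEGORICAL = ["businessType", "priority"]  # subset of the article's features


def build_columns(rows):
    """Collect every (field, value) pair seen, to define the one-hot columns."""
    return sorted({(f, r[f]) for r in rows for f in CATEGORICAL})


def encode(row, columns):
    """One-hot encode categorical fields and append the numeric duration."""
    return [1.0 if row[f] == v else 0.0 for f, v in columns] + [row["duration"]]


def fit_stump(X, residuals):
    """Find the single (feature, threshold) split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lv, rv = mean(left), mean(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, thr, lv, rv)
    return None if best is None else best[1:]


def fit_boosted(X, y, rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base, stumps = mean(y), []
    preds = [base] * len(y)
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, thr, lv, rv = stump
        stumps.append(stump)
        preds = [p + learning_rate * (lv if row[j] <= thr else rv)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, lr = model
    return base + sum(lr * (lv if row[j] <= thr else rv) for j, thr, lv, rv in stumps)


def mse(preds, targets):
    return mean((p - t) ** 2 for p, t in zip(preds, targets))


# Made-up training data: video duration in seconds, production time in minutes.
rows = [
    {"businessType": "ugc", "priority": "high", "duration": 60.0},
    {"businessType": "ugc", "priority": "low", "duration": 120.0},
    {"businessType": "pgc", "priority": "high", "duration": 600.0},
    {"businessType": "pgc", "priority": "low", "duration": 1200.0},
]
y = [2.0, 5.0, 12.0, 30.0]

columns = build_columns(rows)
X = [encode(r, columns) for r in rows]
model = fit_boosted(X, y)
baseline_mse = mse([mean(y)] * len(y), y)
model_mse = mse([predict(model, x) for x in X], y)
```

In practice one would use the xgboost library with a held‑out validation set rather than a hand‑rolled booster; the sketch only shows why fitting successive trees to residuals improves on a constant prediction.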
Business Feedback
The system generates daily operational reports for each integrated business service, summarizing errors, fault statistics, SLA compliance, and detailed metrics. These reports enable services to iteratively improve and optimize their processes.
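A daily summary of this kind could be produced with a simple aggregation, sketched below. The input record fields (`service`, `within_sla`, `error`) and the report layout are hypothetical; the article does not describe the actual report format.

```python
from collections import Counter, defaultdict


def daily_report(task_results):
    """Summarise per-service task counts, SLA compliance, and error counts."""
    acc = defaultdict(lambda: {"total": 0, "within_sla": 0, "errors": Counter()})
    for r in task_results:
        entry = acc[r["service"]]
        entry["total"] += 1
        if r["within_sla"]:
            entry["within_sla"] += 1
        if r.get("error"):
            entry["errors"][r["error"]] += 1
    return {
        service: {
            "total": e["total"],
            "sla_compliance": e["within_sla"] / e["total"],
            "errors": dict(e["errors"]),
        }
        for service, e in acc.items()
    }


# Made-up results for one day.
results = [
    {"service": "transcode", "within_sla": True, "error": None},
    {"service": "transcode", "within_sla": False, "error": "E_TIMEOUT"},
    {"service": "review", "within_sla": True, "error": None},
    {"service": "transcode", "within_sla": True, "error": None},
]
report = daily_report(results)
```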
Results and Future Directions
Deployed across iQIYI's content middle‑platform, the unmanned system covers the critical production stages and has reached an unmanned rate of over 99%. It has detected more than 3,000 issues and automatically handled over 2,800 of them. Future work will expand from point‑level events to whole‑program monitoring, offering end‑to‑end fault detection and recovery.
