Design and Optimization of iQIYI Mobile APM Network Monitoring System
The iQIYI mobile APM system provides real‑time, user‑level network monitoring with classified error detection, cloud‑controlled SDK sampling, backend storage with latency on the order of seconds, and web dashboards. Combined with a three‑layer DNS cache, weak‑network grading, gateway multiplexing, super‑pipeline proxies, and layered retry strategies, it reduced network error rates from 5.3 % to 0.48 % on Android and from 4.63 % to 0.35 % on iOS.
Background: Enterprises need to monitor the quality and performance of their online applications at the code level. This need gave rise to Application Performance Monitoring (APM) systems, which have become essential infrastructure for internet companies.
The iQIYI mobile APM system was built to provide real‑time, user‑level network monitoring, covering error rate, hijack rate, and network performance. It complements backend monitoring and includes modules for crash, network, stutter, log, memory, image, etc.
System Design
Network errors are classified into three categories: network‑layer errors (e.g., ConnectException), HTTP response errors (e.g., 404), and parsing errors (e.g., malformed data despite a 200 response). When several requests belong to one group of data, the result of the last request determines whether the group counts as a success or a failure.
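The three categories and the "last request decides" rule can be sketched as follows. This is a minimal illustration, not the SDK's actual code: the `RequestRecord` fields, the `ErrorKind` names, and the `status >= 400` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class ErrorKind(Enum):
    NONE = "none"        # request succeeded
    NETWORK = "network"  # network-layer failure, e.g. a ConnectException
    HTTP = "http"        # a response arrived with an error status, e.g. 404
    PARSE = "parse"      # 200 response whose body could not be parsed


@dataclass
class RequestRecord:
    # Hypothetical record shape; the real SDK fields are not given in the article.
    exception: Optional[str] = None  # name of a network-layer exception, if any
    status: Optional[int] = None     # HTTP status code, if a response arrived
    parsed_ok: bool = True           # whether the body parsed as expected


def classify(record: RequestRecord) -> ErrorKind:
    """Map one request record onto the three error categories."""
    if record.exception is not None or record.status is None:
        return ErrorKind.NETWORK
    if record.status >= 400:  # assumed threshold for "HTTP response error"
        return ErrorKind.HTTP
    if not record.parsed_ok:
        return ErrorKind.PARSE
    return ErrorKind.NONE


def group_outcome(records: Sequence[RequestRecord]) -> ErrorKind:
    """The outcome of a group is the classification of its last request."""
    if not records:
        raise ValueError("a group needs at least one request record")
    return classify(records[-1])
```

Under this sketch, a group whose first request fails with a connection exception and whose final request returns a parseable 200 response is counted as a success.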
The SDK samples data to balance device overhead against data accuracy. Sampling is controlled from the cloud, and reports are compressed in batches and retried on upload failure.
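One possible shape for these three mechanisms is shown below. The article does not say how devices are selected, which compression format is used, or how many retries are made, so the hash‑bucket sampling, gzip, and the `max_attempts` default here are assumptions; `send` stands in for whatever upload call the SDK uses.

```python
import gzip
import hashlib
import json
from typing import Callable, List


def is_sampled(device_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether a device reports, given a cloud-supplied rate.

    Hashing the device ID keeps the decision stable across launches (assumed design).
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # bucket in 0..9999
    return bucket < sample_rate * 10_000


def compress_batch(records: List[dict]) -> bytes:
    """Serialize a batch of records compactly and gzip it for upload."""
    payload = json.dumps(records, separators=(",", ":")).encode("utf-8")
    return gzip.compress(payload)


def upload_with_retry(body: bytes, send: Callable[[bytes], bool], max_attempts: int = 3) -> bool:
    """Call send(body) until it reports success or the attempts run out."""
    for _ in range(max_attempts):
        if send(body):
            return True
    return False
```

Because the sampling rate is an argument rather than a constant, a server‑side configuration change can widen or narrow the sample without shipping a new app version.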
The backend supports real‑time ingestion with latency on the order of seconds, large‑scale storage (tens of millions of records), flexible queries, and multi‑dimensional alerts at minute granularity, delivered by email, SMS, and other channels.
The web console provides a quick overview, top lists by error count and error rate, and detailed analysis pages.
Optimization
DNS optimization includes a three‑layer cache (memory + network + local persistence), HTTPDNS to avoid ISP DNS hijacking, and TTL management.
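A layered lookup with TTL handling might look like the sketch below. The lookup order (memory, then local persistence, then a network resolver such as HTTPDNS) is an assumption, as are the class and field names; `resolve_remote` is a hypothetical callable returning `(ips, ttl_seconds)`, and a plain dict stands in for persistent storage.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class DnsEntry:
    ips: List[str]
    expires_at: float  # absolute expiry time in seconds


class LayeredDnsCache:
    def __init__(
        self,
        resolve_remote: Callable[[str], Tuple[List[str], float]],
        disk: Optional[Dict[str, DnsEntry]] = None,
        clock: Callable[[], float] = time.time,
    ):
        self.memory: Dict[str, DnsEntry] = {}
        self.disk = disk if disk is not None else {}  # stand-in for local persistence
        self.resolve_remote = resolve_remote          # e.g. an HTTPDNS query (assumed)
        self.clock = clock

    def _fresh(self, entry: Optional[DnsEntry]) -> bool:
        return entry is not None and entry.expires_at > self.clock()

    def lookup(self, host: str) -> List[str]:
        # Layer 1: in-memory cache.
        entry = self.memory.get(host)
        if self._fresh(entry):
            return entry.ips
        # Layer 2: persisted cache, promoted back into memory on a hit.
        entry = self.disk.get(host)
        if self._fresh(entry):
            self.memory[host] = entry
            return entry.ips
        # Layer 3: network resolution; store the result in both caches with its TTL.
        ips, ttl = self.resolve_remote(host)
        entry = DnsEntry(ips, self.clock() + ttl)
        self.memory[host] = entry
        self.disk[host] = entry
        return ips
```

Injecting `clock` keeps expiry logic testable without waiting for real TTLs to elapse.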
The weak‑network model defines six grades based on uplink/downlink speed, failure rate, latency, and other factors; if any factor is graded VERY_POOR or POOR, the network is treated as weak.
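The "any factor" rule can be expressed in a few lines. Only VERY_POOR and POOR are named in the article; the other four grade names below are placeholders, and the thresholds that map raw measurements to grades are not given, so the sketch takes already‑graded factors as input.

```python
from enum import IntEnum
from typing import Mapping


class Grade(IntEnum):
    VERY_POOR = 0
    POOR = 1
    FAIR = 2       # placeholder name, not from the article
    GOOD = 3       # placeholder name, not from the article
    VERY_GOOD = 4  # placeholder name, not from the article
    EXCELLENT = 5  # placeholder name, not from the article


WEAK_GRADES = frozenset({Grade.VERY_POOR, Grade.POOR})


def is_weak_network(factor_grades: Mapping[str, Grade]) -> bool:
    """A network is weak as soon as any single factor is graded VERY_POOR or POOR."""
    return any(grade in WEAK_GRADES for grade in factor_grades.values())
```

For example, a connection with excellent downlink speed but POOR latency is still classified as weak, since one bad factor is enough.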
Weak‑network optimizations comprise Brotli compression, reduced concurrency, priority queues for important domains, and payload size reduction.
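Two of these optimizations, priority queues for important domains and reduced concurrency, can be combined in one small scheduler. The sketch below is illustrative only: the class name, the two priority levels, and the concurrency limits of 6 and 2 are assumptions, not values from the article.

```python
import heapq
import itertools
from typing import List, Set, Tuple
from urllib.parse import urlsplit


class WeakNetworkQueue:
    def __init__(self, important_domains: Set[str]):
        self.important_domains = important_domains
        self._heap: List[Tuple[int, int, str]] = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def push(self, url: str) -> None:
        """Queue a request; important domains get the higher priority (lower number)."""
        host = urlsplit(url).hostname or ""
        priority = 0 if host in self.important_domains else 1
        heapq.heappush(self._heap, (priority, next(self._order), url))

    def next_batch(self, weak: bool, normal_limit: int = 6, weak_limit: int = 2) -> List[str]:
        """Pop the next requests to run, allowing fewer at once on a weak network."""
        limit = weak_limit if weak else normal_limit
        batch: List[str] = []
        while self._heap and len(batch) < limit:
            batch.append(heapq.heappop(self._heap)[2])
        return batch
```

On a weak network this lets a playback‑critical API call go out ahead of image requests that were queued earlier, while keeping the number of in‑flight requests low.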
The gateway solution introduces an intermediary service that multiplexes long‑lived connections. It achieves 0‑RTT multi‑path reuse, avoids DNS hijacking, and uses a private protobuf‑encoded protocol, reducing average request latency from 595 ms to 371 ms.
The super‑pipeline is a low‑cost, high‑availability proxy that connects directly by IP over HTTP/HTTPS; in one failure case it reduced the error rate from 28.96 % to 3.95 %.
Retry strategies include raw retry, HTTPS downgrade, HTTP/2 downgrade, IP direct retry, and super‑pipeline fallback.
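A layered retry can be modeled as an ordered chain of strategies, each tried only after the previous one fails. The sketch below assumes the strategies run in the order listed above, which the article does not confirm; the strategy callables are hypothetical stand‑ins that return a response body or `None`.

```python
from typing import Callable, List, Optional, Tuple

# A strategy is a (name, attempt) pair; attempt(url) returns a body or None on failure.
Strategy = Tuple[str, Callable[[str], Optional[bytes]]]


def fetch_with_fallbacks(url: str, strategies: List[Strategy]) -> Tuple[Optional[bytes], List[str]]:
    """Try each strategy in order and stop at the first one that returns a body.

    Returns the body (or None if every strategy failed) and the names tried.
    """
    tried: List[str] = []
    for name, attempt in strategies:
        tried.append(name)
        try:
            body = attempt(url)
        except OSError:  # covers ConnectionError, timeouts, and similar I/O failures
            body = None
        if body is not None:
            return body, tried
    return None, tried
```

A caller would pass a list such as raw retry, HTTPS downgrade, HTTP/2 downgrade, IP direct retry, and super‑pipeline fallback; the returned list of tried names can be reported to the monitoring backend to show which layer finally succeeded.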
Additional techniques such as competitive connections, TLS 1.3, and connection pre‑building further improve reliability.
Results
After deployment, network error rates in the iQIYI app fell from 5.3 % to 0.48 % on Android and from 4.63 % to 0.35 % on iOS, more than a tenfold reduction on both platforms.
The system is now used across multiple iQIYI apps (e.g., iQIYI, iQIYI Movie Ticket, iQIYI Show), and future work includes full‑link monitoring, QUIC integration, and further performance enhancements.
iQIYI Technical Product Team