High Availability and Disaster Recovery Practices of iQIYI's Video Relay Service (VRS)
iQIYI’s Video Relay Service ensures uninterrupted video playback by employing a two‑region, three‑center hybrid cloud architecture, multi‑layer storage, cross‑AZ retry mechanisms, protective rate‑limiting and degradation paths, layered monitoring, and rigorous stress‑testing and chaos engineering to achieve high availability and disaster recovery.
iQIYI's Video Relay Service (VRS) is the entry point for all video playback functions on the platform. It provides playback policy control, stream selection, and video file address delivery. Because VRS is critical to user experience, its fault tolerance and recovery capabilities must be extremely robust.
The article first outlines the VRS system architecture and then details the practical measures taken to achieve high availability and disaster recovery across six aspects: system deployment, storage architecture, retry mechanisms, system resilience, monitoring & inspection, and capability verification.
System Deployment: VRS adopts a two‑region, three‑center strategy, combining public and private clouds with both containers and virtual machines. Each availability zone (AZ) can operate independently, providing full service capacity. The deployment includes:
Each AZ is self‑sufficient, with enough capacity to absorb the entire site's peak traffic on its own.
All carrier entrances are supported in each region and AZ.
Traffic switching at the gateway layer (layer‑7 load balancer) for AZ or region failures, with DNS‑level fallback when the gateway itself fails.
Hybrid deployment mixes containers and VMs, allowing flexible traffic routing between them via the gateway layer. Multi‑cloud deployment separates workloads between private and public clouds, with independent units for the Dash entry service and health‑check‑driven instance management. A unified traffic‑cutover plan is defined for single‑cloud failures.
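The health‑check‑driven cutover described above can be sketched as follows. This is a minimal illustration, not iQIYI's implementation: the AZ names, weights, and function names are invented, and a real gateway would probe an actual health endpoint rather than a set membership check.

```python
# Hypothetical AZ names and weights; the real gateway config is not
# described in the article, so this is only an illustrative sketch.
AZ_WEIGHTS = {"az-a": 50, "az-b": 50}

def health_check(az: str, healthy: set) -> bool:
    """Stand-in probe: in production this would hit a health endpoint."""
    return az in healthy

def rebalance(weights: dict, healthy: set) -> dict:
    """Shift all traffic to healthy AZs; each AZ is sized for full peak load."""
    alive = {az: w for az, w in weights.items() if health_check(az, healthy)}
    if not alive:
        # No healthy AZ behind this gateway: escalate to DNS-level fallback
        raise RuntimeError("no healthy AZ: fall back at the DNS layer")
    total = sum(alive.values())
    return {az: round(100 * w / total) for az, w in alive.items()}

print(rebalance(AZ_WEIGHTS, {"az-a"}))  # az-b down -> {'az-a': 100}
```

Because each AZ is provisioned for full peak load, shifting 100% of traffic to the surviving AZ is safe by design.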
Storage Architecture: To mitigate database‑related failures, VRS uses multi‑layer caching and heterogeneous backup storage. The stack includes a self‑developed KV store (HiKV), MySQL, MongoDB, and Couchbase (multi‑write for synchronization). The design ensures that high‑concurrency requests do not overwhelm the underlying storage.
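A read‑through lookup across cache layers, as described above, might look like this sketch. The layer names follow the article (a cache tier in front of MySQL/MongoDB), but the `LayeredStore` API and backfill behavior are invented for illustration.

```python
# Illustrative read-through across storage layers; a hit at a lower layer
# backfills the faster layers so hot keys stop reaching the database.
class LayeredStore:
    def __init__(self, layers):
        # layers: ordered list of (name, dict) from fastest to slowest
        self.layers = layers

    def get(self, key):
        missed = []
        for name, store in self.layers:
            if key in store:
                value = store[key]
                # Backfill the faster layers that missed
                for _, upper in missed:
                    upper[key] = value
                return value
            missed.append((name, store))
        return None

cache, kv, db = {}, {}, {"vid:42": "play-url"}
store = LayeredStore([("couchbase", cache), ("hikv", kv), ("mysql", db)])
store.get("vid:42")        # falls through to MySQL, backfills both caches
assert "vid:42" in cache   # repeat reads now stay in the cache layer
```

The backfill step is what keeps high‑concurrency repeat requests from overwhelming the underlying database.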
Retry Mechanisms:
Service‑side cross‑AZ retry: When a microservice call fails (timeout, error code, or invalid response), the request is automatically retried in a neighboring AZ. In case of massive failures, a circuit‑breaker directs traffic to the healthy AZ.
Client‑side retry via a dedicated Retry domain: A separate DNS name resolves to a different AZ than the primary domain. If the server indicates a retry is needed, the client re‑issues the request using the Retry domain, enabling client‑initiated cross‑AZ recovery.
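The service‑side mechanism above can be sketched as a caller that retries in a neighboring AZ and opens a simple per‑AZ breaker after repeated failures. Class and method names here are hypothetical; the real implementation and its breaker thresholds are not described in the article.

```python
# Sketch of cross-AZ retry with a naive circuit breaker: failed calls
# are retried in the next AZ, and an AZ that keeps failing is skipped.
class CrossAZCaller:
    def __init__(self, azs, failure_threshold=3):
        self.azs = list(azs)                  # preferred order, local AZ first
        self.failures = {az: 0 for az in azs}
        self.threshold = failure_threshold

    def call(self, az, request, backends):
        # Stand-in for an RPC; raises when the AZ's backend is down
        if not backends.get(az, False):
            raise ConnectionError(f"{az} unavailable")
        return f"{az} handled {request}"

    def invoke(self, request, backends):
        # Skip AZs whose breaker is open, then try the rest in order
        candidates = [az for az in self.azs
                      if self.failures[az] < self.threshold]
        for az in candidates:
            try:
                result = self.call(az, request, backends)
                self.failures[az] = 0         # success closes the breaker
                return result
            except ConnectionError:
                self.failures[az] += 1        # cross-AZ retry on failure
        raise RuntimeError("all AZs failed")

caller = CrossAZCaller(["az-a", "az-b"])
backends = {"az-a": False, "az-b": True}      # az-a is down
caller.invoke("play/123", backends)           # retried and served from az-b
```

The client‑side Retry domain plays the same role from outside the data center: because it resolves to a different AZ than the primary domain, a client‑initiated retry lands on healthy infrastructure even when the original entry point is dead.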
System Resilience includes two main techniques:
Protection: Per‑service rate limiting on QPS, thread count, and CPU utilization prevents overload‑induced crashes. Traffic is also graded, giving lower priority to pre‑load or auto‑play requests, which can be throttled without affecting core user playback.
Degradation: When critical services (e.g., DRM) become unavailable, VRS falls back to a clear‑stream service or a Java‑based heterogeneous playback system, ensuring basic playback functionality.
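Graded limiting and degradation combine naturally, as in this sketch: low‑priority pre‑load traffic is shed first, and a DRM outage degrades to the clear‑stream path. The reserve fraction, class names, and thresholds are illustrative assumptions, not iQIYI's actual values.

```python
# Sketch of graded rate limiting: shed low-priority (pre-load/auto-play)
# traffic first so interactive playback survives overload, and degrade
# to the clear-stream service when DRM is unavailable.
class GradedLimiter:
    def __init__(self, qps_limit):
        self.qps_limit = qps_limit
        self.tokens = qps_limit           # replenished each second in reality

    def allow(self, priority):
        # Reserve the last 20% of capacity for high-priority requests
        reserve = 0.2 * self.qps_limit
        if self.tokens <= 0:
            return False
        if priority == "low" and self.tokens <= reserve:
            return False                  # throttle pre-load / auto-play
        self.tokens -= 1
        return True

def serve(request, drm_available, limiter):
    if not limiter.allow(request["priority"]):
        return "throttled"
    if drm_available:
        return "encrypted-stream"
    return "clear-stream"                 # degradation path when DRM is down

limiter = GradedLimiter(qps_limit=10)
print(serve({"priority": "high"}, False, limiter))  # -> clear-stream
```

The key property is that throttling and degradation are ordered: capacity protection happens first, then functional fallback, so an overloaded VRS never crashes outright and a DRM outage never blocks basic playback.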
Monitoring & Inspection: The monitoring system is divided into four layers—basic (CPU, memory, bandwidth), service (QPS, error rate, latency), external dependency, and functional (playback duration ratios). Alerts are routed via IM or phone. A proactive inspection dashboard shows global VRS status and detailed metric views, enabling early detection of small incidents before they cascade.
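The four‑layer split above amounts to evaluating each layer's metrics against its own thresholds. A minimal sketch follows; every metric name and threshold here is made up for illustration, since the article does not publish iQIYI's actual alert rules.

```python
# Toy four-layer alert evaluation matching the layers described above.
THRESHOLDS = {
    "basic":      {"cpu_pct": 85, "mem_pct": 90},
    "service":    {"error_rate_pct": 1, "p99_latency_ms": 500},
    "dependency": {"db_error_rate_pct": 2},
    "functional": {"playback_success_drop_pct": 5},
}

def evaluate(layer, metrics):
    """Return the metrics in a layer that breached their threshold."""
    limits = THRESHOLDS[layer]
    return [m for m, v in metrics.items() if v > limits.get(m, float("inf"))]

breaches = evaluate("service", {"error_rate_pct": 3.2, "p99_latency_ms": 120})
print(breaches)  # ['error_rate_pct'] -> routed to IM or phone alerting
```

Evaluating layers independently is what lets a functional‑layer signal (e.g., a drop in playback success) surface even when every basic‑layer metric still looks healthy.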
Capability Verification:
Full‑link stress testing: Real‑traffic‑derived wordlists generate realistic load; a Hadoop/Spark‑based platform provides massive concurrency; coordination with upstream services ensures capacity matching; network isolation limits impact on production.
Chaos engineering: Regular fault injection tests target databases, middleware, external dependencies, AZ failures, and traffic spikes to validate disaster‑recovery mechanisms and rate‑limiting effectiveness.
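In the spirit of the chaos drills above, a fault‑injection wrapper can make a dependency fail with a configurable probability to exercise the retry and degradation paths. This is a toy sketch; the wrapper API and the `fetch_drm_license` dependency are invented for illustration.

```python
import random

# Toy fault injection: with probability failure_rate, a dependency call
# raises, forcing the caller's retry/degradation logic to engage.
def inject_faults(func, failure_rate, rng=random.random):
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected dependency failure")
        return func(*args, **kwargs)
    return wrapped

def fetch_drm_license(vid):
    return f"license-for-{vid}"

flaky = inject_faults(fetch_drm_license, failure_rate=0.3)
try:
    flaky("vid-42")
except ConnectionError:
    pass  # the drill verifies that degradation kicks in here
```

Injecting the pluggable `rng` makes drills reproducible: a fixed source can force 0% or 100% failure, so the same scenario can be replayed until the disaster‑recovery behavior is verified.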
In conclusion, VRS’s high‑availability design—spanning multi‑region deployment, hybrid cloud, robust storage, sophisticated retry and degradation strategies, comprehensive monitoring, and rigorous testing—demonstrates a practical, production‑validated approach to building resilient backend services for large‑scale video streaming.
iQIYI Technical Product Team