Full‑Link Load Testing of iQIYI Playback Service: Process, Tools, and Outcomes

iQIYI ran full-link load tests on its playback service, using LoadMaker to generate traffic and Rover to control the link. The team mapped the link topology, built weighted user scenarios, and safely applied load to the online environment. The tests validated capacity at multiple times the historical peak, uncovered bottlenecks, and led to several performance and disaster-recovery improvements without affecting real users.

iQIYI Technical Product Team

May 10, 2024

The playback chain is iQIYI's most critical service. As the user base grows and popular dramas are promoted, the playback pipeline often faces unpredictable traffic spikes that directly affect the viewing experience, which makes full-link load testing essential.

Why link‑level load testing? Single‑machine or single‑system tests cannot reflect the real capacity of the online environment because online capacity is influenced by network communication, resource contention, database interactions, cluster management, cache efficiency, and load imbalance. Moreover, user requests traverse the entire link, and capacity planning for each subsystem may be misaligned, making it difficult to locate bottlenecks without a full‑link approach.

Online vs. test environment – Online testing is more accurate because the production environment is far more complex than any test‑bed. Replicating the exact configuration is costly and often infeasible.

Risk mitigation for online testing – Prevent test data from contaminating production data, avoid polluting caches, do not trigger unintended circuit‑breaker or degradation mechanisms, and ensure external dependencies are not overwhelmed.
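The isolation rules above can be illustrated with a minimal sketch. It assumes test requests carry a marker header and that writes and cache keys are redirected when the marker is present. The header name, the `shadow_` prefix, and the `lt:` cache namespace are hypothetical; the article does not describe how Rover actually tags or routes test traffic.

```python
from dataclasses import dataclass, field

# Hypothetical conventions, chosen only for illustration.
LOAD_TEST_HEADER = "x-load-test"
SHADOW_PREFIX = "shadow_"
TEST_CACHE_NAMESPACE = "lt:"


@dataclass
class Request:
    user_id: str
    headers: dict = field(default_factory=dict)


def is_load_test(request: Request) -> bool:
    """Treat a request as test traffic only if it carries the marker header."""
    return request.headers.get(LOAD_TEST_HEADER) == "1"


def resolve_table(request: Request, table: str) -> str:
    """Route test traffic to a shadow table so production rows stay untouched."""
    return f"{SHADOW_PREFIX}{table}" if is_load_test(request) else table


def cache_key(request: Request, key: str) -> str:
    """Namespace cache keys for test traffic to avoid polluting real entries."""
    return f"{TEST_CACHE_NAMESPACE}{key}" if is_load_test(request) else key
```

The marker has to be propagated to every downstream call for this to work, which is one reason the link topology is mapped before any load is applied.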

Practice process

Map the playback link topology: identify core services, risk points, technical dependencies, and key data needed for scenario construction.

Tool research: LoadMaker generates the load (high QPS, scheduling of large word-lists, stepped ramp-up and pause controls), and Rover controls the link (traffic interception, data isolation via shadow databases and tables, and monitoring of QPS, latency, CPU, memory, middleware, and upstream services).

Construct test scenarios: define user types, video attributes (duration, pay type, membership), and generate weighted word‑lists using cross‑combination of attributes.
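One way to picture the cross-combination step is the sketch below. It assumes each attribute value has a weight and that attributes are independent, so a combination's weight is the product of its parts. The attribute values and weights are made up for illustration; the article does not publish the real ratios or the generator's design.

```python
import itertools
import math
import random

# Hypothetical attribute weights; each attribute's weights sum to 1.
ATTRIBUTES = {
    "user_type": {"member": 0.4, "non_member": 0.6},
    "duration": {"short": 0.3, "long": 0.7},
    "pay_type": {"free": 0.5, "paid": 0.5},
}


def cross_combine(attributes):
    """Return (combination, weight) pairs for every attribute combination."""
    names = list(attributes)
    combos = []
    for picks in itertools.product(*(attributes[n].items() for n in names)):
        combo = {name: value for name, (value, _) in zip(names, picks)}
        # Independence assumption: multiply the per-attribute weights.
        weight = math.prod(w for _, w in picks)
        combos.append((combo, weight))
    return combos


def sample_word_list(combos, size, seed=0):
    """Draw `size` entries so each combination appears in proportion to its weight."""
    rng = random.Random(seed)
    population = [combo for combo, _ in combos]
    weights = [weight for _, weight in combos]
    return rng.choices(population, weights=weights, k=size)
```

If the real attributes are correlated (for example, members watching more paid content), the weights would need to be assigned per combination rather than multiplied.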

Address the word-list generation bottleneck by containerizing the generation process on Jenkins/IKS, so that multiple containers generate word-lists in parallel.
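Parallel generation can be sketched as deterministic sharding: each container receives a shard index and produces only its own slice of rows. This is an assumption about how the work might be split, not a description of the actual Jenkins/IKS jobs, and the row fields are hypothetical.

```python
import random


def shard_range(total_rows, num_shards, shard_index):
    """Rows owned by one container; shards are contiguous and do not overlap."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    base, extra = divmod(total_rows, num_shards)
    # The first `extra` shards each take one additional row.
    start = shard_index * base + min(shard_index, extra)
    stop = start + base + (1 if shard_index < extra else 0)
    return range(start, stop)


def generate_shard(total_rows, num_shards, shard_index, seed=42):
    """Generate this shard's rows; per-row seeding keeps reruns reproducible."""
    rows = []
    for row_id in shard_range(total_rows, num_shards, shard_index):
        rng = random.Random(seed * 1_000_003 + row_id)
        rows.append({"row_id": row_id, "video_id": rng.randrange(1_000_000)})
    return rows
```

Because each row depends only on its own ID and the seed, a failed container can be rerun on its own without regenerating the other shards.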

Testing objectives – Evaluate playback link capacity (QPS) under different scenarios, for example a Spring Festival Gala scenario in which long videos (over 4 hours) account for 50% of traffic, or a hot-drama scenario in which 60% of traffic is paid. Set the target QPS from historical peaks, aiming for at least n times the historical maximum.
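The target arithmetic can be written out as a short sketch. The historical peak and the multiplier used in the usage note are placeholders, since the article gives neither figure. The 60% paid share comes from the hot-drama scenario, and the 40% remainder is assumed to be free traffic.

```python
def target_qps(historical_peak_qps, multiplier):
    """Target load: a chosen multiple of the historical peak."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return historical_peak_qps * multiplier


def split_by_mix(total_qps, mix):
    """Split a total QPS target across traffic categories by share."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return {name: total_qps * share for name, share in mix.items()}
```

For example, with a placeholder peak of 10,000 QPS and a placeholder multiplier of 3, `split_by_mix(target_qps(10_000, 3), {"paid": 0.6, "free": 0.4})` assigns roughly 18,000 QPS to paid playback and 12,000 QPS to free playback.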

Testing plan

Prepare a checklist covering all pre‑test steps, environment validation, and contingency measures.

Execute a staged pressurization strategy: gradual ramp‑up, traffic bursts, real‑time monitoring, and immediate stop if issues arise.
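A staged ramp with an immediate stop can be sketched as follows. The `apply_load` and `is_healthy` callbacks are hypothetical stand-ins for the load generator and the monitoring checks; they are not LoadMaker or Rover APIs.

```python
def ramp_stages(start_qps, target_qps, steps):
    """Evenly spaced QPS levels from start to target, both inclusive."""
    if steps < 2:
        raise ValueError("need at least two steps")
    span = target_qps - start_qps
    return [start_qps + span * i // (steps - 1) for i in range(steps)]


def run_ramp(stages, apply_load, is_healthy):
    """Apply each stage in order and stop at the first failed health check.

    Returns the last QPS level that passed, or None if the first stage failed.
    """
    last_ok = None
    for qps in stages:
        apply_load(qps)
        if not is_healthy(qps):
            apply_load(0)  # drop the load immediately
            return last_ok
        last_ok = qps
    return last_ok
```

A traffic burst can be expressed in the same structure by inserting a higher level between two stages of the list before passing it to `run_ramp`.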

Collect metrics from Rover and business monitoring platforms: QPS, latency, CPU, memory, upstream service QPS, business error rate, circuit‑breaker triggers, cache hit rate, and database/middleware performance.

Analyze data to assess capacity, identify bottlenecks, and evaluate disaster‑recovery mechanisms.
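As a rough sketch of the capacity assessment, the function below takes per-stage measurements and reports the highest QPS at which that stage and every lower stage stayed within limits. The record fields and the thresholds are hypothetical; the article does not state its pass criteria.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    qps: int
    p99_latency_ms: float
    error_rate: float


def max_sustainable_qps(results, max_p99_ms, max_error_rate):
    """Highest QPS where this stage and all lower stages met both limits."""
    best = None
    for stage in sorted(results, key=lambda s: s.qps):
        if stage.p99_latency_ms > max_p99_ms or stage.error_rate > max_error_rate:
            break  # first violation marks the capacity ceiling
        best = stage.qps
    return best
```

The stage at which the first violation appears is where bottleneck investigation would start, using the CPU, cache hit rate, and middleware metrics collected for that stage.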

Results

First online full‑link load test of the playback chain with zero impact on real users.

Validated capacity at multiple times the historical peak QPS.

Coordinated 7 capacity expansions, 5 disaster-recovery optimizations, and 4 performance fixes.

Ensured stable playback during high‑traffic events such as the Spring Festival Gala and popular dramas.

Summary and outlook – After two quarters of cross-team collaboration, the playback full-link load-testing capability reached a milestone. Future work includes improving user simulation, enhancing word-list generation, supporting load generation from the external network, and coordinating tests across multiple links and scenarios.
