Accelerating 4K Video Super‑Resolution with TensorRT: iQIYI’s Optimization and Production Practices
iQIYI optimized a 4K video super-resolution model with TensorRT, using graph splitting, operator fusion, custom CUDA kernels, and int8 quantization to achieve a roughly tenfold speedup (≈180 ms per 1080p frame), demonstrating the potential of deep TensorRT customization for large-scale production.
With the upgrade of end-user playback devices, audience expectations for video quality have risen from HD to 4K and now to 8K. Beyond higher resolution, video capture often introduces artifacts such as motion blur, compression loss, lighting issues, and noise. Traditional interpolation-based up-scaling methods are fast but produce blurry results, and denoising is difficult. Deep-learning-based approaches, thanks to their large parameter space, can better fit the denoising process and simultaneously improve resolution and visual detail.
However, deep‑learning models dramatically increase inference time, sometimes taking hours or days per video, which is unacceptable for large‑scale production. This article describes iQIYI’s experience in optimizing a 4K super‑resolution model on GPUs, achieving a ten‑fold speedup.
Challenges of Deploying Complex Models
Starting with the Volta architecture, Nvidia introduced Tensor Cores, domain-specific accelerators (DSAs) that speed up the matrix operations at the heart of deep learning. Although Tensor Cores greatly reduce inference latency, their programming model is complex. To lower the barrier, Nvidia built TensorRT, which hand-crafts kernels for specific tensor shapes and exposes them via a high-level API.
TensorRT compiles a model by selecting the fastest kernel for each operation and packs them into a binary engine. At runtime, the engine loads the selected kernels and executes them in the model’s inference order.
Despite its convenience, TensorRT has limitations: its API only accepts NCHW tensors, while Tensor Cores prefer NHWC, so layout conversions are inserted around the accelerated kernels and hurt performance. Moreover, many operators (e.g., PixelShuffle, deformable convolutions) are not directly convertible to TensorRT and require custom CUDA kernels.
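The layout mismatch is easy to see in index arithmetic. Here is a minimal plain-Python sketch (the function name is ours, not a TensorRT API) of the NCHW→NHWC conversion TensorRT must insert around Tensor-Core kernels; on a GPU, each such conversion costs an extra kernel launch plus a full read and write of the tensor:

```python
def nchw_to_nhwc(data, n, c, h, w):
    """Reorder a flat NCHW buffer into NHWC order (pure index permutation)."""
    out = [0] * (n * c * h * w)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi   # NCHW offset
                    dst = ((ni * h + hi) * w + wi) * c + ci   # NHWC offset
                    out[dst] = data[src]
    return out

# A 1x2x2x2 tensor: channel 0 holds [0..3], channel 1 holds [4..7].
# After conversion the two channels are interleaved per pixel.
print(nchw_to_nhwc(list(range(8)), 1, 2, 2, 2))  # [0, 4, 1, 5, 2, 6, 3, 7]
```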
iQIYI’s Practical Optimizations
iQIYI analyzed TensorRT’s internals and applied two main strategies, dubbed “Split” (拆) and “Merge” (合):
1. Model Split (拆) – The full computation graph is partitioned into sub‑graphs that can be exported to ONNX and compiled into separate TensorRT engines. Unsupported operators are bridged with custom CUDA kernels. For the EDVR model, sub‑graphs containing deformable convolutions (DCN) and PixelShuffle were isolated, re‑implemented in PyTorch, and exported to ONNX.
Figure 1: Splitting the EDVR computation graph.
2. Operator Fusion (合) – For the PixelShuffle operator, TensorRT would insert three kernels (NCHW→NHWC reshape, PixelShuffle, NHWC→NCHW reshape). By merging these reshapes into a single kernel, the redundant kernels are eliminated, reducing latency.
Figure 2: Fusion of PixelShuffle in TensorRT.
The fusion required deep inspection of TensorRT's closed-source kernels; by recording the execution trace (much like a tape recorder) and manually optimizing the exposed kernels, the three-kernel sequence was replaced by a single optimized kernel.
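The effect of the fusion can be sketched in plain Python (names hypothetical, not iQIYI's actual kernel): instead of materializing an NHWC copy, shuffling, and converting back, the layout handling is folded into PixelShuffle's own index arithmetic, so the whole operation becomes one pass over the buffer:

```python
def pixel_shuffle_fused(data, n, c_out, r, h, w):
    """PixelShuffle (upscale factor r) on a flat NCHW buffer of shape
    (n, c_out * r * r, h, w), done as a single index-mapping pass
    rather than reshape -> shuffle -> reshape."""
    H, W = h * r, w * r
    out = [0] * (n * c_out * H * W)
    for ni in range(n):
        for co in range(c_out):
            for ho in range(H):
                for wo in range(W):
                    # The source channel encodes the (ho % r, wo % r) sub-pixel.
                    ci = co * r * r + (ho % r) * r + (wo % r)
                    src = ((ni * c_out * r * r + ci) * h + ho // r) * w + wo // r
                    dst = ((ni * c_out + co) * H + ho) * W + wo
                    out[dst] = data[src]
    return out

# Four 1x1 input channels collapse into one 2x2 output channel.
print(pixel_shuffle_fused([0, 1, 2, 3], 1, 1, 2, 1, 1))  # [0, 1, 2, 3]
```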
TensorRT Int8 Inference
TensorRT has supported int8 quantization since version 4, initially via the Pascal-era dp4a instruction; later architectures (Turing, Ampere) added Tensor-Core paths for int8, int4, and int1. Quantization in TensorRT follows a post-training workflow: weights are scaled to int8 using a factor S_W, and activations are scaled with factors S_I and S_O derived from calibration data.
The int8 convolution computation can be expressed as:

O_Q = (I_Q · W_Q · S_I / S_W + B) · S_O

which is rearranged to:

O_Q = I_Q · W_Q · (S_O · S_I / S_W) + B · S_O
TensorRT merges S_O · S_I / S_W into a single coefficient and stores B · S_O in the engine, reducing the per-element work to a single fused multiply-add.
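A small numeric sketch of this folding, using the scale conventions from the formulas above (values are illustrative, and this is plain Python rather than TensorRT's actual engine code):

```python
# Conventions from the article: W_Q = round(W * S_W), I = I_Q * S_I,
# O_Q = O * S_O. Scales and values below are made up for illustration.

def quant(x, scale):
    """Scale a float to int8, clamping to [-128, 127]."""
    return max(-128, min(127, round(x * scale)))

S_W, S_I, S_O = 100.0, 0.05, 10.0

W = [0.31, -0.27, 0.12]            # float weights
I_Q = [40, -25, 90]                # already-quantized int8 activations
B = 0.6                            # float bias

W_Q = [quant(w, S_W) for w in W]   # int8 weights

acc = sum(i * w for i, w in zip(I_Q, W_Q))  # int32 accumulator

# Unfolded form: (I_Q * W_Q * S_I / S_W + B) * S_O
o1 = (acc * S_I / S_W + B) * S_O

# Folded form: coefficient and scaled bias precomputed at engine build
coef = S_O * S_I / S_W
bias = B * S_O
o2 = acc * coef + bias

print(abs(o1 - o2) < 1e-9)  # identical up to float rounding
```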
iQIYI discovered a bug in TensorRT 7 where bias scaling was incorrectly applied; the issue was fixed in TensorRT 8. To further improve accuracy, iQIYI embedded TensorRT kernels into the PyTorch quantization‑aware training (QAT) loop, ensuring that the training and inference kernels are identical.
Additionally, dynamic output‑scale computation was introduced: instead of using a static calibration scale, the maximum value of the intermediate float16 output is measured at runtime, and a new scale factor is derived for the subsequent int8 conversion.
Figure 5: Overall int8 precision‑enhancement pipeline.
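The dynamic-scale idea can be sketched as follows (plain Python, illustrative rather than iQIYI's production kernel): the peak magnitude of the fp16 intermediate output is measured at runtime, and the int8 scale for the next layer is derived from it so that the observed range maps onto the full int8 range:

```python
def dynamic_scale(fp16_out, qmax=127):
    """Derive a scale from the runtime peak magnitude of the output."""
    peak = max(abs(v) for v in fp16_out)
    return qmax / peak if peak > 0 else 1.0

def to_int8(fp16_out):
    """Quantize with a per-call dynamic scale instead of a static one."""
    s = dynamic_scale(fp16_out)
    q = [max(-128, min(127, round(v * s))) for v in fp16_out]
    return q, s

vals = [0.02, -1.6, 0.75, 1.1]   # mock fp16 intermediate output
q, s = to_int8(vals)
print(q)  # the largest-magnitude value maps to +/-127
```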
Performance Results
Through the combined “Split”, “Merge”, and additional optimizations (e.g., reducing redundant memory accesses in DCN, fusing leaky ReLU), the EDVR model achieved:
- ~150 ms per-frame latency gain from operator fusion;
- 380 ms per frame at fp16 precision;
- 180 ms per 1080p frame with int8 inference.
Figure 6: Step‑wise performance improvements for EDVR.
Outlook
The ten‑fold speedup demonstrates the potential of deep‑customized TensorRT and int8 quantization for video super‑resolution. Future work includes leveraging newer Nvidia architectures (structured sparsity, ultra‑low‑precision networks) and automating the optimization pipeline.
iQIYI Technical Product Team