
Optimizing Video Inference Services for High GPU Utilization in AI Applications

By moving decoding, color conversion, preprocessing, inference, and re‑encoding entirely onto the GPU and enabling batch processing with flexible Python scripts, iQIYI’s video‑image enhancement service achieved ten‑fold throughput, over 90 % GPU utilization, and dramatically lower resource use, accelerating AI video inference deployment.

iQIYI Technical Product Team

In the iQIYI video image enhancement project, a video inference service was launched that increased processing speed tenfold while keeping GPU utilization above 90 %. This article shares the solution to help practitioners facing similar low‑utilization issues.

Background: Video inference is in high demand but difficult to optimize. Deep‑learning models are deployed faster than optimization work can keep up, leading to low cluster utilization. Image‑related business accounts for the largest share of iQIYI's AI inference services, and video inference falls into two major categories: classification (e.g., content moderation, copyright checking, face recognition) and transformation (e.g., image enhancement, super‑resolution).

Transformation services have two characteristics: (1) much higher computational demand than classification because every frame must be processed at full resolution; (2) most algorithms require manual integration of deep‑learning models with traditional CV operations.

These characteristics cause two main problems when launching video inference services: insufficient performance optimization and the need to tightly couple inference with codec functions, making the launch cycle dependent on platform engineers.

Existing Solution: The common approach is to split a video into frames, call an image inference service per frame, and then reassemble the results. In this project, the encoding team instead modified FFmpeg to perform decode → inference → encode in a single pipeline, using a custom filter for pre‑ and post‑processing. GPU utilization nonetheless remained low, because CPU decoding could not keep pace with GPU inference.

Understanding GPU utilization: the metric measures the percentage of time during which kernels are executing on the GPU. One way to raise it is to increase the batch size (e.g., from 1 to 16 images), which can improve throughput 2–4×.
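The batching idea can be sketched as follows. This is a minimal CPU illustration, not the service's actual code: `run_model` is a hypothetical stand‑in for a real inference call, and the point is that grouping frames into batches amortizes per‑call overhead (kernel launches, synchronization) across many frames.

```python
import numpy as np

def run_model(batch):
    """Hypothetical stand-in for a real inference call; here it simply
    inverts pixel intensities. In practice, the per-call overhead
    (kernel launch, host<->device sync) is what batching amortizes."""
    return 255 - batch

def infer_batched(frames, batch_size=16):
    """Group decoded frames into fixed-size batches before inference.
    Larger batches mean fewer model calls for the same number of frames."""
    outputs = []
    for i in range(0, len(frames), batch_size):
        batch = np.stack(frames[i:i + batch_size])  # shape (N, H, W, C)
        outputs.extend(run_model(batch))
    return outputs

frames = [np.zeros((4, 4, 3), dtype=np.uint8) for _ in range(40)]
results = infer_batched(frames, batch_size=16)  # 3 model calls instead of 40
```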

Why DeepStream Was Not Chosen: Although DeepStream keeps all data on the GPU and delivers near‑theoretical throughput, it has drawbacks: custom plugins are required for unsupported operations, integrating complex pre‑/post‑processing is difficult, and its color‑space conversion introduces rounding errors that affect model results.

Proposed Optimization Scheme:

All operations (decode, color conversion, pre‑processing, inference, post‑processing, re‑encoding) are performed on the GPU, supporting batch inference.

The entire pre‑/post‑processing and inference flow is driven by user‑defined Python scripts, providing flexibility and rapid iteration. The environment includes TensorFlow, PyTorch, CuPy, OpenCV, and supports custom CUDA kernels.
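A user‑defined script in such a setup might look like the sketch below. The hook names (`preprocess`, `postprocess`, `pipeline`) are hypothetical illustrations of the contract, not the service's actual API; in the real service the arrays would be GPU tensors rather than NumPy arrays.

```python
import numpy as np

# Hypothetical user-script contract: the service calls these hooks around
# the model. Frame batches arrive as arrays (GPU tensors in the real service).

def preprocess(frames):
    # Normalize uint8 frames to float32 in [0, 1] for the model.
    return frames.astype(np.float32) / 255.0

def postprocess(outputs):
    # Clip model output and convert back to uint8 for re-encoding.
    return np.rint(np.clip(outputs, 0.0, 1.0) * 255.0).astype(np.uint8)

def pipeline(frames, model):
    return postprocess(model(preprocess(frames)))

# Usage: an identity "model" for illustration.
batch = np.full((2, 4, 4, 3), 128, dtype=np.uint8)
out = pipeline(batch, model=lambda x: x)
```

Because the hooks are plain Python, an algorithm engineer can swap in TensorFlow, PyTorch, CuPy, or OpenCV code, or a custom CUDA kernel, without touching the surrounding service.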

The solution forms a complete pipeline: video download → processing → upload, with parallel services.

The request flow is: (1) client submits a task with video source, target, and callback; (2) the entry service records the task in a database; (3) workers pull tasks, execute the GPU‑based inference pipeline, store results in cloud storage, and invoke the callback.
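The worker side of steps (2)–(3) can be sketched as a simple pull loop. Everything here (the in‑memory queue, `run_pipeline`, the `uploaded` and `callbacks` lists) is a hypothetical placeholder for the real database, GPU pipeline, cloud storage, and HTTP callback.

```python
import queue

# Minimal sketch of the worker loop; names are hypothetical placeholders.
tasks = queue.Queue()
uploaded, callbacks = [], []

def run_pipeline(task):
    # Placeholder for the GPU-based decode -> inference -> encode pipeline.
    return f"enhanced:{task['source']}"

def worker():
    while True:
        try:
            task = tasks.get_nowait()          # pull a pending task
        except queue.Empty:
            return
        result = run_pipeline(task)            # run the inference pipeline
        uploaded.append((task["target"], result))  # store result (cloud storage)
        callbacks.append(task["callback"])         # notify the client

tasks.put({"source": "a.mp4", "target": "a_out.mp4", "callback": "http://cb/1"})
worker()
```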

Additional technical details:

Data transfer between CPU and GPU can dominate latency (≈20 % of total time for 1080p at 2× speed); to avoid this, the pipeline keeps data in GPU memory throughout.

DLPack is used to share tensors between frameworks without copying; a custom TensorFlow OP was written to consume GPU tensors directly.
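The zero‑copy handoff can be illustrated on CPU with NumPy's DLPack support (`np.from_dlpack`, NumPy ≥ 1.23); this is a sketch of the principle, not the service's code. On the GPU, the analogous calls are, for example, `cupy.from_dlpack` and `torch.utils.dlpack`, which share device memory between frameworks without copying.

```python
import numpy as np

# CPU-only illustration of a zero-copy DLPack handoff between two "frameworks"
# (here, two NumPy views). No data is copied: both names refer to one buffer.
a = np.arange(12, dtype=np.float32).reshape(3, 4)
b = np.from_dlpack(a)          # imports a's __dlpack__ capsule, no copy

assert np.shares_memory(a, b)  # same underlying buffer
a[0, 0] = 42.0                 # a write is visible through both views
assert b[0, 0] == 42.0
```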

Short‑Video Inference Service: Extending the solution to short‑video scenarios (e.g., moderation, tagging) required reducing end‑to‑end latency. Re‑configuring the hardware encoder instead of re‑initializing it cut first‑frame acquisition time from ~110 ms to ~15 ms, so that model inference accounted for more than 85 % of total latency.

Results: After tuning, the video image enhancement pipeline achieved a tenfold throughput increase over the original FFmpeg solution while consuming only 10 % of the previous resources. GPU utilization stayed above 90 % on V100 GPUs, which delivered 2.57× the throughput of T4 despite a lower fp16 rating, thanks to higher memory bandwidth.

Conclusion: By executing the entire inference workflow on the GPU, supporting custom scripts, and parallelizing data transfer, the project achieved high computational efficiency and reduced engineering effort. The platformized service now offers a closed loop from development to monitoring, enabling algorithm engineers to deploy video‑inference models quickly and cost‑effectively while alleviating low‑utilization problems.
