Ultrafast Video Attention Prediction with Coupled Knowledge Distillation
This paper presents UVA-Net, a lightweight video attention prediction network trained with a coupled knowledge distillation method. UVA-Net matches the accuracy of 11 state-of-the-art models while occupying only 0.68 MB of storage and running at up to 10,106 FPS on GPU (404 FPS on CPU), a 206x speedup over previous models. The gains come from a MobileNetV2-based CA-Res block and a teacher-student framework that uses low-resolution inputs to sharply cut parameter count and computational cost.
The paper addresses two key challenges in video saliency detection: reducing computational and storage requirements while maintaining processing efficiency, and extracting effective spatiotemporal joint features without accuracy degradation. To tackle these issues, the authors propose a lightweight video saliency detection method using coupled knowledge distillation.
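The coupled distillation idea can be sketched as a soft-target loss: two teachers supervise one student through blended, temperature-softened outputs. The function names, temperature, and blending weight below are illustrative assumptions, not the paper's exact formulation (which operates on saliency maps rather than class logits).

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, spatial_t_logits, temporal_t_logits,
                      temperature=4.0, alpha=0.5):
    """Blend soft targets from a spatial and a temporal teacher, then
    penalize the KL divergence of the student's softened output from
    that blend. alpha and temperature are hypothetical hyperparameters."""
    p_spatial = softmax(spatial_t_logits, temperature)
    p_temporal = softmax(temporal_t_logits, temperature)
    soft_target = alpha * p_spatial + (1.0 - alpha) * p_temporal
    p_student = softmax(student_logits, temperature)
    # KL(target || student), averaged over the batch
    kl = (soft_target * (np.log(soft_target + 1e-12)
                         - np.log(p_student + 1e-12))).sum(axis=-1)
    return kl.mean()
```

When the student reproduces both teachers exactly, the loss is zero; any deviation from the blended target is penalized, which is how the heavy spatial and temporal teachers steer the single lightweight spatiotemporal student.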
The authors introduce a CA-Res block structure based on MobileNetV2, which significantly improves computational efficiency while maintaining accuracy. The coupled knowledge distillation approach uses low-resolution video frames as input to reduce computational load, then employs complex temporal and spatial networks as teacher models to supervise the training of a simpler spatiotemporal student model, dramatically reducing parameter count and storage requirements.
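The summary does not detail the CA-Res block's internals, but a quick parameter count shows why MobileNetV2-style depthwise-separable convolutions, its basis, shrink a model so much; the layer sizes below are illustrative, not taken from the paper.

```python
def conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k=3):
    """Weights in a depthwise-separable conv: one k x k filter per
    input channel, followed by a 1 x 1 pointwise projection."""
    return c_in * k * k + c_in * c_out

standard = conv_params(64, 64)               # 36,864 weights
separable = depthwise_separable_params(64, 64)  # 4,672 weights
ratio = standard / separable                 # roughly 8x fewer parameters
```

Stacking such blocks compounds the savings, which is consistent with the sub-megabyte model sizes reported for UVA-Net.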
Experimental results on the AVS1K dataset show that UVA-DVA-64 matches high-performance models with only 2.73M parameters at 404.3 FPS, while UVA-DVA-32, though slightly less accurate, needs only 0.68M parameters and reaches 10,106 FPS.
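For intuition, the reported throughput figures translate into per-frame latency budgets as follows (simple arithmetic on the paper's numbers, nothing more):

```python
def frame_latency_ms(fps):
    """Per-frame latency in milliseconds implied by a throughput figure."""
    return 1000.0 / fps

gpu_ms = frame_latency_ms(10106)   # about 0.099 ms per frame (UVA-DVA-32)
cpu_ms = frame_latency_ms(404.3)   # about 2.47 ms per frame (UVA-DVA-64)
```

Both budgets sit far below the ~33 ms available per frame at 30 FPS video, which is what makes the method viable for attention prediction alongside other per-frame processing.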
The proposed ultrafast video saliency detection algorithm achieves accuracy comparable to 11 state-of-the-art methods while addressing two persistent issues: limited model generalization and the difficulty of combining temporal and spatial cues. The technology has been deployed in iQIYI products such as image-based drama search and intelligent video creation, where saliency-based ROI detection significantly aids image and video understanding.
iQIYI Technical Product Team