How Temporal Residual Modeling Boosts Video Super‑Resolution Performance
This article introduces a novel video super‑resolution framework that unifies low‑ and high‑resolution temporal modeling using adjacent‑frame residual maps, achieving state‑of‑the‑art results on multiple benchmarks while maintaining high speed and flexibility.
Background
Super‑resolution is a classic computer‑vision technique that maps low‑resolution images to high‑resolution ones. With deep learning, convolutional networks have achieved remarkable results for image super‑resolution, prompting research into the more challenging video super‑resolution task, which requires effective temporal modeling to exploit complementary information across frames.
Problems with Existing Methods
Current temporal‑modeling approaches fall into two categories: (1) flow‑based, deformable‑convolution or 3D‑convolution methods that explicitly or implicitly model frame‑to‑frame dynamics, and (2) recurrent hidden‑state accumulation methods that aggregate features over time. Bidirectional recurrent networks improve information balance but suffer from high computational cost and difficulty integrating into causal (real‑time) systems. Moreover, existing frameworks lack a unified strategy for handling both low‑resolution (LR) and high‑resolution (HR) temporal information.
Proposed ETDM Framework
We propose ETDM, a video super‑resolution framework that uses temporal residual maps between adjacent frames to unify LR and HR temporal modeling. In the LR space, the residual map distinguishes low‑change (LV) and high‑change (HV) regions, allowing the network to treat them differently. In the HR space, the residual map acts as a bridge that propagates predictions across arbitrary past and future frames.
ETDM adopts a unidirectional recurrent convolutional network. At each time step the network receives two inputs: (i) a short LR sequence (previous, current, and next frames) and (ii) the HR predictions carried over from the previous step. Three residual heads (Spatial-Residual, Past-Residual, and Future-Residual) jointly predict the current super-resolved frame and the temporal residual maps for the past and future directions.
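The data flow of one recurrent step can be sketched as follows. This is a schematic in numpy, not the paper's implementation: the function name `etdm_step`, the `heads` callable (standing in for the whole trained network), and the nearest-neighbour upsampler (standing in for a learned sub-pixel upsampler) are all illustrative assumptions.

```python
import numpy as np

def upsample_nearest(x, scale=4):
    """Naive nearest-neighbour upsampling; a stand-in for the learned upsampler."""
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

def etdm_step(lr_prev, lr_cur, lr_next, hr_prev_pred, heads, scale=4):
    """One recurrent step (schematic). `heads` is any callable that returns the
    three residual maps (spatial, past, future); the network itself is abstracted."""
    spatial_res, past_res, future_res = heads(lr_prev, lr_cur, lr_next, hr_prev_pred)
    sr_cur = upsample_nearest(lr_cur, scale) + spatial_res   # Spatial-Residual head
    hr_toward_past = sr_cur + past_res      # Past-Residual head: refine past estimates
    hr_toward_future = sr_cur + future_res  # Future-Residual head: seed the next step
    return sr_cur, hr_toward_past, hr_toward_future
```

Note how the super-resolved frame is always expressed as a base upsample plus predicted residuals, and how the past/future outputs are again residual offsets from the current result; this is what lets the same mechanism serve both refinement and propagation.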
Temporal Residual Modeling
LV regions correspond to small motions, while HV regions capture larger motions; the HV branch uses larger receptive fields to capture broader motion cues. The residual maps serve as bridges that transfer information forward and backward, enabling a temporal bidirectional optimization mechanism that refines the current frame with complementary data from other time steps.
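A minimal sketch of the LV/HV partition: take the absolute residual between adjacent LR frames and threshold it by magnitude. The function name `temporal_residual_masks` and the threshold value are hypothetical; the paper's actual partition rule may differ.

```python
import numpy as np

def temporal_residual_masks(lr_prev, lr_cur, thresh=0.05):
    """Split an adjacent-frame residual map into low-change (LV) and
    high-change (HV) regions by magnitude (threshold is an illustrative choice)."""
    residual = np.abs(lr_cur.astype(np.float64) - lr_prev.astype(np.float64))
    hv = residual > thresh   # large temporal change: bigger motion
    lv = ~hv                 # small temporal change: near-static content
    return residual, lv, hv
```

The two masks partition every pixel, so each region can be routed to a branch with an appropriate receptive field.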
Memory Mechanism
A memory of length N stores the super-resolved estimates of the N past and N future frames. Because adjacent-frame residual maps can be accumulated, information propagates across any temporal distance; as each new frame arrives, the stored estimates are refined by adding the newly predicted residual maps.
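The accumulation idea can be verified with a toy example. Assuming each residual is exactly the difference between consecutive frames, summing a chain of residuals carries an estimate across any temporal distance (in ETDM the residuals are predicted, so the memory holds refined estimates rather than exact reconstructions). The helper name `propagate` is illustrative.

```python
import numpy as np

def propagate(hr_est, residual_chain):
    """Carry an HR estimate across an arbitrary temporal distance by
    accumulating the adjacent-frame residual maps along the way."""
    out = hr_est.astype(np.float64)
    for r in residual_chain:
        out = out + r
    return out
```

With ideal residuals, frame t plus the residuals for steps t..t+k-1 reproduces frame t+k exactly, which is why a fixed-length memory suffices for arbitrary-distance propagation.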
Experiments
We train on the Vimeo‑90K dataset and evaluate on Vid4, SPMCS, UDM10, and REDS4. ETDM achieves state‑of‑the‑art PSNR and SSIM scores, surpassing methods such as EDVR, GOVSR, and BasicVSR while offering a better speed‑accuracy trade‑off. Qualitative comparisons show richer details and more accurate structures.
Conclusion
By unifying temporal modeling with frame‑wise residual maps, ETDM efficiently exploits complementary information in both LR and HR domains, providing flexible propagation, lower computational cost, and superior performance across multiple video super‑resolution benchmarks.
Kuaishou Audio & Video Technology