NTIRE 2025 UGC Video Enhancement Challenge: Methods and Results
The NTIRE 2025 challenge introduced a new benchmark for user‑generated content video enhancement, detailing a 150‑video dataset, a pairwise subjective evaluation using the Bradley‑Terry model, hardware specifications, and the diverse multi‑stage deep‑learning methods and results of participating teams.
Introduction
With the rapid growth of short‑video platforms such as Kuaishou and TikTok, user‑generated content (UGC) videos are ubiquitous but often suffer from motion blur, noise, color fading, low resolution, and compression artifacts. Enhancing the visual appeal of these videos is crucial for viewer engagement, motivating the creation of a dedicated UGC video enhancement benchmark.
Challenge Overview
The NTIRE 2025 UGC video enhancement challenge provided a dataset of 150 videos split into training (40 videos), three validation sets (20 videos each), and a test set (60 videos, of which 30 are public and 30 are private). The videos were collected from real short‑video platforms and from users on the Yandex Tasks crowdsourcing platform, covering diverse content types and shooting conditions. To ensure variety, the combined set was clustered into 20 groups based on VQMT metrics (blur, noise, brightness flicker, blockiness, spatial and temporal information), and representative videos were manually selected for each split.
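The clustering step can be reproduced in spirit with standard tooling. Below is a minimal sketch, assuming each video is summarized by a vector of the listed VQMT metrics and using k-means with 20 clusters; the actual pipeline selected representatives manually, so the automatic picking helper here is purely illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

METRICS = ["blur", "noise", "brightness_flicker", "blockiness", "si", "ti"]  # assumed names

def cluster_videos(features: np.ndarray, n_clusters: int = 20, seed: int = 0):
    """features: (num_videos, len(METRICS)) matrix of per-video metric values."""
    scaled = StandardScaler().fit_transform(features)            # put metrics on a common scale
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(scaled)
    # Illustrative automatic pick: the video closest to each cluster centroid
    # (the organizers report choosing representatives manually).
    reps = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(scaled[idx] - km.cluster_centers_[c], axis=1)
        reps.append(int(idx[np.argmin(dists)]))
    return km.labels_, reps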
Evaluation Protocol
Subjective evaluation was conducted on Subjectify.us. Crowd workers performed pairwise comparisons, choosing the more acceptable video in each pair or indicating a tie. Each worker evaluated 20 pairs, including two verification pairs with known answers. Over 8,000 workers contributed, and each pair received 10 votes after random balancing. Scores were aggregated with the Bradley-Terry model, which estimates the probability that video i is preferred over video j. Maximum-likelihood estimation (MLE) provided the score estimates, and 95% confidence intervals were derived from the asymptotic normality of the MLE and the Fisher information matrix; the critical value for a two-sided 95% interval is z ≈ 1.96.
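For concreteness, here is a minimal sketch of Bradley-Terry aggregation with 95% confidence intervals, assuming a matrix of pairwise win counts (ties split as half-wins) and a diagonal approximation of the Fisher information; the organizers' exact implementation may differ in details.

import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200):
    """wins[i, j] = number of times video i was preferred over video j.
    Returns log-scores and 95% bounds (assumes every video wins at least once)."""
    n = wins.shape[0]
    comparisons = wins + wins.T                      # total comparisons per pair
    p = np.ones(n)                                   # strengths, pi_i = exp(s_i)
    for _ in range(iters):                           # Zermelo / minorize-maximize updates
        for i in range(n):
            denom = np.sum(comparisons[i] / (p[i] + p))
            p[i] = max(wins[i].sum(), 1e-12) / denom
        p /= p.sum()                                 # fix the arbitrary scale
    s = np.log(p)
    # Fisher information of the log-scores: I_ii = sum_j n_ij * q_ij * (1 - q_ij)
    q = p[:, None] / (p[:, None] + p[None, :])
    info = comparisons * q * (1 - q)
    np.fill_diagonal(info, 0.0)
    var = 1.0 / np.maximum(info.sum(axis=1), 1e-12)  # diagonal approximation of the inverse
    half = 1.96 * np.sqrt(var)                       # z ≈ 1.96 for a two-sided 95% interval
    return s, s - half, s + half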
Hardware and Baselines
All submissions were evaluated on identical hardware: 2 × Intel Xeon Silver 4216 CPUs @ 2.10 GHz, 188 GB RAM, and an NVIDIA TITAN RTX GPU. For consistency, videos were re‑encoded with FFmpeg at the standard short‑video bitrate of 3000 kbps using the command:
ffmpeg -i INPUT_PATH -c:v libx265 -preset fast -b:v 3000k -pix_fmt yuv420p -an OUTPUT_PATH
Results
The stage-1 validation phase received 17 submissions, stage 2 received 19, stage 3 received 20, and the final phase accepted 7 valid submissions. Table 1 lists the final subjective scores and rankings for the full set (150 videos) as well as the public (120 videos) and private (30 videos) subsets, illustrating that the rankings are consistent across dataset fragments. Figures 1 and 2 present overall performance and the preference matrix, respectively.
Team Methods
ShannonLab
Proposed a four‑stage progressive training framework (TRestore). Stage 1 applies a CLUT with adaptive LUT prediction for color enhancement, offering better inference speed and robustness than alternatives. Stage 2 uses a lightweight U‑Net to remove noise and compression artifacts. Stage 3 builds on BasicVSR++ to stabilize temporal results, ensuring good performance after re‑encoding to 3000 kbps. Stage 4 adapts SwinIR (modified to a U‑Net‑like structure) for final frame refinement, achieving faster inference with the same parameter count. Residual connections link stages 2‑4 to prevent degradation. Training details:
Stage 1: L1 loss, 600k iterations, learning rate 1e-4, batch size 32, patch size 720.
Stage 2: L2 loss, 600k iterations, learning rate 1e-4, batch size 32, patch size 640.
Stage 3: 120k iterations, learning rate 2e-4, batch size 8, patch size 512, 30 frames per clip.
Stage 4: 120k iterations, learning rate 1e-5, batch size 8, patch size 512, 30 frames per clip.
During inference, the color residual from CLUT is amplified by a factor of 1.2, and feature interpolation is performed between consecutive 30‑frame clips before the up‑sampling layer of BasicVSR++.
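The two inference-time tricks can be pictured with the following sketch, assuming frame and feature tensors shaped (T, C, H, W); clut_model, the overlap length, and the linear blending weights are illustrative assumptions, not the team's actual code.

import torch

def amplify_color_residual(frames: torch.Tensor, clut_model, gain: float = 1.2) -> torch.Tensor:
    """Scale the stage-1 color correction: output = input + gain * (CLUT(input) - input)."""
    corrected = clut_model(frames)
    return frames + gain * (corrected - frames)

def blend_clip_features(feat_prev: torch.Tensor, feat_next: torch.Tensor, overlap: int) -> torch.Tensor:
    """Linearly interpolate features of consecutive 30-frame clips over an assumed overlap,
    applied before the BasicVSR++ up-sampling layer to avoid temporal seams."""
    w = torch.linspace(0.0, 1.0, overlap).view(-1, 1, 1, 1)   # per-frame ramp weights
    return (1 - w) * feat_prev[-overlap:] + w * feat_next[:overlap]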
DeepView
Adopted a two-stage cascade: degradation restoration followed by texture refinement. Stage 1 employs a U-Net with skip connections, spatial attention, and 26 convolutional layers to correct color shifts, uneven illumination, and compression artifacts. Stage 2 uses 15 cascaded residual blocks with dense connections and channel attention to recover high-frequency details and realistic textures. Training used LDV3, REDS, and a large collection of 4K Pexels videos. Stage 1 training: mixed L1 + perceptual loss, more than 600k iterations, batch size 32, 512×512 patches. Stage 2 training: L2, perceptual, LPIPS, and GAN losses, more than 300k iterations plus 50k fine-tuning iterations, batch size 16, 512×512 patches. Data augmentation includes spatial flips, rotations, and temporal jitter (frame dropping and shuffling).
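As an illustration of the stage-2 building block, here is a minimal residual block with channel attention of the kind stacked 15 times in the refinement network; the channel width, reduction ratio, and the omission of the dense connections are assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                   # global spatial pooling
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)                                        # re-weight channels

class ResidualCABlock(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), ChannelAttention(ch),
        )

    def forward(self, x):
        return x + self.body(x)                                        # residual connection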
Nobody
Implemented a two‑stage pipeline. Stage 1 performs color enhancement using a 3D‑LUT and several machine‑learning operators. Stage 2 applies two Real‑ESRGAN‑based GAN models for artifact removal, denoising, de‑blurring, and texture enhancement. An optical‑flow‑based end‑frame compensation addresses shaking in handheld UGC videos.
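A minimal sketch of the 3D-LUT color step, assuming a lookup table of shape (3, D, D, D) indexed as lut[:, b, g, r] and frames in [0, 1]; it shows only the trilinear lookup, not the team's LUT prediction or GAN stages.

import torch
import torch.nn.functional as F

def apply_3d_lut(frame: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    """frame: (3, H, W) RGB in [0, 1]; lut: (3, D, D, D) mapping RGB -> RGB."""
    # grid_sample expects normalized (x, y, z) coordinates in [-1, 1];
    # here x, y, z index the R, G, B axes of the table respectively.
    grid = frame.permute(1, 2, 0)[None, None] * 2 - 1            # (1, 1, H, W, 3)
    out = F.grid_sample(lut[None], grid, mode="bilinear",        # trilinear for 5-D inputs
                        padding_mode="border", align_corners=True)
    return out[0, :, 0]                                           # back to (3, H, W)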
ChouPiJiang
Based on Real-ESRGAN (RRDBNet), trained with 70k wild-scene FFHQ images and 200 4K YouTube videos. Utilizes a second-order degradation process (illustrated in Figure 6). Network configuration: channel = 128, growth = 32, blocks = 23.
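The stated configuration maps directly onto the RRDBNet implementation that ships with BasicSR / Real-ESRGAN; the sketch below instantiates it with those numbers, with scale=1 (no upscaling) assumed here since the task is enhancement rather than super-resolution.

from basicsr.archs.rrdbnet_arch import RRDBNet

generator = RRDBNet(
    num_in_ch=3, num_out_ch=3,
    num_feat=128,     # "channel = 128"
    num_grow_ch=32,   # "growth = 32"
    num_block=23,     # "blocks = 23"
    scale=1,          # assumed: output at the input resolution
)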
ByteMM
Uses a two‑stage design. Stage 1 employs a modified RealBasicVSR for signal recovery and artifact removal, trained on synthetic HQ/LQ pairs generated with varied degradations. Stage 2 applies dark‑channel and bright‑channel priors for brightness and color enhancement; this stage is non‑learned and manually tuned to preserve skin tones.
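The dark- and bright-channel priors themselves are simple to compute; a minimal sketch follows, assuming a 15-pixel patch. How the priors are mapped to the final brightness and color adjustment is hand-tuned by the team and is not reproduced here.

import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def dark_channel(img: np.ndarray, patch: int = 15) -> np.ndarray:
    """img: (H, W, 3) in [0, 1]; per-pixel minimum over channels, then a local minimum filter."""
    return minimum_filter(img.min(axis=2), size=patch)

def bright_channel(img: np.ndarray, patch: int = 15) -> np.ndarray:
    """Per-pixel maximum over channels, then a local maximum filter."""
    return maximum_filter(img.max(axis=2), size=patch)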
TACO SR
Inspired by recent diffusion‑model advances, proposes PiNAFusion‑Net. Stage 1 features a dual‑branch architecture (fidelity and perception branches) built on an adjustable super‑resolution network, producing complementary outputs that are fused. Stage 2 extracts fine‑grained details with a filter and a trainable module to generate the final enhanced frames. Implemented in PyTorch and optimized with AdamW (initial LR 1e‑5).
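The description leaves the fusion rule unspecified; a minimal sketch, assuming a simple convex combination of the two branch outputs, plus the stated AdamW setup, is shown below (PiNAFusionNet is a hypothetical constructor, not the team's code).

import torch

def fuse_branches(fidelity_out: torch.Tensor, perception_out: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend the fidelity and perception branch outputs; alpha trades fidelity against perceptual quality."""
    return alpha * fidelity_out + (1 - alpha) * perception_out

# Optimizer setup as stated in the text: AdamW with an initial learning rate of 1e-5.
# model = PiNAFusionNet(...)                                   # hypothetical constructor
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)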
Conclusion
The NTIRE 2025 UGC video enhancement challenge established a comprehensive benchmark and robust subjective evaluation pipeline, spurring a variety of multi‑stage deep‑learning solutions that balance perceptual quality, computational efficiency, and practical applicability for real‑world user‑generated video content.