NTIRE 2025 UGC Video Enhancement Challenge: Methods and Results
The NTIRE 2025 challenge introduced a new benchmark for user‑generated content video enhancement, detailing a 150‑video dataset, a pairwise subjective evaluation using the Bradley‑Terry model, hardware specifications, and the diverse multi‑stage deep‑learning methods and results of participating teams.
Introduction
With the rapid growth of short‑video platforms such as Kuaishou and TikTok, user‑generated content (UGC) videos are ubiquitous but often suffer from motion blur, noise, color fading, low resolution, and compression artifacts. Enhancing the visual appeal of these videos is crucial for viewer engagement, motivating the creation of a dedicated UGC video enhancement benchmark.
Challenge Overview
The NTIRE 2025 UGC video enhancement challenge provided a dataset of 150 videos split into training (40 videos), three validation sets (20 videos each), and a test set (60 videos, of which 30 are public and 30 are private). The videos were collected from real short‑video platforms and from users on the Yandex Tasks crowdsourcing platform, covering diverse content types and shooting conditions. To ensure variety, the combined set was clustered into 20 groups based on VQMT metrics (blur, noise, brightness flicker, blockiness, spatial and temporal information), and representative videos were manually selected for each split.
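The clustering step can be reproduced in spirit with standard tooling. Below is a minimal sketch, assuming each video is summarized by a vector of the listed VQMT metrics and using k-means with 20 clusters; the actual pipeline selected representatives manually, so the automatic picking helper here is purely illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

METRICS = ["blur", "noise", "brightness_flicker", "blockiness", "si", "ti"]  # assumed names

def cluster_videos(features: np.ndarray, n_clusters: int = 20, seed: int = 0):
    """features: (num_videos, len(METRICS)) matrix of per-video metric values."""
    scaled = StandardScaler().fit_transform(features)            # put metrics on a common scale
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(scaled)
    # Illustrative automatic pick: the video closest to each cluster centroid
    # (the organizers report choosing representatives manually).
    reps = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(scaled[idx] - km.cluster_centers_[c], axis=1)
        reps.append(int(idx[np.argmin(dists)]))
    return km.labels_, reps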
Evaluation Protocol
Subjective evaluation was conducted on Subjectify.us. Crowd workers performed pairwise comparisons, choosing the more acceptable video in each pair or indicating a tie. Each worker evaluated 20 pairs, including two verification pairs with known answers. Over 8,000 workers contributed, and each pair received 10 votes after random balancing. Scores were aggregated with the Bradley-Terry model, which estimates the probability that video i is preferred over video j. Maximum-likelihood estimation (MLE) provided the score estimates, and 95% confidence intervals were derived from the asymptotic normality of the MLE and the Fisher information matrix; the critical value for a two-sided 95% interval is z ≈ 1.96.
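For concreteness, here is a minimal sketch of Bradley-Terry aggregation with 95% confidence intervals, assuming a matrix of pairwise win counts (ties split as half-wins) and a diagonal approximation of the Fisher information; the organizers' exact implementation may differ in details.

import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200):
    """wins[i, j] = number of times video i was preferred over video j.
    Returns log-scores and 95% bounds (assumes every video wins at least once)."""
    n = wins.shape[0]
    comparisons = wins + wins.T                      # total comparisons per pair
    p = np.ones(n)                                   # strengths, pi_i = exp(s_i)
    for _ in range(iters):                           # Zermelo / minorize-maximize updates
        for i in range(n):
            denom = np.sum(comparisons[i] / (p[i] + p))
            p[i] = max(wins[i].sum(), 1e-12) / denom
        p /= p.sum()                                 # fix the arbitrary scale
    s = np.log(p)
    # Fisher information of the log-scores: I_ii = sum_j n_ij * q_ij * (1 - q_ij)
    q = p[:, None] / (p[:, None] + p[None, :])
    info = comparisons * q * (1 - q)
    np.fill_diagonal(info, 0.0)
    var = 1.0 / np.maximum(info.sum(axis=1), 1e-12)  # diagonal approximation of the inverse
    half = 1.96 * np.sqrt(var)                       # z ≈ 1.96 for a two-sided 95% interval
    return s, s - half, s + half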
Hardware and Baselines
All submissions were evaluated on identical hardware: 2 × Intel Xeon Silver 4216 CPUs @ 2.10 GHz, 188 GB RAM, and an NVIDIA TITAN RTX GPU. For consistency, videos were re‑encoded with FFmpeg at the standard short‑video bitrate of 3000 kbps using the command:
ffmpeg -i INPUT_PATH -c:v libx265 -preset fast -b:v 3000k -pix_fmt yuv420p -an OUTPUT_PATH
Results
The stage-1 validation phase received 17 submissions, stage 2 received 19, stage 3 received 20, and the final phase accepted 7 valid submissions. Table 1 lists the final subjective scores and rankings for the full set (150 videos) as well as the public (120 videos) and private (30 videos) subsets, illustrating that the rankings are consistent across dataset fragments. Figures 1 and 2 present overall performance and the preference matrix, respectively.
Team Methods
ShannonLab
Proposed a four‑stage progressive training framework (TRestore). Stage 1 applies a CLUT with adaptive LUT prediction for color enhancement, offering better inference speed and robustness than alternatives. Stage 2 uses a lightweight U‑Net to remove noise and compression artifacts. Stage 3 builds on BasicVSR++ to stabilize temporal results, ensuring good performance after re‑encoding to 3000 kbps. Stage 4 adapts SwinIR (modified to a U‑Net‑like structure) for final frame refinement, achieving faster inference with the same parameter count. Residual connections link stages 2‑4 to prevent degradation. Training details:
Stage 1: L1 loss, 600k iterations, learning rate 1e-4, batch size 32, patch size 720.
Stage 2: L2 loss, 600k iterations, learning rate 1e-4, batch size 32, patch size 640.
Stage 3: 120k iterations, learning rate 2e-4, batch size 8, patch size 512, 30 frames per clip.
Stage 4: 120k iterations, learning rate 1e-5, batch size 8, patch size 512, 30 frames per clip.
During inference, the color residual from CLUT is amplified by a factor of 1.2, and feature interpolation is performed between consecutive 30‑frame clips before the up‑sampling layer of BasicVSR++.
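The two inference-time tricks can be pictured with the following sketch, assuming frame and feature tensors shaped (T, C, H, W); clut_model, the overlap length, and the linear blending weights are illustrative assumptions, not the team's actual code.

import torch

def amplify_color_residual(frames: torch.Tensor, clut_model, gain: float = 1.2) -> torch.Tensor:
    """Scale the stage-1 color correction: output = input + gain * (CLUT(input) - input)."""
    corrected = clut_model(frames)
    return frames + gain * (corrected - frames)

def blend_clip_features(feat_prev: torch.Tensor, feat_next: torch.Tensor, overlap: int) -> torch.Tensor:
    """Linearly interpolate features of consecutive 30-frame clips over an assumed overlap,
    applied before the BasicVSR++ up-sampling layer to avoid temporal seams."""
    w = torch.linspace(0.0, 1.0, overlap).view(-1, 1, 1, 1)   # per-frame ramp weights
    return (1 - w) * feat_prev[-overlap:] + w * feat_next[:overlap]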
DeepView
Adopted a two-stage cascade: degradation restoration followed by texture refinement. Stage 1 employs a U-Net with skip connections, spatial attention, and 26 convolutional layers to correct color shifts, uneven illumination, and compression artifacts. Stage 2 uses 15 cascaded residual blocks with dense connections and channel attention to recover high-frequency details and realistic textures. Training used LDV3, REDS, and a large collection of 4K Pexels videos. Stage 1 training: mixed L1 + perceptual loss, more than 600k iterations, batch size 32, 512×512 patches. Stage 2 training: L2, perceptual, LPIPS, and GAN losses, more than 300k iterations plus 50k fine-tuning iterations, batch size 16, 512×512 patches. Data augmentation includes spatial flips, rotations, and temporal jitter (frame dropping and shuffling).
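As an illustration of the stage-2 building block, here is a minimal residual block with channel attention of the kind stacked 15 times in the refinement network; the channel width, reduction ratio, and the omission of the dense connections are assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                   # global spatial pooling
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)                                        # re-weight channels

class ResidualCABlock(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), ChannelAttention(ch),
        )

    def forward(self, x):
        return x + self.body(x)                                        # residual connection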
Nobody
Implemented a two‑stage pipeline. Stage 1 performs color enhancement using a 3D‑LUT and several machine‑learning operators. Stage 2 applies two Real‑ESRGAN‑based GAN models for artifact removal, denoising, de‑blurring, and texture enhancement. An optical‑flow‑based end‑frame compensation addresses shaking in handheld UGC videos.
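A minimal sketch of the 3D-LUT color step, assuming a lookup table of shape (3, D, D, D) indexed as lut[:, b, g, r] and frames in [0, 1]; it shows only the trilinear lookup, not the team's LUT prediction or GAN stages.

import torch
import torch.nn.functional as F

def apply_3d_lut(frame: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    """frame: (3, H, W) RGB in [0, 1]; lut: (3, D, D, D) mapping RGB -> RGB."""
    # grid_sample expects normalized (x, y, z) coordinates in [-1, 1];
    # here x, y, z index the R, G, B axes of the table respectively.
    grid = frame.permute(1, 2, 0)[None, None] * 2 - 1            # (1, 1, H, W, 3)
    out = F.grid_sample(lut[None], grid, mode="bilinear",        # trilinear for 5-D inputs
                        padding_mode="border", align_corners=True)
    return out[0, :, 0]                                           # back to (3, H, W)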
ChouPiJiang
Based on Real-ESRGAN (RRDBNet), trained with 70k wild-scene FFHQ images and 200 4K YouTube videos. Utilizes a second-order degradation process (illustrated in Figure 6). Network configuration: channel = 128, growth = 32, blocks = 23.
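The stated configuration maps directly onto the RRDBNet implementation that ships with BasicSR / Real-ESRGAN; the sketch below instantiates it with those numbers, with scale=1 (no upscaling) assumed here since the task is enhancement rather than super-resolution.

from basicsr.archs.rrdbnet_arch import RRDBNet

generator = RRDBNet(
    num_in_ch=3, num_out_ch=3,
    num_feat=128,     # "channel = 128"
    num_grow_ch=32,   # "growth = 32"
    num_block=23,     # "blocks = 23"
    scale=1,          # assumed: output at the input resolution
)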
ByteMM
Uses a two‑stage design. Stage 1 employs a modified RealBasicVSR for signal recovery and artifact removal, trained on synthetic HQ/LQ pairs generated with varied degradations. Stage 2 applies dark‑channel and bright‑channel priors for brightness and color enhancement; this stage is non‑learned and manually tuned to preserve skin tones.
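The dark- and bright-channel priors themselves are simple to compute; a minimal sketch follows, assuming a 15-pixel patch. How the priors are mapped to the final brightness and color adjustment is hand-tuned by the team and is not reproduced here.

import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def dark_channel(img: np.ndarray, patch: int = 15) -> np.ndarray:
    """img: (H, W, 3) in [0, 1]; per-pixel minimum over channels, then a local minimum filter."""
    return minimum_filter(img.min(axis=2), size=patch)

def bright_channel(img: np.ndarray, patch: int = 15) -> np.ndarray:
    """Per-pixel maximum over channels, then a local maximum filter."""
    return maximum_filter(img.max(axis=2), size=patch)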
TACO SR
Inspired by recent diffusion‑model advances, proposes PiNAFusion‑Net. Stage 1 features a dual‑branch architecture (fidelity and perception branches) built on an adjustable super‑resolution network, producing complementary outputs that are fused. Stage 2 extracts fine‑grained details with a filter and a trainable module to generate the final enhanced frames. Implemented in PyTorch and optimized with AdamW (initial LR 1e‑5).
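The description leaves the fusion rule unspecified; a minimal sketch, assuming a simple convex combination of the two branch outputs, plus the stated AdamW setup, is shown below (PiNAFusionNet is a hypothetical constructor, not the team's code).

import torch

def fuse_branches(fidelity_out: torch.Tensor, perception_out: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend the fidelity and perception branch outputs; alpha trades fidelity against perceptual quality."""
    return alpha * fidelity_out + (1 - alpha) * perception_out

# Optimizer setup as stated in the text: AdamW with an initial learning rate of 1e-5.
# model = PiNAFusionNet(...)                                   # hypothetical constructor
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)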
Conclusion
The NTIRE 2025 UGC video enhancement challenge established a comprehensive benchmark and robust subjective evaluation pipeline, spurring a variety of multi‑stage deep‑learning solutions that balance perceptual quality, computational efficiency, and practical applicability for real‑world user‑generated video content.