Bridging Human Perception and Video Motion Generation: VMBench & LD‑RPS
This article introduces VMBench, a perception‑aligned video motion generation benchmark with a five‑dimensional metric suite and meta‑guided prompt generation, and LD‑RPS, a zero‑shot unified image restoration framework using latent diffusion and recurrent posterior sampling, detailing their motivations, innovations, experiments, and future directions.
ICCV is one of the top computer-vision conferences; at this year's edition, held in Hawaii, five papers from the Gaode (Amap) team were accepted.
VMBench introduces the first perception‑aligned video motion generation benchmark, addressing the gap between human perception and existing metrics. It defines a five‑dimensional Perception‑Aligned Motion Metrics (PMM) suite—Common‑sense Alignment Score (CAS), Motion Smoothness Score (MSS), Object Integrity Score (OIS), Perceptible Amplitude Score (PAS), and Temporal Consistency Score (TCS)—and a Meta‑guided Motion Prompt Generation (MMPG) framework covering six motion patterns.
Research Background
Rapid advances in video generation demand accurate motion‑quality assessment. Existing metrics miss human‑perceived smoothness, physical plausibility, and object integrity, and rely on limited prompts.
Paper Highlights
Perception‑Aligned Motion Metrics (PMM) – hierarchical five‑dimensional evaluation inspired by human perception.
Meta‑guided Motion Prompt Generation (MMPG) – extracts subject, place, and action from large video datasets, optimizes prompts with large language models, and validates them via human‑LLM collaboration.
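As a rough illustration of how a five-dimensional suite like PMM could be combined into a single motion score, here is a minimal Python sketch. The dimension abbreviations come from the paper, but the weighting scheme and function names are illustrative assumptions, not VMBench's actual scoring formula.

```python
# Dimension names follow VMBench's PMM; the aggregation below is an
# illustrative weighted mean, NOT the paper's actual scoring formula.
PMM_DIMENSIONS = ("CAS", "MSS", "OIS", "PAS", "TCS")

def aggregate_pmm(scores, weights=None):
    """Combine per-dimension scores (each in [0, 1]) into a weighted mean."""
    if weights is None:
        weights = {d: 1.0 for d in PMM_DIMENSIONS}  # equal weighting by default
    total = sum(weights[d] for d in PMM_DIMENSIONS)
    return sum(scores[d] * weights[d] for d in PMM_DIMENSIONS) / total

video_scores = {"CAS": 0.72, "MSS": 0.85, "OIS": 0.90, "PAS": 0.60, "TCS": 0.88}
print(round(aggregate_pmm(video_scores), 3))  # -> 0.79
```

A weighted mean also makes it easy to emphasize a single dimension (e.g., upweighting CAS, which the ablations below identify as the strongest contributor) without changing the interface.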
Experimental Results
Human‑perception alignment experiments show PMM achieves a Spearman correlation of up to 54.5% with expert scores, outperforming rule‑based and multimodal large‑model baselines. Ablation studies confirm each metric’s contribution, especially CAS.
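Alignment of the kind reported above is typically measured with the Spearman rank correlation between a metric's scores and expert ratings over the same set of videos. Below is a minimal pure-Python sketch (no tie handling); the sample scores are made up for illustration.

```python
# Spearman rank correlation between metric scores and human ratings.
# Simplified: assumes no tied values (ties would need fractional ranks).

def ranks(values):
    """Rank positions (1-based) of each value in ascending order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank + 1
    return r

def spearman(x, y):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1)) over rank differences d."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

metric_scores = [0.79, 0.55, 0.91, 0.40, 0.66]  # hypothetical PMM scores
expert_scores = [4, 2, 5, 1, 3]                 # hypothetical human ratings
print(spearman(metric_scores, expert_scores))   # identical ordering -> 1.0
```

Because Spearman operates on ranks rather than raw values, it rewards a metric that orders videos the way humans do, even if its absolute scale differs from the rating scale.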
Qualitative analysis on six mainstream video models demonstrates diverse strengths and weaknesses, guiding future development.
LD‑RPS: Zero‑Shot Unified Image Restoration
LD‑RPS proposes a latent‑diffusion recurrent posterior sampling framework that restores images without training data. It uses multimodal large language models to generate semantic prompts from the degraded image, a Feature‑Pixel Alignment Module (F‑PAM) to bridge latent and image spaces, and a recurrent posterior sampling strategy for progressive quality improvement.
Paper Highlights
Unsupervised zero‑shot image restoration across multiple degradation types.
F‑PAM aligns diffusion intermediate results with the degraded input.
Recurrent posterior sampling refines the restoration iteratively.
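The recurrent refinement idea can be caricatured as alternating a stochastic perturbation step with a data-fidelity pull toward consistency with the degraded observation. The toy sketch below works on plain vectors with a known scalar degradation, standing in for LD-RPS's latent-diffusion sampler; all names, step sizes, and the degradation model here are assumptions for illustration, not the paper's method.

```python
import random

# Toy stand-in for recurrent posterior sampling: the latent diffusion
# model is replaced by annealed Gaussian perturbation, and the
# degradation is a known uniform attenuation.

def degrade(x, factor=0.5):
    """Assumed degradation operator: uniform intensity attenuation."""
    return [v * factor for v in x]

def posterior_sample(y, rounds=200, step=0.5, seed=0, factor=0.5):
    rng = random.Random(seed)
    x = [0.0] * len(y)  # start from a blank estimate
    for t in range(rounds):
        # "diffusion" step: annealed exploration noise around the estimate
        noise = 0.01 * (1 - t / rounds)
        x = [v + rng.gauss(0, noise) for v in x]
        # data-fidelity step: descend 0.5 * ||degrade(x) - y||^2,
        # pulling the estimate toward consistency with the observation
        resid = [d - yi for d, yi in zip(degrade(x, factor), y)]
        x = [v - step * factor * r for v, r in zip(x, resid)]
    return x

clean = [1.0, -2.0, 3.0]
observed = degrade(clean)
restored = posterior_sample(observed)
print([round(v, 2) for v in restored])  # close to the clean signal
```

The loop structure, not the arithmetic, is the point: each round samples near the current estimate and then corrects it against the degraded input, which is the progressive-refinement pattern the paper describes.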
Experimental Results
LD‑RPS achieves state‑of‑the‑art performance on low‑light enhancement, dehazing, denoising, and colorization benchmarks, often surpassing specialized single‑task methods while requiring only a single degraded image as input.
Conclusion and Outlook
VMBench provides a standardized, perception‑aligned evaluation platform for video motion generation, and LD‑RPS demonstrates a versatile, zero‑shot approach to image restoration, both advancing AI research toward more human‑aligned generative systems.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.
