Bridging Human Perception and Video Motion Generation: VMBench & LD‑RPS
This article introduces VMBench, a perception‑aligned video motion generation benchmark with a five‑dimensional metric suite and meta‑guided prompt generation, and LD‑RPS, a zero‑shot unified image restoration framework using latent diffusion and recurrent posterior sampling, detailing their motivations, innovations, experiments, and future directions.
ICCV is one of the top computer-vision conferences; at this year's edition, held in Hawaii, five papers from the Gaode (Amap) team were accepted.
VMBench introduces the first perception‑aligned video motion generation benchmark, addressing the gap between human perception and existing metrics. It defines a five‑dimensional Perception‑Aligned Motion Metrics (PMM) suite—Common‑sense Alignment Score (CAS), Motion Smoothness Score (MSS), Object Integrity Score (OIS), Perceptible Amplitude Score (PAS), and Temporal Consistency Score (TCS)—and a Meta‑guided Motion Prompt Generation (MMPG) framework covering six motion patterns.
Research Background
Rapid advances in video generation demand accurate motion‑quality assessment. Existing metrics miss human‑perceived smoothness, physical plausibility, and object integrity, and rely on limited prompts.
Paper Highlights
Perception‑Aligned Motion Metrics (PMM) – hierarchical five‑dimensional evaluation inspired by human perception.
Meta‑guided Motion Prompt Generation (MMPG) – extracts subject, place, and action from large video datasets, optimizes prompts with large language models, and validates them via human‑LLM collaboration.
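As a rough illustration of how a five-dimensional suite like PMM could be combined into a single motion score, here is a minimal Python sketch. The dimension abbreviations come from the paper, but the weighting scheme and function names are illustrative assumptions, not VMBench's actual scoring formula.

```python
# Dimension names follow VMBench's PMM; the aggregation below is an
# illustrative weighted mean, NOT the paper's actual scoring formula.
PMM_DIMENSIONS = ("CAS", "MSS", "OIS", "PAS", "TCS")

def aggregate_pmm(scores, weights=None):
    """Combine per-dimension scores (each in [0, 1]) into a weighted mean."""
    if weights is None:
        weights = {d: 1.0 for d in PMM_DIMENSIONS}  # equal weighting by default
    total = sum(weights[d] for d in PMM_DIMENSIONS)
    return sum(scores[d] * weights[d] for d in PMM_DIMENSIONS) / total

video_scores = {"CAS": 0.72, "MSS": 0.85, "OIS": 0.90, "PAS": 0.60, "TCS": 0.88}
print(round(aggregate_pmm(video_scores), 3))  # -> 0.79
```

A weighted mean also makes it easy to emphasize a single dimension (e.g., upweighting CAS, which the ablations below identify as the strongest contributor) without changing the interface.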
Experimental Results
Human‑perception alignment experiments show PMM achieves a Spearman correlation of up to 54.5% with expert scores, outperforming rule‑based and multimodal large‑model baselines. Ablation studies confirm each metric’s contribution, especially CAS.
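Alignment of the kind reported above is typically measured with the Spearman rank correlation between a metric's scores and expert ratings over the same set of videos. Below is a minimal pure-Python sketch (no tie handling); the sample scores are made up for illustration.

```python
# Spearman rank correlation between metric scores and human ratings.
# Simplified: assumes no tied values (ties would need fractional ranks).

def ranks(values):
    """Rank positions (1-based) of each value in ascending order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank + 1
    return r

def spearman(x, y):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1)) over rank differences d."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

metric_scores = [0.79, 0.55, 0.91, 0.40, 0.66]  # hypothetical PMM scores
expert_scores = [4, 2, 5, 1, 3]                 # hypothetical human ratings
print(spearman(metric_scores, expert_scores))   # identical ordering -> 1.0
```

Because Spearman operates on ranks rather than raw values, it rewards a metric that orders videos the way humans do, even if its absolute scale differs from the rating scale.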
Qualitative analysis on six mainstream video models demonstrates diverse strengths and weaknesses, guiding future development.
LD‑RPS: Zero‑Shot Unified Image Restoration
LD‑RPS proposes a latent‑diffusion recurrent posterior sampling framework that restores images without training data. It uses multimodal large language models to generate semantic prompts from the degraded image, a Feature‑Pixel Alignment Module (F‑PAM) to bridge latent and image spaces, and a recurrent posterior sampling strategy for progressive quality improvement.
Paper Highlights
Unsupervised zero‑shot image restoration across multiple degradation types.
F‑PAM aligns diffusion intermediate results with the degraded input.
Recurrent posterior sampling refines the restoration iteratively.
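The recurrent refinement idea can be caricatured as alternating a stochastic perturbation step with a data-fidelity pull toward consistency with the degraded observation. The toy sketch below works on plain vectors with a known scalar degradation, standing in for LD-RPS's latent-diffusion sampler; all names, step sizes, and the degradation model here are assumptions for illustration, not the paper's method.

```python
import random

# Toy stand-in for recurrent posterior sampling: the latent diffusion
# model is replaced by annealed Gaussian perturbation, and the
# degradation is a known uniform attenuation.

def degrade(x, factor=0.5):
    """Assumed degradation operator: uniform intensity attenuation."""
    return [v * factor for v in x]

def posterior_sample(y, rounds=200, step=0.5, seed=0, factor=0.5):
    rng = random.Random(seed)
    x = [0.0] * len(y)  # start from a blank estimate
    for t in range(rounds):
        # "diffusion" step: annealed exploration noise around the estimate
        noise = 0.01 * (1 - t / rounds)
        x = [v + rng.gauss(0, noise) for v in x]
        # data-fidelity step: descend 0.5 * ||degrade(x) - y||^2,
        # pulling the estimate toward consistency with the observation
        resid = [d - yi for d, yi in zip(degrade(x, factor), y)]
        x = [v - step * factor * r for v, r in zip(x, resid)]
    return x

clean = [1.0, -2.0, 3.0]
observed = degrade(clean)
restored = posterior_sample(observed)
print([round(v, 2) for v in restored])  # close to the clean signal
```

The loop structure, not the arithmetic, is the point: each round samples near the current estimate and then corrects it against the degraded input, which is the progressive-refinement pattern the paper describes.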
Experimental Results
LD‑RPS achieves state‑of‑the‑art performance on low‑light enhancement, dehazing, denoising, and colorization benchmarks, often surpassing specialized single‑task methods while requiring only a single degraded image as input.
Conclusion and Outlook
VMBench provides a standardized, perception‑aligned evaluation platform for video motion generation, and LD‑RPS demonstrates a versatile, zero‑shot approach to image restoration, both advancing AI research toward more human‑aligned generative systems.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.
