How UNIFIEDREWARD Breaks Task Boundaries to Boost Image and Video Performance
The paper introduces UNIFIEDREWARD, the first unified reward model for multimodal understanding and generation that supports pairwise ranking and pointwise scoring, builds a 236K human‑preference dataset across image and video tasks, and uses DPO to align VLMs and diffusion models, achieving significant performance gains on both image and video benchmarks.
Highlights
Constructed a large‑scale human‑preference dataset covering multiple visual tasks and developed UNIFIEDREWARD, the first unified reward model for multimodal understanding and generation, capable of pairwise ranking and pointwise scoring.
Proposed a general workflow for preference alignment of image and video understanding/generation models, a relatively under‑explored area.
Experiments show that jointly learning image and video tasks yields synergistic improvements, extending reward model applicability across visual applications.
Problem Statement
Task‑specific limitation: Existing reward models are designed for single tasks and lack cross‑task adaptability.
High data collection cost: Gathering large‑scale human feedback is time‑consuming and resource‑intensive.
Task isolation: Visual tasks are intrinsically related, yet current methods do not exploit these connections.
Proposed Solution
Unified Reward Model: Introduced UNIFIEDREWARD, the first model that can evaluate both multimodal understanding and generation via pairwise ranking and pointwise scoring.
Large‑scale Dataset: Built a unified human‑preference dataset with ~236K entries covering image and video understanding/generation.
Automatic Data Generation: Used UNIFIEDREWARD to generate high‑quality preference pairs, applying multi‑stage filtering (pairwise ranking followed by pointwise selection) to the outputs of specific baseline models.
Direct Preference Optimization (DPO): Aligned model outputs with human preferences without explicit reward modeling.
Unified Reward Model Training
The base architecture is LLaVA‑OneVision 7B (OV‑7B). Training hyper‑parameters include batch size 2, gradient accumulation steps 16, learning rate (unspecified), and warm‑up ratio 0.3. The model learns to predict pairwise rankings and pointwise scores, and, when evaluation rationales are provided in the data, it generates them as well.
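Both evaluation modes can be served by the same backbone simply by changing the instruction. The following sketch is illustrative only: the prompt templates and the `vlm.generate` interface are assumptions, not the paper's exact formats.

```python
# Illustrative sketch: one VLM backbone, two evaluation modes.
# The prompt templates and the `vlm.generate` call are hypothetical.

PAIRWISE_PROMPT = (
    "You are given two responses to the same image/video and question.\n"
    "Question: {question}\nResponse A: {answer_a}\nResponse B: {answer_b}\n"
    "Which response is better? Answer with 'A' or 'B'."
)

POINTWISE_PROMPT = (
    "Rate how well the following response answers the question about the "
    "image/video on a scale from 1 to 10.\n"
    "Question: {question}\nResponse: {answer}\nScore:"
)

def rank_pair(vlm, visual_input, question, answer_a, answer_b):
    """Pairwise ranking: the model states which candidate is preferred."""
    prompt = PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    return vlm.generate(visual_input, prompt).strip()   # e.g. "A"

def score_single(vlm, visual_input, question, answer):
    """Pointwise scoring: the model emits a scalar quality score."""
    prompt = POINTWISE_PROMPT.format(question=question, answer=answer)
    return float(vlm.generate(visual_input, prompt))    # e.g. 7.0
```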
Unified Preference Dataset Construction
Data Generation: For each image/video‑question (or generation prompt), a VLM or diffusion model generates multiple candidate outputs.
Pairwise Ranking: Candidates are grouped and ranked pairwise, producing a selected list and a rejected list.
Pointwise Filtering: Both lists are scored pointwise; the final preference pairs are formed from the highest‑scored selected items and the lowest‑scored rejected items.
Example format for pairwise data: “Image/Video X is preferred over Image/Video Y”. When evaluation rationales are available, they are retained so the model also learns the reasoning behind human judgments.
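A minimal sketch of this two‑stage filtering, assuming hypothetical `generator.generate`, `reward_model.rank_pair`, and `reward_model.score` interfaces (these names are illustrative, not the paper's API):

```python
# Hedged sketch of the preference-pair construction described above.

def build_preference_pair(reward_model, generator, prompt, n_candidates=10):
    # Step 1: sample multiple candidate outputs for the same prompt.
    candidates = [generator.generate(prompt) for _ in range(n_candidates)]

    # Step 2: pairwise ranking splits candidates into selected / rejected lists.
    selected, rejected = [], []
    for cand_a, cand_b in zip(candidates[0::2], candidates[1::2]):
        winner = reward_model.rank_pair(prompt, cand_a, cand_b)  # "A" or "B"
        selected.append(cand_a if winner == "A" else cand_b)
        rejected.append(cand_b if winner == "A" else cand_a)

    # Step 3: pointwise scoring keeps the best selected and worst rejected item.
    chosen = max(selected, key=lambda c: reward_model.score(prompt, c))
    dispreferred = min(rejected, key=lambda c: reward_model.score(prompt, c))
    return {"prompt": prompt, "chosen": chosen, "rejected": dispreferred}
```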
Model Alignment
Once UNIFIEDREWARD is trained, it is used to align downstream multimodal models in two steps:
Preference Data Construction: Uses the unified model to filter and score candidate outputs.
Model Alignment: Applies DPO to align visual‑language models (VLMs) and diffusion models with the constructed preferences.
DPO for Multimodal Generation
For diffusion models (e.g., SDXL‑Turbo for image generation, T2V‑Turbo for video generation), DPO minimizes the denoising error on preferred samples while increasing it on dispreferred ones, using the loss formulation from prior work [38].
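A hedged PyTorch sketch of that objective, following the Diffusion‑DPO formulation referenced as [38] (β and the tensor shapes are illustrative; the paper's exact weighting may differ):

```python
import torch.nn.functional as F

def diffusion_dpo_loss(eps_pred_w, eps_ref_w, eps_pred_l, eps_ref_l,
                       noise_w, noise_l, beta=5000.0):
    """Preferred sample: lower denoising error than the frozen reference.
    Dispreferred sample: higher denoising error than the reference.
    Inputs are noise predictions / target noise of shape (B, C, H, W)."""
    # Per-sample MSE denoising errors for the trainable and reference models.
    err_w     = F.mse_loss(eps_pred_w, noise_w, reduction="none").mean(dim=(1, 2, 3))
    err_w_ref = F.mse_loss(eps_ref_w,  noise_w, reduction="none").mean(dim=(1, 2, 3))
    err_l     = F.mse_loss(eps_pred_l, noise_l, reduction="none").mean(dim=(1, 2, 3))
    err_l_ref = F.mse_loss(eps_ref_l,  noise_l, reduction="none").mean(dim=(1, 2, 3))

    # Reference-relative error gap between preferred and dispreferred samples.
    diff = (err_w - err_w_ref) - (err_l - err_l_ref)
    return -F.logsigmoid(-beta * diff).mean()
```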
DPO for Multimodal Understanding
For VLMs (e.g., LLaVA‑OneVision 7B, LLaVA‑Video), DPO maximizes the probability of preferred responses and minimizes that of dispreferred ones, encouraging better alignment with human judgments.
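This is the standard DPO loss over sequence log‑probabilities; a minimal sketch follows (β = 0.1 is an illustrative value, and inputs are per‑example log‑probabilities summed over response tokens under the policy and a frozen reference model):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Raise the reference-relative likelihood of the preferred response and
    lower that of the dispreferred one."""
    chosen_margin   = logp_chosen   - ref_logp_chosen    # log pi/pi_ref (preferred)
    rejected_margin = logp_rejected - ref_logp_rejected  # log pi/pi_ref (dispreferred)
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```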
Experiments
Setup
Reward Model Backbone: LLaVA‑OneVision 7B.
Multimodal Understanding DPO: Batch size 1, gradient accumulation 16, learning rate 0.1.
Multimodal Generation DPO: Image generation with SDXL‑Turbo (batch size 32), video generation with T2V‑Turbo (batch size 16), 5,000 training steps.
Dataset Scale: Video generation DPO – 10K preference pairs; other tasks – 14K pairs; 10 candidate outputs per prompt; 3 training epochs.
Reward Model Comparison (Image Understanding)
Compared against LLaVA‑Critic, Gemini‑1.5‑Pro, and GPT‑4o. UNIFIEDREWARD achieved 66.5% macro accuracy, surpassing LLaVA‑Critic’s 62.5%.
Reward Model Comparison (Image Generation)
Benchmarked against PickScore, HPSv2, ImageReward, VisionReward. VisionReward supports both image and video generation but trains separate models per task, whereas UNIFIEDREWARD uses a unified framework and attains higher scores across all metrics.
Reward Model Comparison (Video Generation)
Compared with VideoScore, LiFT, VisionReward, VideoReward. Despite fewer video preference pairs, UNIFIEDREWARD’s multi‑task learning yielded the best results, demonstrating that joint learning mitigates data scarcity.
Multi‑Task Evaluation Learning
Three training configurations for image understanding were explored: (1) single‑task, (2) image understanding + image generation, (3) image understanding + video understanding. Joint training improved overall accuracy by 5.3% and macro accuracy by 8.3% compared to single‑task training, confirming the benefit of cross‑task knowledge sharing.
DPO Comparison Results
For image understanding, UNIFIEDREWARD outperformed LLaVA‑Critic on all benchmarks, e.g., a 3.4% gain on LLaVABench. For video understanding, the method significantly exceeded baselines on MSRVTT, MSVD, and TGIF. For image generation, using the Pick‑a‑Pic dataset, UNIFIEDREWARD surpassed direct training on the raw data. For video generation, it outperformed VideoDPO, improving both quality and semantic consistency.
Qualitative Results
Figures show side‑by‑side comparisons of generated images and videos before and after applying UNIFIEDREWARD‑guided DPO, illustrating clearer details and better alignment with prompts.
Conclusion
The paper presents UNIFIEDREWARD, the first unified reward model for multimodal understanding and generation, capable of pairwise ranking and pointwise scoring. By fine‑tuning a pretrained VLM on a 236K‑sample unified dataset and using DPO for alignment, the approach achieves notable performance gains across image and video tasks, demonstrating that multi‑task joint learning enhances both robustness and generalization of visual models.
References
[1] Unified Reward Model for Multimodal Understanding and Generation.