How Human Feedback Supercharges Video Generation – The VideoAlign Pipeline Explained

This article details a new research pipeline that leverages large‑scale human preference data, a multi‑dimensional video reward model, and specialized alignment algorithms to dramatically improve video generation quality, motion fidelity, and text‑video consistency, with open‑source code and benchmarks for reproducibility.


Background

Reinforcement learning from human feedback (RLHF) has improved large language models, but its use for video generation is still immature. The main technical challenges are (1) building a high‑quality human preference dataset, (2) training a robust multi‑dimensional video reward model, and (3) incorporating the reward model into the training of video generators.

Four‑Stage Alignment Pipeline

The paper Improving Video Generation with Human Feedback (NeurIPS 2025) proposes a systematic pipeline:

Collect a large‑scale human preference dataset.

Train a video reward model (VideoReward).

Construct a benchmark (VideoGen‑RewardBench) for evaluating reward models.

Apply three alignment algorithms to flow‑matching video generators.

1. Preference Data Construction

A total of 182,000 preference annotations were gathered over outputs from 12 state-of-the-art text-to-video models. From 16,000 distinct text prompts, 108,000 video clips were generated and paired into 82,000 annotated triplets of the form (text prompt, video A, video B), each carrying a human preference label. Annotators judged each pair on three dimensions: visual quality (VQ), motion quality (MQ), and text-video alignment (TA). They also assigned absolute Likert scores (1-5) to each video, yielding a dual-channel labeling scheme that combines pairwise comparisons with absolute ratings.
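To make the dual-channel scheme concrete, here is a minimal sketch of what a single record could look like. The schema and field names are hypothetical illustrations, not the released dataset's actual format.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical schema for one preference record; the field names are
# illustrative and not taken from the released dataset.
@dataclass
class PreferenceRecord:
    prompt: str                            # shared text prompt
    video_a: str                           # path or ID of video A
    video_b: str                           # path or ID of video B
    # Pairwise channel: per-dimension preference labels
    pref_vq: Literal["A", "B", "tie"]      # visual quality
    pref_mq: Literal["A", "B", "tie"]      # motion quality
    pref_ta: Literal["A", "B", "tie"]      # text-video alignment
    # Pointwise channel: absolute Likert scores (1-5) per video
    scores_a: dict                         # e.g. {"VQ": 4, "MQ": 3, "TA": 5}
    scores_b: dict                         # e.g. {"VQ": 2, "MQ": 4, "TA": 5}
```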

2. VideoReward – Multi‑Dimensional Reward Model

Using the preference dataset, the authors train VideoReward and investigate three design choices:

Modeling approach: regression (predict absolute scores) vs. Bradley‑Terry (BT) model (predict pairwise preferences). Experiments show BT consistently outperforms regression, especially with limited data.

Tie handling: A traditional BT model discards "Tie" labels. Incorporating ties via a Bradley-Terry-with-Ties (BTT) loss improves robustness and accuracy (see the sketch after this list).

Dimensional decoupling: Shared representations cause interference across VQ, MQ, and TA. Dedicated tokens for each dimension enable independent feature extraction and reduce cross‑dimensional contamination.
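To illustrate the tie handling, below is a minimal sketch of a Bradley-Terry-with-Ties loss in the Rao-Kupper style, assuming the reward model outputs one scalar score per video for a given dimension. The paper's exact parameterization may differ.

```python
import torch
import torch.nn.functional as F

def btt_loss(score_a: torch.Tensor,
             score_b: torch.Tensor,
             label: torch.Tensor,
             tie_margin: float = 1.0) -> torch.Tensor:
    """Bradley-Terry-with-Ties negative log-likelihood (Rao-Kupper style).

    score_a, score_b: reward-model scores for videos A and B, shape (batch,).
    label: 0 = A preferred, 1 = B preferred, 2 = tie.
    tie_margin: width of the tie band; at 0 this collapses to plain BT,
    which assigns ties (near-)zero probability.
    """
    p_a = torch.sigmoid(score_a - score_b - tie_margin)  # P(A beats B)
    p_b = torch.sigmoid(score_b - score_a - tie_margin)  # P(B beats A)
    p_tie = (1.0 - p_a - p_b).clamp_min(1e-8)            # remaining mass
    log_probs = torch.log(torch.stack([p_a, p_b, p_tie], dim=-1))
    return F.nll_loss(log_probs, label)
```

Unlike plain BT, the tie band lets annotations with no clear winner contribute gradient instead of being thrown away.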

3. VideoGen‑RewardBench – Evaluation Benchmark

The benchmark contains 25,000 evaluation triplets, each annotated on VQ, MQ, TA, and an overall quality score. Reward models are evaluated both on recent T2V systems (e.g., Sora, Kling) and on earlier models (e.g., CogVideo), using two benchmarks:

VideoGen‑RewardBench: primary benchmark focusing on the latest generation models.

GenAI‑Bench: a supplementary, pre-existing benchmark used to measure generalization to earlier models.
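Reward models on benchmarks like these are typically compared by pairwise preference accuracy per dimension. A minimal sketch, using an illustrative record schema rather than the released evaluation script:

```python
def pairwise_accuracy(records, score_fn, dim: str) -> float:
    """Fraction of non-tie pairs where the reward model ranks the
    human-preferred video higher on the given dimension.

    records:  benchmark triplets (illustrative schema, not the real one).
    score_fn: callable (prompt, video) -> float, the reward model.
    dim:      one of "VQ", "MQ", "TA", "Overall".
    """
    correct = total = 0
    for rec in records:
        label = rec["preference"][dim]
        if label == "tie":  # ties are commonly excluded from accuracy
            continue
        s_a = score_fn(rec["prompt"], rec["video_a"])
        s_b = score_fn(rec["prompt"], rec["video_b"])
        correct += (("A" if s_a > s_b else "B") == label)
        total += 1
    return correct / max(total, 1)
```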

4. Alignment Algorithms for Flow‑Matching Generators

Three algorithms are adapted to flow‑matching video models:

Flow‑DPO: A direct preference-optimization objective applied during training (a sketch follows this list).

Flow‑RWR: Reward-weighted regression that reweights the flow-matching training loss with reward signals.

Flow‑NRG (Reward Guidance): Inference‑time steering of sampling using the trained reward model.
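For intuition on the best-performing method, here is a minimal sketch of a Flow-DPO-style loss, carried over by analogy from Diffusion-DPO to a flow-matching velocity predictor. Variable names are illustrative, and details such as timestep weighting may differ from the paper.

```python
import torch
import torch.nn.functional as F

def flow_dpo_loss(v_theta_w, v_ref_w, u_w,
                  v_theta_l, v_ref_l, u_l,
                  beta: float = 1.0) -> torch.Tensor:
    """DPO-style preference loss for flow matching (illustrative sketch).

    v_theta_*: velocities predicted by the model being trained on noisy
               versions of the preferred (w) and rejected (l) videos.
    v_ref_*:   predictions of the frozen reference model at the same points.
    u_*:       target velocities from the flow-matching interpolation.
    beta:      constant regularization strength (the paper reports that a
               constant beta beats a timestep-dependent one).
    """
    # Per-sample flow-matching errors, averaged over all non-batch dims.
    def err(v, u):
        return ((v - u) ** 2).flatten(1).mean(dim=1)

    # Improvement of the trained model over the reference on each video.
    diff_w = err(v_theta_w, u_w) - err(v_ref_w, u_w)
    diff_l = err(v_theta_l, u_l) - err(v_ref_l, u_l)
    # Push the model to reduce error on winners more than on losers.
    return -F.logsigmoid(-0.5 * beta * (diff_w - diff_l)).mean()
```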

All three methods share the same underlying objective: align the model distribution P_θ(x|c) with the human-derived preference distribution by minimizing the KL divergence between them.
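Equivalently, this is the standard KL-regularized reward-maximization objective from RLHF, sketched here in common notation (the paper's exact formulation may differ):

```latex
\max_{\theta}\;
\mathbb{E}_{c \sim \mathcal{D},\, x \sim p_{\theta}(x \mid c)}\bigl[\, r(x, c) \,\bigr]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\bigl[\, p_{\theta}(x \mid c) \;\|\; p_{\mathrm{ref}}(x \mid c) \,\bigr]
```

Here r is the reward model (VideoReward), p_ref is the pretrained generator, and β controls how far the aligned model may drift from it. Flow-DPO optimizes this implicitly from pairwise data, Flow-RWR approximates it by reward weighting, and Flow-NRG skips training and applies the trade-off at sampling time.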

5. Experimental Findings

Key observations from extensive ablations:

Flow‑DPO yields the largest improvements across VQ, MQ, and TA metrics.

A constant β in Flow-DPO outperforms the time-dependent β inherited from Diffusion-DPO, likely because a constant β supplies a uniform optimization signal across timesteps.

VideoReward surpasses existing baselines on VideoGen‑RewardBench and generalizes well to GenAI‑Bench.

Qualitative results show sharper frames, smoother motion, and stronger text-video consistency after alignment.

Resources

All code, data, and evaluation scripts are publicly released at:

https://github.com/KwaiVGI/VideoAlign

Tags: video generation, benchmark, RLHF, AI alignment, reward modeling, human feedback
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
