How AnimeReward and GAPO Transform Anime Video Generation with Human Feedback

Researchers at Bilibili present Index‑Anisora, an open‑source anime video generation framework. They build a 30k‑sample reward dataset, introduce the multi‑dimensional AnimeReward model and a Gap‑Aware Preference Optimization (GAPO) method, and demonstrate through extensive automatic and human evaluations that their approach significantly outperforms baseline video generators.

Bilibili Tech

Overview

Index‑Anisora is an open‑source animation video generation model released by Bilibili. It builds on the earlier AniSora work (accepted at IJCAI 25) and adds a reinforcement‑learning framework specifically designed for anime‑style video creation.

Alignment Pipeline and AnimeReward

The authors propose a full alignment pipeline illustrated in Figure 1. They construct a high‑quality reward dataset containing 30,000 human‑annotated anime video clips. Human evaluation covers two major aspects: Visual Appearance (visual smoothness, visual motion, visual appeal) and Visual Consistency (text‑video consistency, image‑video consistency, character consistency). Based on these six dimensions they develop AnimeReward, a multi‑dimensional reward system that uses dedicated vision‑language models for each dimension. To improve training efficiency they introduce Gap‑Aware Preference Optimization (GAPO), which incorporates the preference gap between positive and negative samples into the loss.

Dataset Construction

To ensure diverse motion categories, the team collected 5,000 real anime videos covering actions such as speaking, walking, waving, kissing, crying, hugging, and pushing/pulling, with roughly 30–50 clips per standardized action label. Prompts are generated automatically with the Qwen2‑VL model and refined using the CogVideoX strategy. Five state‑of‑the‑art image‑to‑video generators (Hailuo, Vidu, OpenSora, OpenSora‑Plan, CogVideoX) then produce diverse candidate outputs, yielding 30,000 reward samples in total plus a separate 6,000‑clip test set with no overlap in initial frames or prompts.

Reward Model Details

Visual Smoothness

A fine‑tuned Mantis‑8B‑Idefics2 visual encoder with a regression head predicts smoothness scores (formula shown in Figure 2). Variables: I_i, the i‑th frame; N, the total number of frames; Ev, the visual encoder; Reg, the regression head.
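From the variables listed above, a natural reading is that frame features from Ev are pooled over the N frames and mapped to a scalar by Reg. The sketch below follows that reading with toy stand‑ins for Ev and Reg (the real models are the fine‑tuned Mantis‑8B‑Idefics2 encoder and its learned regression head); the pooling choice is an assumption, not the paper's exact formula.

```python
import numpy as np

def smoothness_score(frames, encode, regress):
    """Encode each of the N frames with Ev, mean-pool the features,
    and map the pooled vector to a scalar smoothness score with Reg."""
    feats = np.stack([encode(f) for f in frames])  # (N, d)
    return float(regress(feats.mean(axis=0)))

# Toy stand-ins: the real Ev/Reg are learned networks.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
encode = lambda frame: frame.mean(axis=(0, 1))   # (H, W, 8) -> (8,)
regress = lambda z: w @ z

video = rng.random((16, 4, 4, 8))                # 16 toy frames
score = smoothness_score(video, encode, regress)
```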

Visual Motion

ActionCLIP is employed to score motion intensity. Motion prompts such as “the protagonist performs large‑scale actions like running or dancing” guide the model; cosine similarity between the video features (from the M_CLIP encoder) and the prompt features yields a motion score (Figure 3).
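The similarity step reduces to plain cosine similarity between one video feature vector and a set of prompt feature vectors. A minimal sketch, using random vectors in place of the ActionCLIP encoders (all names here are illustrative):

```python
import numpy as np

def motion_score(video_feat, prompt_feats):
    """Mean cosine similarity between a single video feature and a set
    of motion-prompt text features; higher means stronger motion."""
    v = video_feat / np.linalg.norm(video_feat)
    p = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    return float((p @ v).mean())

rng = np.random.default_rng(1)
video_feat = rng.normal(size=512)
prompt_feats = rng.normal(size=(2, 512))  # e.g. two motion prompts
s = motion_score(video_feat, prompt_feats)
```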

Visual Appeal

An aesthetic regression model is trained on keyframes extracted from videos. The model learns human aesthetic preferences for anime images and outputs an appeal score (Figure 4). Variables: I_i – keyframe, K – number of keyframes, SigLIP – feature encoder, Aes – aesthetic scorer.

Text‑Video Consistency

Both visual and text encoders are fine‑tuned and a regression head maps their joint representation to a consistency score (Figure 5). Variables: Ev – visual encoder, Et – text encoder, Reg – regression head, T – text prompt, V – video.

Image‑Video Consistency

Similar to text‑video consistency but the reference is the input image (Figure 6). Variables: V – video, Ip – input image, Ev – visual encoder, Reg – regression head.

Character Consistency

A multi‑stage process extracts character masks using GroundingDINO, SAM, and a tracking tool, then refines them with a fine‑tuned BLIP model to associate masks with specific anime characters (Figure 7). During inference, cosine similarity between generated character features and a reference library yields a consistency score (Figure 8).
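The inference step described above (cosine similarity against a reference library) can be sketched as a best‑match lookup. Only the matching is shown; the mask extraction (GroundingDINO, SAM, BLIP) is assumed to have already produced per‑character features, and averaging the best matches is an assumption based on the prose:

```python
import numpy as np

def character_consistency(char_feats, ref_library):
    """Match each generated-character feature to its closest reference
    by cosine similarity, then average the best-match scores."""
    c = char_feats / np.linalg.norm(char_feats, axis=1, keepdims=True)
    r = ref_library / np.linalg.norm(ref_library, axis=1, keepdims=True)
    return float((c @ r.T).max(axis=1).mean())
```

Identical generated and reference features give a perfect score of 1.0, and unrelated (orthogonal) features score 0.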

Training and GAPO

GAPO modifies the traditional DPO loss by adding a reward‑gain term:

Reward gain formula

where α controls the strength of the gain. The gap between the normalized rewards of the preferred (v_w) and dispreferred (v_l) videos weights the loss, amplifying the influence of pairs with large preference differences.
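The post shows the reward‑gain formula only as an image, so the form below is a reconstruction from the prose alone: a DPO‑style loss whose per‑pair term is weighted by the normalized reward gap raised to α. The exact equation in the paper may differ; every name (`logp_w`, `beta`, etc.) is illustrative.

```python
import numpy as np

def gapo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, r_w, r_l,
              beta=0.1, alpha=1.0):
    """DPO-style loss with a gap-aware weight: pairs whose normalized
    rewards differ more (larger r_w - r_l) contribute more. A sketch
    of the idea, not the paper's exact equation."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    log_sigmoid = -np.log1p(np.exp(-margin))        # log sigma(margin)
    gap_weight = np.clip(r_w - r_l, 0.0, None) ** alpha
    return float(-(gap_weight * log_sigmoid).mean())

# One preference pair: v_w slightly more likely under the policy than
# under the reference, v_l slightly less likely.
loss = gapo_loss(np.array([-1.0]), np.array([-2.0]),
                 np.array([-1.2]), np.array([-1.8]),
                 r_w=np.array([0.9]), r_l=np.array([0.4]))
```

A pair with zero reward gap contributes nothing, while widening the gap at a fixed margin increases the loss, matching the amplification behavior described above.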

Experiments

Baseline: CogVideoX‑5B. An initial set of 2,000 raw anime images and prompts is used to generate candidate videos. AnimeReward scores each candidate; the highest‑scoring and lowest‑scoring videos form a preference pair, resulting in 2,000 pairs for alignment training.
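The best‑versus‑worst pairing step is straightforward to sketch; `reward` stands in for AnimeReward scoring and the string candidates are purely illustrative:

```python
def make_preference_pair(candidates, reward):
    """Score every candidate with the reward model and pair the
    highest-scoring video against the lowest-scoring one."""
    ranked = sorted(candidates, key=reward)
    return ranked[-1], ranked[0]   # (preferred v_w, dispreferred v_l)

# Toy usage: candidates are ids, reward is a lookup table.
scores = {"a": 0.2, "b": 0.9, "c": 0.5}
v_w, v_l = make_preference_pair(list(scores), scores.get)
```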

Evaluation: Automatic metrics (VBench‑I2V, VideoScore, AnimeReward) and human studies with three professional annotators. A video is considered a win only if at least two annotators agree on its superiority.
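The majority‑vote win rule above amounts to a two‑of‑three threshold:

```python
def wins(votes):
    """True only if at least two of the three annotator votes
    (booleans) favor the candidate video."""
    return sum(votes) >= 2
```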

Results: Across VBench‑I2V, AnimeReward, and VideoScore the proposed GAPO‑aligned model achieves the highest overall scores, especially on Text‑Video Consistency and Character Consistency (Table 1, Table 2). Human evaluation shows a win rate above 60% compared to the baseline and SFT models (Figure 9). Ablation studies confirm that GAPO outperforms standard DPO on all three evaluation suites (Table 3) and that AnimeReward provides a stronger training signal than VideoScore (Table 4, Figure 10).

Conclusion

The paper introduces AnimeReward, the first multi‑dimensional reward model for anime video generation, and GAPO, a gap‑aware preference optimization technique. Experiments demonstrate that even with only baseline‑generated data, the alignment pipeline markedly improves visual quality, consistency, and overall human preference alignment.

References

Peng Wang et al., “Qwen2‑VL: Enhancing vision‑language model’s perception of the world at any resolution,” arXiv:2409.12191, 2024.

Zhuoyi Yang et al., “CogVideoX: Text‑to‑video diffusion models with an expert transformer,” arXiv:2408.06072, 2024.

Zangwei Zheng et al., “Open‑Sora: Democratizing efficient video production for all,” arXiv:2412.20404, 2024.

Bin Lin et al., “Open‑Sora‑Plan: Open‑source large video generation model,” arXiv:2412.00131, 2024.

Dongfu Jiang et al., “Mantis: Interleaved multi‑image instruction tuning,” arXiv:2405.01483, 2024.

Mengmeng Wang et al., “ActionCLIP: A new paradigm for video action recognition,” arXiv:2109.08472, 2021.

Tianhe Ren et al., “Grounding‑DINO 1.5: Advance the ‘edge’ of open‑set object detection,” arXiv:2405.10300, 2024.

Nikhila Ravi et al., “SAM 2: Segment anything in images and videos,” ICLR, 2025.

Junnan Li et al., “BLIP: Bootstrapping language‑image pre‑training for unified vision‑language understanding and generation,” ICML, 2022.

Rafael Rafailov et al., “Direct preference optimization: Your language model is secretly a reward model,” NeurIPS, 2023.

Ziqi Huang et al., “VBench: Comprehensive benchmark suite for video generative models,” CVPR, 2024.

Xuan He et al., “VideoScore: Building automatic metrics to simulate fine‑grained human feedback for video generation,” EMNLP, 2024.
