How UnityVideo Unifies Multimodal Training to Boost Video Generation

UnityVideo, a new vision framework from HKUST, CUHK, Tsinghua and Kuaishou, unifies training across depth, flow, pose, segmentation and RGB modalities, achieving faster convergence, higher video quality, zero‑shot generalization and stronger physical reasoning compared with existing single‑modality video generators.

AI Frontier Lectures

From Text LLMs to Vision LLMs

Large language models (LLMs) such as GPT and Claude achieve strong generalization by jointly training on multiple text sub-modalities (natural language, code, math). A similar opportunity exists for vision: most video generators are trained only on RGB, limiting physical understanding.

Motivation for a Unified Multimodal Video Model

Current video generation models excel at pixel quality but ignore depth, motion, pose and segmentation cues. A model that perceives color, texture, depth, motion trajectories and body structure simultaneously can achieve deeper world understanding.

UnityVideo: Core Idea

UnityVideo trains a single architecture on RGB video together with five auxiliary visual modalities—depth maps, optical flow, DensePose, skeletal information and instance segmentation—using a unified loss. Joint training provides complementary supervision, speeds convergence and yields higher final performance. Each auxiliary modality contributes a distinct cue:

Instance Segmentation: helps the model distinguish object categories.

DensePose: teaches the model the structure of human bodies.

Skeletal Information: encodes fine-grained motion patterns.

Depth Maps: reveal the 3-D geometry of scenes.

Optical Flow: captures pixel-level motion.
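
To make this joint-supervision setup concrete, the sketch below shows how a single multimodal training sample might be laid out as tensors. The field names, shapes and value ranges are illustrative assumptions for exposition, not the paper's actual data schema.

```python
import torch

# Hypothetical layout of one multimodal training sample.
# T frames at H x W resolution; all shapes are assumptions.
T, H, W = 16, 256, 256
sample = {
    "rgb":          torch.rand(T, 3, H, W),          # RGB frames in [0, 1]
    "depth":        torch.rand(T, 1, H, W),          # per-pixel depth maps
    "flow":         torch.randn(T, 2, H, W),         # optical flow (dx, dy)
    "densepose":    torch.rand(T, 3, H, W),          # DensePose IUV body maps
    "skeleton":     torch.rand(T, 3, H, W),          # rendered keypoint maps
    "segmentation": torch.randint(0, 21, (T, H, W)), # instance/class IDs
    "caption":      "a person dancing in a park",    # text condition
}
```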

Technical Innovations

Dynamic Task Routing

UnityVideo supports three training paradigms within one architecture:

Conditional Generation: generate RGB video from an auxiliary modality (e.g., depth).

Modality Estimation: predict auxiliary modalities from RGB video.

Joint Generation: generate both RGB video and auxiliary modalities from text.

A dynamic noise‑scheduling strategy randomly selects a training mode each iteration and applies different noise levels to the corresponding tokens, preventing catastrophic forgetting and allowing the three objectives to coexist.
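
As a minimal sketch of how this routing could look in a diffusion-style training loop, the snippet below samples one of the three modes each iteration and noises only that mode's target tokens, keeping the condition tokens clean. The mode names, the toy noising blend and the token split are assumptions based on the description above, not the authors' implementation.

```python
import random
import torch

MODES = ("conditional_generation", "modality_estimation", "joint_generation")

def route_and_noise(rgb_tokens, aux_tokens, scheduler_T=1000):
    """Pick a training mode, then noise only that mode's target tokens."""
    mode = random.choice(MODES)
    t = torch.randint(1, scheduler_T, (1,)).item()

    def noise(x, timestep):
        # Toy DDPM-style forward noising; a real scheduler would use its
        # cumulative alpha schedule instead of this linear blend.
        alpha = 1.0 - timestep / scheduler_T
        return alpha ** 0.5 * x + (1.0 - alpha) ** 0.5 * torch.randn_like(x)

    if mode == "conditional_generation":    # aux condition -> RGB target
        rgb_tokens = noise(rgb_tokens, t)
        t_rgb, t_aux = t, 0
    elif mode == "modality_estimation":     # RGB condition -> aux target
        aux_tokens = noise(aux_tokens, t)
        t_rgb, t_aux = 0, t
    else:                                   # text -> RGB + aux jointly
        rgb_tokens = noise(rgb_tokens, t)
        aux_tokens = noise(aux_tokens, t)
        t_rgb = t_aux = t

    return mode, rgb_tokens, aux_tokens, (t_rgb, t_aux)
```

In this sketch, condition tokens always carry timestep 0, so all three objectives look to the backbone like the same denoising task with different clean/noisy token splits.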

Modality Switcher

Two complementary designs separate modality signals:

Context Learner (In-Context Learner): injects modality-specific textual prompts (e.g., "depth map", "human skeleton") so the model knows which modality it is handling.

Modality-Adaptive Switcher: learns a distinct embedding for each modality that modulates the AdaLN-Zero parameters (scale, shift, gate) inside DiT blocks, enabling plug-and-play switching at inference time.
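
The switcher half of this design can be pictured as a per-modality embedding that produces the AdaLN-Zero (scale, shift, gate) triple. The module below is a plausible sketch of that idea with invented names; it is not the paper's code.

```python
import torch
import torch.nn as nn

class ModalityAdaptiveAdaLNZero(nn.Module):
    """AdaLN-Zero modulation driven by a learned per-modality embedding."""

    def __init__(self, hidden_dim, modalities=("rgb", "depth", "flow",
                                               "densepose", "segmentation")):
        super().__init__()
        self.mod_idx = {m: i for i, m in enumerate(modalities)}
        self.embed = nn.Embedding(len(modalities), hidden_dim)
        self.to_mod = nn.Linear(hidden_dim, 3 * hidden_dim)
        nn.init.zeros_(self.to_mod.weight)  # zero-init: modulation starts
        nn.init.zeros_(self.to_mod.bias)    # as an identity, per AdaLN-Zero
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)

    def forward(self, x, modality):
        # x: (batch, tokens, hidden). Returns the modulated input plus the
        # gate that multiplies the block output before the residual add.
        idx = torch.tensor(self.mod_idx[modality], device=x.device)
        scale, shift, gate = self.to_mod(self.embed(idx)).chunk(3, dim=-1)
        return self.norm(x) * (1 + scale) + shift, gate

# Usage inside a DiT block (sketch):
#   h, gate = switcher(x, "depth")
#   x = x + gate * attention_or_mlp(h)
```

Because the gate is zero-initialized, each modality's modulation starts as a no-op, and switching modalities at inference only requires passing a different modality name.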

Progressive Curriculum Learning

Training proceeds in two stages:

Stage 1: train only the pixel-aligned modalities (flow, depth, DensePose) on carefully filtered single-person clips to build a solid spatial-correspondence foundation.

Stage 2: introduce all modalities and diverse scenes (multi-person, generic), letting the model master all five auxiliary modalities and handle unseen modality combinations zero-shot.
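
One simple way to realize this curriculum is a step-based gate on which data subsets and modalities are sampled; the stage boundary and subset names below are illustrative assumptions, not the paper's schedule.

```python
# Illustrative two-stage curriculum gate.
PIXEL_ALIGNED = ("flow", "depth", "densepose")
ALL_AUXILIARY = PIXEL_ALIGNED + ("skeleton", "segmentation")

def curriculum_config(step, stage1_steps=50_000):
    """Return the data subsets and modalities active at a training step."""
    if step < stage1_steps:
        # Stage 1: filtered single-person clips, pixel-aligned modalities
        # only, to establish spatial correspondence.
        return {"data": ["single_person"], "modalities": PIXEL_ALIGNED}
    # Stage 2: all modalities plus diverse multi-person and generic scenes.
    return {"data": ["single_person", "multi_person", "generic"],
            "modalities": ALL_AUXILIARY}
```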

Dataset and Benchmark

The authors constructed the OpenUni dataset with 1.3 M multimodal video samples (370 k single‑person, 97 k double‑person, 489 k clips from Koala36M and 344 k from OpenS2V). Batches are split into four balanced groups to ensure uniform sampling across modalities and data sources.
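
A minimal sketch of that balanced batching, assuming an equal quota per group (the exact grouping criterion is not spelled out above, so the four source-based groups here are an assumption):

```python
import random

def balanced_batch(groups, batch_size):
    """Draw an equal number of sample indices from each data group."""
    per_group = batch_size // len(groups)  # assumes divisibility
    batch = []
    for indices in groups.values():
        batch.extend(random.sample(indices, per_group))
    random.shuffle(batch)
    return batch

# Placeholder index ranges standing in for the four OpenUni sources.
groups = {
    "single_person": list(range(0, 1000)),
    "double_person": list(range(1000, 2000)),
    "koala36m":      list(range(2000, 3000)),
    "opens2v":       list(range(3000, 4000)),
}
batch = balanced_batch(groups, batch_size=16)  # 4 samples per source
```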

They also built the UniBench evaluation suite (30 k samples, including 200 high-quality Unreal Engine renders with ground-truth depth and flow) for comprehensive, fair assessment.

Experimental Results

Quantitative Multi‑Task Performance

UnityVideo outperforms baselines on three task families:

Text-to-Video Generation: best scores on all metrics (background consistency 97.44 %, aesthetic quality 64.12 %).

Controllable Generation: superior background/overall consistency and dynamic degree (64.42 %).

Modality Estimation: video segmentation mIoU 68.82 %, depth-estimation Abs Rel 0.022, far exceeding dedicated single-task models.
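
For reference, both estimation metrics are standard: mIoU averages per-class overlap between predicted and ground-truth masks, and Abs Rel averages the relative depth error (lower is better):

```latex
\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{|P_c \cap G_c|}{|P_c \cup G_c|},
\qquad
\mathrm{Abs\,Rel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert \hat{d}_i - d_i \rvert}{d_i}
```

where P_c and G_c are the predicted and ground-truth masks for class c, and d̂_i, d_i are the predicted and ground-truth depths over N valid pixels.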

Qualitative Comparisons

UnityVideo shows more accurate physical reasoning (e.g., correct light refraction in water), fewer background flickers and object distortions in controllable generation, and finer edges in depth/flow estimation.

Zero-Shot Generalization

Training on “two persons” segmentation enables the model to generalise to unseen “two objects” scenes, demonstrating compositional understanding rather than pattern memorisation.

Ablation Studies

Multimodal Complementarity: joint training yields lower final loss and higher image quality than single-modality training.

Multi-Task Necessity: training only the controllable-generation task degrades performance; unified multi-task training restores and surpasses it.

Architecture Effectiveness: both the context learner and modality switcher individually improve results, and their combination provides a further significant boost.

User Study

Human evaluators rated UnityVideo highest on physical quality (38.50 %), semantic quality and overall preference, surpassing strong existing systems such as Kling 1.6 and HunyuanVideo.

Limitations and Future Work

Occasional VAE-induced artifacts appear; scaling to larger backbones and adding more visual modalities could further enhance the model's emergent abilities.

Conclusion

UnityVideo demonstrates that unifying multiple visual modalities and tasks yields faster convergence, better quantitative metrics, stronger physical reasoning and robust zero-shot generalization—mirroring how LLMs benefit from jointly training across text sub-modalities. The work highlights the importance of architecture design, curriculum learning and multidimensional evaluation for building truly world-understanding video models.

Paper: https://arxiv.org/abs/2512.07831

Code: https://github.com/dvlab-research/UnityVideo

Project page: https://jackailab.github.io/Projects/UnityVideo
