UniVidX Sets New SOTA on Multiple Video Tasks – A Unified Multimodal Framework Presented at SIGGRAPH 2026

UniVidX, a unified multimodal framework for video generation and understanding accepted at SIGGRAPH 2026, reformulates diverse video graphics tasks as conditional generation, achieving or surpassing state‑of‑the‑art performance while demonstrating strong data efficiency and cross‑domain generalization.

Machine Heart

Recent work by HKUST MMLab and collaborators, titled UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors, has been accepted at SIGGRAPH 2026, a top computer‑graphics conference.

The paper identifies a long‑standing fragmentation in video graphics: tasks such as inverse rendering, relighting, matting, inpainting, and text‑to‑video are traditionally tackled with separate, task‑specific models, limiting knowledge sharing and adaptability to complex, multimodal real‑world scenarios.

UniVidX addresses this by unifying all video‑graphics tasks into a single multimodal conditional generation problem, enabling "any modality to any modality" modeling. In this unified space, RGB video, albedo, illumination, normals, alpha channels, and foreground/background are jointly modeled through shared generation mechanisms.

The core technical contributions are:

Random condition mask: during training the division between input and target modalities is constantly shuffled, forcing the model to learn full‑directional generation rather than fixed mappings.

Decoupled LoRA gating: separate low‑rank adaptation parameters are allocated per modality and dynamically activated when that modality serves as the generation target, preventing cross‑modality interference while preserving the pre‑trained diffusion prior.

Cross‑modal self‑attention: shared attention across modalities enforces geometric, lighting, and semantic consistency, markedly improving result coherence.
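The first two mechanisms can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the hard 0/1 gate values, and the tiny hand-written matrices are assumptions made for illustration, and the modality list is borrowed from the intrinsic setting described in this article. The sketch samples a random split of modalities into conditions and targets, then switches a per-modality low-rank update on only for the targets.

```python
import random

# Modality names are illustrative, taken from the intrinsic setting in the article.
MODALITIES = ["rgb", "albedo", "irradiance", "normal"]


def sample_condition_mask(modalities, rng):
    """Randomly split modalities into conditions (True) and targets (False)."""
    # 0 conditions corresponds to text-only generation; the upper bound
    # guarantees that at least one modality is left to generate.
    n_cond = rng.randrange(0, len(modalities))
    conditions = set(rng.sample(modalities, n_cond))
    return {m: (m in conditions) for m in modalities}


def lora_gates(mask):
    """Assumed gating rule: adapter on (1.0) for targets, off (0.0) for conditions."""
    return {m: 0.0 if is_cond else 1.0 for m, is_cond in mask.items()}


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gated_lora_linear(x, W, A, B, gate):
    """Compute y = W x + gate * B (A x) with a frozen W and a low-rank pair A, B."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + gate * d for b, d in zip(base, delta)]


# Toy weights: a frozen 2x2 base layer plus a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]    # down projection, shape 1x2
B = [[0.5], [0.5]]  # up projection, shape 2x1

rng = random.Random(0)
mask = sample_condition_mask(MODALITIES, rng)
gates = lora_gates(mask)
for m in MODALITIES:
    role = "condition" if mask[m] else "target"
    print(m, role, gated_lora_linear([1.0, 2.0], W, A, B, gates[m]))
```

With the gate at zero the layer reduces to the frozen base weight, which is one simple way to read the claim that the pre‑trained prior is preserved for modalities that are not being generated.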

Two representative models are built on the framework:

UniVid‑Intrinsic handles intrinsic properties (RGB, albedo, irradiance, normals) and supports tasks such as text‑to‑intrinsic generation, inverse rendering, forward rendering, and relighting.

UniVid‑Alpha focuses on video‑level decomposition and synthesis, modeling mixed video, foreground, background, and alpha channels to enable matting, inpainting, and background replacement.

Both models support three generation paradigms (Text→X, X→X, and Text&X→X), covering fifteen typical video tasks. Extensive quantitative evaluations show UniVidX consistently outperforms existing methods on intrinsic generation, RGBA synthesis, inverse/forward rendering, normal estimation, and video matting, with higher PSNR and SSIM and lower LPIPS, MAD, and MSE.
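The metrics listed here do not all point the same way: PSNR and SSIM are similarity scores where higher is better, while LPIPS, MAD, and MSE measure error where lower is better. The sketch below shows this on toy pixel values using the standard definitions of MSE and PSNR, with MAD read as mean absolute difference (an assumption, though it is the usual meaning in matting benchmarks). The function names and sample values are illustrative and do not come from the paper.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mad(pred, target):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    err = mse(pred, target)
    if err == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_val ** 2 / err)


# Toy "ground truth" and two predictions, values in [0, 1].
target = [0.0, 0.5, 1.0]
close = [0.0, 0.5, 0.9]
far = [0.2, 0.5, 0.6]

for name, pred in [("close", close), ("far", far)]:
    print(name, "MSE:", mse(pred, target), "MAD:", mad(pred, target),
          "PSNR:", psnr(pred, target))
```

The prediction nearer to the target gets the lower MSE and MAD and the higher PSNR, which is the sense in which a method "improves" on each metric.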

Data‑efficiency experiments reveal that UniVidX reaches or exceeds SOTA performance even with fewer than one thousand training videos, demonstrating that the framework leverages the dynamic world priors embedded in pre‑trained video diffusion models rather than relying on massive task‑specific datasets.

Cross‑domain tests on the real‑world MAW dataset show strong albedo estimation performance despite training only on synthetic data, confirming robust generalization.

Beyond individual task gains, UniVidX’s unified architecture enables flexible composition of tasks: for example, one can first perform inverse rendering to extract physical attributes and then apply text‑driven relighting or material editing, or use alpha decomposition for inpainting and background replacement. This positions UniVidX as a general‑purpose video graphics engine rather than a collection of isolated tools.
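One way to picture this composition is to describe each task as a split between condition and target modalities, then check whether one task's outputs can feed the next. The sketch below is a hypothetical bookkeeping model, not the UniVidX API: the `Task` class, the `paradigm` and `can_chain` helpers, and the exact modalities assigned to each task are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A task described only by what it is given and what it generates."""
    name: str
    uses_text: bool
    conditions: frozenset
    targets: frozenset


def paradigm(task):
    """Name the generation paradigm implied by a task's inputs."""
    if task.uses_text and task.conditions:
        return "Text&X->X"
    if task.conditions:
        return "X->X"
    return "Text->X"


def can_chain(first, second):
    """True if everything `second` needs was given to or produced by `first`."""
    available = first.conditions | first.targets
    return second.conditions <= available


# Assumed modality assignments, for illustration only.
inverse_rendering = Task(
    name="inverse rendering",
    uses_text=False,
    conditions=frozenset({"rgb"}),
    targets=frozenset({"albedo", "irradiance", "normal"}),
)
text_relighting = Task(
    name="text-driven relighting",
    uses_text=True,
    conditions=frozenset({"albedo", "normal"}),
    targets=frozenset({"rgb", "irradiance"}),
)

for task in (inverse_rendering, text_relighting):
    print(task.name, "->", paradigm(task))
print("chainable:", can_chain(inverse_rendering, text_relighting))
```

Under this view, chaining tasks is a matter of routing modalities between calls to the same model rather than wiring together separate tools.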

Overall, the work demonstrates that with a powerful pre‑trained diffusion prior and carefully designed multimodal conditioning mechanisms, traditional graphics tasks—decomposition, estimation, generation, and editing—can be integrated into a single, extensible model, paving the way for advanced applications in autonomous driving simulation, embodied AI, and film production.
