Nine Institutions Unveil Comprehensive Survey of Audio‑Visual Intelligence in the Large‑Model Era

A joint survey by nine leading research groups maps a decade of progress in audio‑visual intelligence (AVI). It presents an evolution tree, a unified taxonomy with three core strands, and six future research axes that together chart the role of AVI in large foundation models.

Machine Heart

1. Nine Institutions Release the First Audio‑Visual Large‑Model Survey

A recent paper from NUS, Oxford, Microsoft Research, and six other institutions is, to the authors' knowledge, the first systematic review dedicated to audio‑visual intelligence (AVI) in the era of large foundation models. It assembles a ten‑year development timeline and proposes a unified taxonomy with three main research strands, along with six future research axes, positioning AVI as a core capability alongside single‑modal language models.

2. A Decade‑Long Evolution Tree (2016‑2026)

The authors divide AVI progress into four eras:

Era 1 (2016‑2018) – AV Alignment: Works such as L3‑Net, AVTS, Wav2Lip, and Audio2Head focus on aligning audio and visual streams.

Era 2 (2019‑2022) – Scaled Representation: Large‑scale contrastive methods (XDC, AVID, VATT) appear, while audio‑native LLMs such as SpeechGPT, SALMONN, and Qwen‑Audio emerge.

Era 3 (2023‑2024) – AV Creation: Models such as MBT, AV‑HuBERT, Diff‑Foley, MMAudio, AudioGPT, Mini‑Omni, and NExT‑GPT push cross‑modal generation and AV controllers to the forefront.

Era 4 (2024‑2026) – Omni‑Modal Integration: ImageBind, Qwen‑Omni, JavisDiT, MovieGen, Veo‑3, Seedance 2.0, HappyHorse, GPT‑4o, OpenVLA, and Audio‑VLA bring native AV fusion, synchronized generation, and real‑time interaction to the stage.

The paper highlights six persistent bottlenecks across all eras: audio‑visual sync, temporal consistency, controllable generation, evaluation metrics, real‑time latency, and safety/governance.
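To make the first bottleneck concrete, audio‑visual sync is often framed as estimating the offset between a visual signal (for example, per‑frame mouth or motion energy) and an audio signal (for example, a loudness envelope). The sketch below is a minimal illustration using plain cross‑correlation. It assumes both signals are already sampled at the same rate; the function name `estimate_av_lag` is hypothetical and this is not a method described in the survey.

```python
def estimate_av_lag(visual, audio, max_lag):
    """Return the lag (in samples) at which `audio` best matches `visual`.

    A positive result means the audio event arrives later than the visual one.
    Assumes both sequences share one sampling rate and are non-empty.
    """
    def centered(x):
        mean = sum(x) / len(x)
        return [xi - mean for xi in x]

    v, a = centered(visual), centered(audio)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate visual[t] with audio[t + lag] over the overlapping region.
        score = 0.0
        for t, vt in enumerate(v):
            u = t + lag
            if 0 <= u < len(a):
                score += vt * a[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy example: a visual event at index 2 and its sound at index 4.
visual_energy = [0, 0, 1, 0, 0, 0, 0, 0]
audio_envelope = [0, 0, 0, 0, 1, 0, 0, 0]
lag = estimate_av_lag(visual_energy, audio_envelope, max_lag=3)
```

The score here is unnormalized, so short overlaps at large lags are slightly penalized. Learned sync models replace the hand‑built signals with embeddings, but the underlying question (which shift best aligns the two streams) is the same.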

3. Unified Taxonomy: Perception / Generation / Interaction

The taxonomy splits AVI into three strands:

Understanding the World (Perception): Classic tasks such as AV‑ASR, lip reading, active speaker detection, sound source localization, AV event understanding, cross‑modal retrieval, and AV‑QA, plus emerging long‑video reasoning with AV‑LLM.

Creating the World (Generation): Four categories (conditional generation, cross‑modal generation, joint AV generation, and AV editing) cover tasks such as video dubbing, audio‑driven video synthesis, and joint generation. The authors note that truly native AV generation is just beginning: models like MovieGen, Veo‑3, Seedance 2.0, JavisDiT, and HappyHorse still lack full cross‑identity, cross‑duration, and physically consistent sync.

Interacting with the World (Interaction): Two sub‑streams: AV dialogue (from cascaded ASR + LLM + TTS to native omni‑modal models such as GPT‑4o and Qwen‑Omni) and embodied intelligence/robotics (AV navigation, scene understanding, and manipulation via projects like SoundSpaces, AVLMaps, OpenVLA, and Audio‑VLA).
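The cascaded ASR + LLM + TTS design mentioned under AV dialogue can be sketched as plain function composition in which text is the only interface between stages. Everything below is a toy: the type aliases, `make_cascade`, and the three stand‑in stages are hypothetical names introduced for illustration, not APIs of any real system.

```python
from typing import Callable

# Hypothetical stage signatures: each stage sees only the previous stage's output.
ASR = Callable[[bytes], str]   # audio in, transcript out
LLM = Callable[[str], str]     # text in, text out
TTS = Callable[[str], bytes]   # text in, audio out


def make_cascade(asr: ASR, llm: LLM, tts: TTS) -> Callable[[bytes], bytes]:
    """Chain the three stages into one audio-to-audio responder."""
    def respond(audio_in: bytes) -> bytes:
        transcript = asr(audio_in)    # anything not captured as text stops here
        reply_text = llm(transcript)  # the LLM never sees the raw audio
        return tts(reply_text)
    return respond


# Toy stand-ins so the sketch runs; real systems would plug in actual models.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the bytes are already words


def toy_llm(text: str) -> str:
    return f"You said: {text}"


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


cascade = make_cascade(toy_asr, toy_llm, toy_tts)
reply = cascade(b"hello")
```

A native omni‑modal model, by contrast, would take the audio (and video) directly as input to one backbone instead of routing everything through a transcript.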

4. Foundational Technologies: Representation, Generation, LLM‑Centric

Technical foundations are organized into three blocks:

Representation: Audio/visual feature extraction, VAE‑based compression, tokenization, and cross‑modal alignment. The key question shifts from “do the features align?” to “which tokenization best injects AV signals into LLMs?”

Generation: The survey enumerates five generation paradigms: VAE, GAN, Diffusion, Autoregressive (AR), and Masked Autoregressive (MAR). It details their capability boundaries, the evolution of diffusion and flow matching, AR advances for audio and vision, and hybrid AR + Diffusion directions.

LLM‑Centric Paradigms: Typical architectures are classified as Encoder + LLM, LLM + Generator, unified Encoder + LLM + Decoder, and agentic/VLA systems, providing a quick reference for engineering teams building “AV‑GPT‑4o‑style” backbones.
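The cross‑modal alignment listed under Representation is commonly trained with a contrastive objective. The sketch below shows a symmetric InfoNCE loss over a batch of paired audio and video embeddings, written in plain Python for readability. It is an illustrative assumption about how such alignment can be set up, not the survey's own formulation; `symmetric_info_nce` and the temperature default are choices made here.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(audio_emb, video_emb, temperature=0.07):
    """Contrastive loss: audio_emb[i] and video_emb[i] form the positive pair."""
    a = [_normalize(v) for v in audio_emb]
    v = [_normalize(u) for u in video_emb]
    # Cosine similarity between every audio and every video item in the batch.
    logits = [[sum(x * y for x, y in zip(ai, vj)) / temperature for vj in v]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


audio = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[0.9, 0.1], [0.1, 0.9]]
video_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = symmetric_info_nce(audio, video_aligned)
loss_swapped = symmetric_info_nce(audio, video_swapped)
```

The loss is low when each clip's audio is closest to its own video and high when the pairing is scrambled, which is the signal that pulls the two embedding spaces together.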

5. Application Landscape: Six Major Directions

The paper groups downstream uses of AVI into six categories:

AIGC & Creative Content: Video dubbing, Foley synthesis, cross‑language lip sync, music generation, and end‑to‑end native‑audio video models.

Digital Humans & Social Interaction: From 2D lip‑sync (Wav2Lip) to 3D neural rendering (GaussianTalker) and high‑fidelity full‑body avatars (EmoGene, EMAGE, Stereo‑Talker).

Human‑Centric Services: Audio‑LLMs (Qwen‑Audio, SALMONN) powering conversational assistants, transcription, AI tutoring, and accessibility tools.

Immersive & Metaverse Experiences: Spatial audio reasoning, AV‑NeRF, AVLMaps, and sub‑20 ms latency constraints.

Embodied AI & Robotics: AV navigation, AV scene understanding, and manipulation via projects such as SoundSpaces, OpenVLA, π0, GR00T, and SmolVLA.

Ubiquitous Perception & Safety Governance: Smart‑city, industrial IoT, deep‑fake detection, acoustic anomaly detection, watermarking, data compliance, privacy, and edge deployment.

6. Six Future Research Axes (2024‑2026+)

The authors propose six structural capabilities that differentiate AVI from generic multimodal learning:

Causal Event – Sound Source Grounding: Modeling delay, occlusion, background noise, and multi‑source mixing to achieve causal audio‑visual alignment.

AV World Model: Treating audio‑visual streams as complementary evidence for geometry, material, dynamics, affordance, and social state, with spatial audio reasoning as a core skill.

Long‑Range AV Context Memory: Building streaming, scenario‑aware, semantic multi‑layer memory beyond simple context‑window extension.

Causal AV Intervention & Controllable Generation: Enabling localized, causal, synchronized edits of objects, sounds, identities, emotions, space, and time.

Verifier & Reward Ecosystem: Moving beyond proxy metrics (FAD, FVD, CLIP, SyncNet) toward evaluators that measure grounding, physical plausibility, audio indispensability, long‑term consistency, and task utility.

Interactive & Responsible AV: Ensuring low‑latency, privacy‑preserving, copyright‑aware, watermark‑enabled, and compliant real‑time AV collaborators.
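For reference, FAD and FVD, two of the proxy metrics named above, share one underlying formula: a Fréchet distance between Gaussians fitted to embeddings of real and generated samples, with the embeddings taken from a pretrained audio or video network. The symbols below are notation introduced here, not taken from the survey: $\mu_r, \Sigma_r$ are the mean and covariance of the real‑sample embeddings, and $\mu_g, \Sigma_g$ those of the generated samples.

```latex
\[
d^2 \;=\; \lVert \mu_r - \mu_g \rVert_2^2
\;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```

Because the score depends only on embedding statistics pooled over a whole set, it cannot tell whether a particular sound is grounded in a particular visual event, which fits the survey's call for evaluators that measure grounding and physical plausibility directly.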

7. Implications for the Industry

The survey provides a unified coordinate system for AVI research, helping practitioners locate their work within the broader landscape and identify adjacent technology stacks. It argues that the next competitive frontier for omni‑modal models lies not in basic hearing or seeing, but in unified backbones that support long‑range AV reasoning, native sync generation, and real‑time closed‑loop interaction. The authors also note a shift in evaluation: future benchmarks will combine task utility, physical realism, and safety traceability, moving away from purely subjective or proxy scores.

Overall, the paper serves as a comprehensive reference for teams building AV large models, omni‑modal systems, video generation pipelines, digital humans, embodied agents, spatial audio, or deep‑fake detection.
