Introducing Ola: A Full‑Modal Language Model from Tsinghua & Tencent that Unifies Image, Video, and Audio Understanding

The article presents Ola, an open‑source full‑modal LLM that uses progressive modality alignment to jointly process text, images, video, and audio, and demonstrates competitive performance across image, video, and audio benchmarks, surpassing many specialized models.

AIWalker

Feb 8, 2025

Overview

Multimodal large language models have attracted attention for their ability to follow instructions and handle complex inputs such as text, images, video, and audio. Recent successes of GPT‑4o and Gemini motivate research toward a single model that can understand all modalities.

Introduction

The paper proposes Ola, a full‑modal language model that achieves performance on image, video, and audio tasks competitive with models dedicated to a single modality. The core design is a progressive modality‑alignment strategy that gradually expands the set of modalities the language model supports.

Model and Method

Ola’s architecture consists of modality‑specific encoders that embed text, image, video, and audio inputs into a unified token stream processed by a large language model (LLM). The LLM generates text tokens, and a speech decoder (CosyVoice) provides streaming audio output.

For visual inputs, Ola adopts a visual encoder based on OryxViT (initialized from SigLIP‑400M) that preserves the original aspect ratio of each image or video frame. For audio, it uses a dual‑encoder approach: Whisper‑v3 encodes speech and BEATs encodes music, and the two feature sequences are concatenated along the channel dimension. Text tokens are processed with the tokenizer and embedding layer of the underlying LLM.
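The channel‑wise concatenation of the two audio encoders can be illustrated with a minimal sketch. It assumes both encoders emit the same number of time steps; the feature sizes and the function name concat_channels are made up for illustration and are not taken from Ola's code.

```python
from typing import List

Frame = List[float]  # one feature vector per time step


def concat_channels(speech: List[Frame], music: List[Frame]) -> List[Frame]:
    """Concatenate two time-aligned feature sequences along the channel axis."""
    if len(speech) != len(music):
        raise ValueError("feature sequences must have the same number of time steps")
    # For each time step, append the music channels after the speech channels.
    return [s + m for s, m in zip(speech, music)]


# Toy features: 3 time steps, 4 speech channels, 2 music channels (made-up sizes).
speech_feats = [[0.1] * 4 for _ in range(3)]
music_feats = [[0.2] * 2 for _ in range(3)]
audio_feats = concat_channels(speech_feats, music_feats)
print(len(audio_feats), len(audio_feats[0]))
```

The time axis is unchanged and the channel width is the sum of the two encoders' widths, so a downstream projector sees a single audio feature per time step.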

The joint alignment module maps modality‑specific features into the text embedding space. A local‑global attention pooling layer downsamples visual features with minimal information loss, and a two‑layer MLP projects all modality embeddings to a shared token space. Special markers indicate the start, separator, newline, and end of each modality.
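To make the idea of attention‑based downsampling concrete, here is a simplified sketch of pooling 2×2 windows with softmax weights. It is not Ola's layer: scoring each local vector by its dot product with the window mean (used here as the "global" summary) is an assumption of this example, as are the names and grid sizes.

```python
import math
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # H x W x C feature map


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool_2x2(grid: Grid) -> Grid:
    """Halve height and width; each 2x2 window becomes a weighted sum of its vectors."""
    h, w = len(grid), len(grid[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    channels = len(grid[0][0])
    pooled: Grid = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            window = [grid[i + di][j + dj] for di in (0, 1) for dj in (0, 1)]
            # "Global" summary of the window: the plain average of its four vectors.
            mean = [sum(v[c] for v in window) / 4 for c in range(channels)]
            # Assumed scoring rule: similarity of each local vector to the summary.
            scores = [sum(a * b for a, b in zip(v, mean)) for v in window]
            weights = softmax(scores)
            row.append([sum(wt * v[c] for wt, v in zip(weights, window))
                        for c in range(channels)])
        pooled.append(row)
    return pooled


# 4 x 6 grid of 3-channel features with arbitrary values.
grid = [[[float(i + j + c) for c in range(3)] for j in range(6)] for i in range(4)]
pooled = attention_pool_2x2(grid)
print(len(pooled), len(pooled[0]), len(pooled[0][0]))
```

Unlike plain average pooling, the weights depend on the content of each window, which is one way a pooling layer can keep more of the informative local features while still reducing the token count by a factor of four.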

Progressive Full‑Modal Alignment

The authors identify two key challenges in full‑modal training: modality balance and the audio‑visual connection. Directly mixing all modalities harms benchmark performance (see Fig. 3). Therefore, a three‑stage progressive training schedule is adopted:

Stage 1 – Text‑Image: Starting from a pre‑trained LLM (Qwen2.5‑7B), train on image‑text data through MLP alignment, large‑scale pre‑training, and supervised fine‑tuning. The downsampling modules are fully trained so that they learn to compress visual data.

Stage 2 – Image‑Video: Freeze the visual encoder, which was already trained in Stage 1, and continue fine‑tuning on mixed image and video data so that text‑image performance is preserved.

Stage 3 – Video‑Audio: Introduce audio tasks. The visual and audio MLP adapters are trained, and audio‑video data are added to bridge the audio‑visual gap. The model learns to recognize audio and to associate visual and audio information via video.

Video is treated as the bridge between audio and vision because each video frame naturally aligns with its accompanying audio.
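The three stages above can be written down as a small schedule that tracks which input modalities the model has seen after each stage. This is only an illustrative sketch: the Stage class, the data labels, and the helper modalities_after are hypothetical, and each stage lists only the data mixtures named in this article (the actual recipe may include more).

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Input modalities exercised by each kind of training data (illustrative labels).
DATA_MODALITIES = {
    "image-text": frozenset({"text", "image"}),
    "video": frozenset({"text", "video"}),
    "audio": frozenset({"text", "audio"}),
    "video-audio": frozenset({"text", "video", "audio"}),
}


@dataclass(frozen=True)
class Stage:
    name: str
    data: Tuple[str, ...]  # data mixtures the article names for this stage
    note: str


SCHEDULE: List[Stage] = [
    Stage("text-image", ("image-text",),
          "MLP alignment, pre-training, supervised fine-tuning"),
    Stage("image-video", ("image-text", "video"),
          "visual encoder frozen"),
    Stage("video-audio", ("audio", "video-audio"),
          "visual and audio MLP adapters trained"),
]


def modalities_after(stage_index: int) -> FrozenSet[str]:
    """Union of input modalities seen up to and including the given stage."""
    seen = set()
    for stage in SCHEDULE[: stage_index + 1]:
        for kind in stage.data:
            seen |= DATA_MODALITIES[kind]
    return frozenset(seen)


for i, stage in enumerate(SCHEDULE):
    print(f"{stage.name}: {sorted(modalities_after(i))} ({stage.note})")
```

Each stage adds exactly one new modality, and the last stage is the only one whose data pairs audio with vision, which matches the role the article assigns to video as the bridge.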

Data

Training data are collected from open academic datasets across image, video, and audio domains, plus a newly generated cross‑modal video‑audio dataset.

Image data: 800 k image‑text pairs from LAION, plus ~7.3 M image pairs from LLaVA‑OneVision, Cauldron, Cambrian‑1, MAmmoTH‑VL, PixMo, etc.

Video data: 1.9 M video‑dialogue clips from LLaVA‑Video‑178K, VideoChatGPT‑Plus, LLaVA‑Hound, and Cinepile, from which 1.2 M high‑quality video‑language pairs are sampled for training.

Audio data: 1.1 M samples covering ASR (LibriSpeech, GigaSpeech), audio‑text description (AudioCaps, Clotho), speech QA, music‑text description (MusicCaps, MillionSong, MusicNet), and audio QA (WavCaps, AudioCaps).

To create cross‑modal video‑audio data, the authors generate subtitles for videos using Whisper‑v3, clean them with language‑model filtering, and then use Qwen2‑VL‑72B to generate QA pairs, producing three QA pairs per video. This yields ~324 k video‑audio samples, which are mixed with the pure audio data for Stage 3.
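The data‑generation steps just described (transcribe, filter, generate QA) can be sketched as a small pipeline. Everything here is a hypothetical stand‑in: the three callables represent Whisper‑v3 transcription, language‑model filtering, and Qwen2‑VL‑72B question generation, none of which is actually called, and the field names are invented for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def build_video_audio_samples(
    videos: Iterable[str],
    transcribe: Callable[[str], str],                  # stand-in for subtitle generation
    is_clean: Callable[[str], bool],                   # stand-in for language-model filtering
    generate_qa: Callable[[str, str], List[QAPair]],   # stand-in for QA generation
    pairs_per_video: int = 3,
) -> List[Dict[str, str]]:
    """Turn raw videos into QA samples grounded in both the video and its subtitles."""
    samples: List[Dict[str, str]] = []
    for video in videos:
        subtitle = transcribe(video)
        if not is_clean(subtitle):
            continue  # drop videos whose subtitles fail the quality filter
        for question, answer in generate_qa(video, subtitle)[:pairs_per_video]:
            samples.append({"video": video, "subtitle": subtitle,
                            "question": question, "answer": answer})
    return samples


# Dummy stand-ins so the sketch runs without any model.
fake_subtitles = {"a.mp4": "a person explains how to bake bread", "b.mp4": ""}
samples = build_video_audio_samples(
    videos=["a.mp4", "b.mp4"],
    transcribe=lambda v: fake_subtitles[v],
    is_clean=lambda s: len(s.split()) >= 3,
    generate_qa=lambda v, s: [(f"Question {k} about {v}?", f"Answer {k}") for k in range(5)],
)
print(len(samples))
```

Passing the model calls in as functions keeps the pipeline logic testable on its own; with the dummy inputs, the video with an empty subtitle is filtered out and the remaining one contributes three samples.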

Experiments and Results

Comprehensive benchmarks evaluate Ola on image, video, and audio understanding.

Image benchmarks: MMBench‑1.1 (84.3 %), MMStar (70.8 %), MMMU (57.0 %), MathVista (68.4 %), AI2D (86.1 %), OCRBench (827).

Video benchmarks: VideoMME (68.4 %), the leading result among 7 B models. Ola also performs strongly on LongVideoBench and MVBench.

Audio benchmarks: LibriSpeech WER 3.1 %; AIR‑Bench average score 6.41, outperforming existing open‑source full‑modal models and approaching specialized audio models.
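For readers unfamiliar with the WER figure above: word error rate is the word‑level edit distance between the recognized text and the reference transcript, divided by the number of reference words (lower is better). A minimal sketch of that standard definition follows; it is not the evaluation script used for Ola and omits the text normalization that benchmarks usually apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.3f}")
```

A WER of 3.1 % therefore corresponds to roughly three word errors per hundred reference words.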

Baseline comparisons include image‑centric models (Cambrian‑1, Pixtral‑12B), video‑centric models (VideoCCAM, LLaVA‑Video), and integrated multimodal models (LLaVA‑OneVision, MiniCPM‑V 2.6, InternVL2.5, Qwen2.5‑VL). For audio, SALMONN and Qwen2‑Audio are used as references.

Analysis

Ablation studies compare three training pipelines: (1) direct mixing of all modalities, (2) balanced sampling, and (3) the proposed progressive alignment. Progressive alignment yields the highest scores across all modalities, confirming the importance of modality‑balanced curricula.

Audio‑only experiments (removing video‑audio data) show a performance drop, highlighting the benefit of joint video‑audio learning despite distribution differences.

Conclusion

Ola demonstrates that a progressive modality‑alignment strategy can produce a powerful, fully open‑source full‑modal LLM capable of competitive image, video, and audio understanding. The architecture, streaming speech decoder, and high‑quality cross‑modal video‑audio data together enable a versatile model that may inspire future research toward more general AI systems.
