Feb 8, 2025 · Artificial Intelligence

Introducing Ola: A Full‑Modal Language Model from Tsinghua & Tencent that Unifies Image, Video, and Audio Understanding

The article presents Ola, an open‑source full‑modal LLM that uses progressive modality alignment to jointly process text, images, video, and audio. Ola performs competitively on image, video, and audio benchmarks and surpasses many specialized models.

Benchmark · Large Language Model · Multimodal

