Generative Multimodal Pretraining (OFA) and Representational Multimodal Pretraining (ONE-PEACE): Research Overview and Findings
This article reviews Tongyi Lab's work on the OFA framework for generative multimodal pretraining and the ONE-PEACE model for unified multimodal representation learning, detailing their architectures, training strategies, experimental results across vision‑language and audio tasks, and future research directions.
The article introduces Tongyi Lab's research on general multimodal large models and their integration with cutting‑edge large language models, outlining the motivation for unified multimodal pretraining.
OFA (One For All) is presented as a generative multimodal pretraining framework that consolidates tasks such as image captioning, visual question answering, visual grounding, and text‑to‑image generation into a single sequence‑to‑sequence model built on a Transformer backbone, emphasizing task‑agnostic and modality‑agnostic design.
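The core idea of casting every task as sequence-to-sequence can be sketched as a simple formatting function. This is a hypothetical illustration, not OFA's exact prompt templates; the task names, prompt strings, and the `<image>` placeholder are assumptions for the sketch.

```python
# Sketch of unified task formatting: every task becomes a
# (source sequence, target sequence) pair consumed by one
# seq2seq Transformer. Prompt wording is illustrative only.
def to_seq2seq(task: str, sample: dict) -> tuple[str, str]:
    if task == "caption":
        return ("what does the image describe? <image>", sample["caption"])
    if task == "vqa":
        return (f"{sample['question']} <image>", sample["answer"])
    if task == "grounding":
        # Target is a sequence of discrete location tokens for the box.
        return (f'which region does the text "{sample["text"]}" describe? <image>',
                sample["region_tokens"])
    raise ValueError(f"unknown task: {task}")
```

Because every task shares this interface, no task-specific heads are needed; the model simply learns to emit the right target sequence.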
Key engineering details include tokenizing visual inputs and bounding‑box coordinates into discrete sequences, a contrast with earlier single‑stream (e.g., UNITER) and dual‑stream (e.g., ViLBERT, LXMERT) pretraining architectures discussed as background, and stability‑focused modifications such as LayerNorm placement variants (e.g., sandwich LN) and Magneto.
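Coordinate tokenization can be sketched as quantizing pixel coordinates into a fixed number of bins, each mapped to a special vocabulary token. A minimal sketch, assuming a bin count of 1000 and the token naming `<bin_N>` (both illustrative, not necessarily OFA's exact values):

```python
def bbox_to_tokens(box, img_w, img_h, num_bins=1000):
    """Quantize a (x0, y0, x1, y1) pixel box into discrete location
    tokens so a seq2seq decoder can emit boxes as text. The bin count
    and token format are assumptions for this sketch."""
    x0, y0, x1, y1 = box

    def bin_of(value, size):
        # Map a coordinate in [0, size] to an integer bin in [0, num_bins).
        return min(int(value / size * num_bins), num_bins - 1)

    return [f"<bin_{bin_of(x0, img_w)}>", f"<bin_{bin_of(y0, img_h)}>",
            f"<bin_{bin_of(x1, img_w)}>", f"<bin_{bin_of(y1, img_h)}>"]
```

With this scheme, visual grounding and text‑to‑image layout tasks reduce to ordinary token prediction over an extended vocabulary.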
Experimental results show OFA‑Tiny, OFA‑Base, and OFA‑Large achieving competitive scores on VQA, image captioning (CIDEr), visual grounding (RefCOCO), and classification, often matching or surpassing prior state‑of‑the‑art models while using fewer parameters.
ONE‑PEACE is introduced as a representational multimodal pretraining model that learns a unified representation for vision, language, and audio, inspired by Data2Vec and CLIP, and employs contrastive and feature‑distillation losses to bind modalities.
The model demonstrates strong zero‑shot performance across vision‑language and audio benchmarks and serves as a robust backbone for downstream tasks.
Future work focuses on using large language models as multimodal engines, connecting various modalities to LLMs (e.g., Kosmos, Mini‑GPT‑4), and enabling tool‑calling for tasks such as image generation.
The article concludes by summarizing the contributions of OFA and ONE‑PEACE and outlining three main directions for upcoming research.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.