
Generative Multimodal Pretraining (OFA) and Representational Multimodal Pretraining (ONE-PEACE): Research Overview and Findings

This article reviews Tongyi Lab's work on the OFA framework for generative multimodal pretraining and the ONE-PEACE model for unified multimodal representation learning, detailing their architectures, training strategies, experimental results across vision‑language and audio tasks, and future research directions.

DataFunSummit

Mar 27, 2024


The article introduces Tongyi Lab's research on general multimodal large models and their integration with cutting‑edge large language models, outlining the motivation for unified multimodal pretraining.

OFA (One For All) is presented as a generative multimodal pretraining framework that consolidates tasks such as image captioning, visual question answering, visual grounding, and text‑to‑image generation into a single sequence‑to‑sequence model built on a Transformer backbone, emphasizing task‑agnostic and modality‑agnostic design.
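To make the idea of folding several tasks into one sequence-to-sequence interface more concrete, here is a minimal sketch of how heterogeneous tasks might be cast into a shared instruction-and-target format. The `Seq2SeqExample` class, the `build_example` helper, and the instruction templates are all hypothetical illustrations, not OFA's actual data pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Seq2SeqExample:
    """One training example in a unified text-in, text-out format."""
    instruction: str            # the task is selected by wording, not by a task-specific head
    image_path: Optional[str]   # None for text-only inputs
    target: str                 # everything the model must produce, as one sequence


def build_example(task: str, image_path: Optional[str], **fields: str) -> Seq2SeqExample:
    """Cast a task into the shared format. Templates are hypothetical."""
    templates = {
        "caption": ("What does the image describe?", "{caption}"),
        "vqa": ("{question}", "{answer}"),
        "grounding": ('Which region does the text "{phrase}" describe?', "{region_tokens}"),
    }
    if task not in templates:
        raise ValueError(f"unknown task: {task}")
    source_tpl, target_tpl = templates[task]
    return Seq2SeqExample(
        instruction=source_tpl.format(**fields),
        image_path=image_path,
        target=target_tpl.format(**fields),
    )


example = build_example(
    "vqa", "img.jpg", question="What color is the car?", answer="red"
)
```

Because every task reduces to the same `(instruction, image, target)` triple, a single encoder-decoder can be trained on all of them with one loss, which is the sense in which such a design is task-agnostic.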

Key engineering details include tokenizing visual inputs and bounding‑box coordinates, using single‑stream (e.g., UNITER) and dual‑stream (e.g., ViLBERT, LXMERT) strategies, and applying stability‑focused LayerNorm modifications such as Sandwich LN and Magneto.
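Tokenizing bounding-box coordinates can be illustrated with a short sketch: each continuous coordinate is normalized by the image size and quantized into one of a fixed number of bins, so a box becomes four discrete tokens that share a vocabulary with text. The function names, the `<bin_k>` token spelling, and the default of 1000 bins are assumptions made for illustration, not confirmed details of the OFA implementation.

```python
from typing import List, Sequence, Tuple


def box_to_tokens(
    box: Sequence[float], image_size: Tuple[int, int], num_bins: int = 1000
) -> List[str]:
    """Quantize (x0, y0, x1, y1) pixel coordinates into discrete location tokens."""
    x0, y0, x1, y1 = box
    width, height = image_size
    tokens = []
    for value, extent in ((x0, width), (y0, height), (x1, width), (y1, height)):
        frac = min(max(value / extent, 0.0), 1.0)        # normalize and clamp to [0, 1]
        index = min(int(frac * num_bins), num_bins - 1)  # a coordinate on the far edge maps to the last bin
        tokens.append(f"<bin_{index}>")
    return tokens


def tokens_to_box(
    tokens: Sequence[str], image_size: Tuple[int, int], num_bins: int = 1000
) -> Tuple[float, ...]:
    """Map location tokens back to pixel coordinates, using each bin's center."""
    width, height = image_size
    indices = [int(token[len("<bin_"):-1]) for token in tokens]
    extents = (width, height, width, height)
    return tuple((i + 0.5) / num_bins * e for i, e in zip(indices, extents))


tokens = box_to_tokens((50, 100, 200, 400), image_size=(400, 800))
recovered = tokens_to_box(tokens, image_size=(400, 800))
```

The round trip is lossy by at most one bin width per coordinate, which is the price paid for letting the decoder emit box coordinates with the same softmax it uses for words.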

Experimental results show OFA‑Tiny, OFA‑Base, and OFA‑Large achieving competitive scores on VQA, image captioning (CIDEr), visual grounding (RefCOCO), and classification, often matching or surpassing prior state‑of‑the‑art models while using fewer parameters.

ONE‑PEACE is introduced as a representational multimodal pretraining model that learns a unified representation for vision, language, and audio, inspired by Data2Vec and CLIP, and employs contrastive and feature‑distillation losses to bind modalities.
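The contrastive part of this objective can be sketched as a CLIP-style symmetric loss over a batch of paired embeddings from two modalities: matched pairs are pulled together and mismatched pairs pushed apart. This is a generic illustration in plain Python, not ONE-PEACE's training code. The function names and the temperature of 0.07 are assumptions, and the feature-distillation loss is not covered.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(
    emb_a: Sequence[Sequence[float]],
    emb_b: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, chosen for illustration
) -> float:
    """Symmetric contrastive loss; emb_a[i] and emb_b[i] form a matched pair."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity of every a-item with every b-item, scaled by temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(column) for column in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairing yields a loss near zero while the shuffled pairing yields a large one, which is the signal that drives embeddings of different modalities into a shared space.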

The model demonstrates strong zero‑shot performance across vision‑language and audio benchmarks and serves as a robust backbone for downstream tasks.

Future work focuses on using large language models as multimodal engines, connecting various modalities to LLMs (e.g., Kosmos, MiniGPT‑4), and enabling tool‑calling for tasks such as image generation.

The article concludes by summarizing the contributions of OFA and ONE‑PEACE and outlining three main directions for upcoming research.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

