Generative Multimodal Pretraining (OFA) and Representational Multimodal Pretraining (ONE-PEACE): Research Overview and Findings

This article reviews Tongyi Lab's work on the OFA framework for generative multimodal pretraining and the ONE-PEACE model for unified multimodal representation learning, detailing their architectures, training strategies, experimental results across vision‑language and audio tasks, and future research directions.

DataFunSummit

The article introduces Tongyi Lab's research on general multimodal large models and their integration with cutting‑edge large language models, outlining the motivation for unified multimodal pretraining.

OFA (One For All) is presented as a generative multimodal pretraining framework that consolidates tasks such as image captioning, visual question answering, visual grounding, and text‑to‑image generation into a single sequence‑to‑sequence model built on a Transformer backbone, emphasizing task‑agnostic and modality‑agnostic design.
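To make the idea of folding several tasks into one sequence-to-sequence interface more concrete, here is a minimal sketch of how each task could be phrased as an instruction for the encoder, with the decoder emitting a different kind of token sequence per task. The template wording, the `Seq2SeqExample` class, and the `build_example` helper are illustrative assumptions, not OFA's actual prompts or code.

```python
from dataclasses import dataclass

# Illustrative instruction templates (assumed wording, not OFA's exact prompts).
TEMPLATES = {
    "caption": "what does the image describe?",
    "vqa": "{question}",
    "grounding": 'which region does the text "{text}" describe?',
    "text_to_image": "what is the complete image? caption: {text}",
}

# What the decoder is expected to emit for each task (hypothetical labels).
TARGET_KINDS = {
    "caption": "text tokens",
    "vqa": "text tokens",
    "grounding": "location tokens",
    "text_to_image": "image code tokens",
}


@dataclass
class Seq2SeqExample:
    source: str       # instruction text fed to the encoder
    has_image: bool   # whether image patches accompany the instruction
    target_kind: str  # vocabulary region the decoder output is drawn from


def build_example(task: str, **fields: str) -> Seq2SeqExample:
    """Turn a task name plus its fields into one encoder-side example."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    source = TEMPLATES[task].format(**fields)
    # Text-to-image generation is the only task here without an input image.
    return Seq2SeqExample(source, task != "text_to_image", TARGET_KINDS[task])


if __name__ == "__main__":
    print(build_example("grounding", text="a red car"))
    print(build_example("vqa", question="how many dogs are there?"))
```

Because every task shares the same input and output types, a single Transformer can be trained on a mixture of them without task-specific heads, which is the sense in which the design is task-agnostic.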

Key engineering details include tokenizing visual inputs and bounding‑box coordinates, using single‑stream (e.g., UNITER) and dual‑stream (e.g., ViLBERT, LXMERT) strategies, and applying stability‑focused modifications such as LayerNorm variants, sandwich LN, and Magneto.
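The bounding-box tokenization mentioned above can be sketched as a simple quantization step: each corner coordinate is normalized by the image size and mapped to one of a fixed number of bins, and each bin becomes a vocabulary token. The function names, the `<bin_i>` token format, and the default of 1000 bins are assumptions made for illustration rather than details confirmed by this article.

```python
def box_to_tokens(box, image_width, image_height, num_bins=1000):
    """Quantize an (x0, y0, x1, y1) pixel box into four location tokens."""
    x0, y0, x1, y1 = box

    def bin_index(value, size):
        # Normalize to [0, 1], clamp, then map to an integer bin.
        ratio = min(max(value / size, 0.0), 1.0)
        return min(int(ratio * num_bins), num_bins - 1)

    indices = [
        bin_index(x0, image_width),
        bin_index(y0, image_height),
        bin_index(x1, image_width),
        bin_index(y1, image_height),
    ]
    return [f"<bin_{i}>" for i in indices]


def tokens_to_box(tokens, image_width, image_height, num_bins=1000):
    """Map four location tokens back to approximate pixel coordinates."""
    indices = [int(t[len("<bin_"):-1]) for t in tokens]
    sizes = [image_width, image_height, image_width, image_height]
    # Use the center of each bin as the reconstructed coordinate.
    return [(i + 0.5) / num_bins * s for i, s in zip(indices, sizes)]


if __name__ == "__main__":
    tokens = box_to_tokens((0, 0, 320, 240), 640, 480)
    print(tokens)
    print(tokens_to_box(tokens, 640, 480))
```

The round trip loses at most one bin width of precision per coordinate, which is the trade-off for letting the decoder predict box corners with the same softmax it uses for words.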

Experimental results show OFA‑Tiny, OFA‑Base, and OFA‑Large achieving competitive scores on VQA, image captioning (CIDEr), visual grounding (RefCOCO), and classification, often matching or surpassing prior state‑of‑the‑art models while using fewer parameters.

ONE‑PEACE is introduced as a representational multimodal pretraining model that learns a unified representation for vision, language, and audio, inspired by Data2Vec and CLIP, and employs contrastive and feature‑distillation losses to bind modalities.
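As a rough illustration of the contrastive part of this objective, the sketch below computes a CLIP-style symmetric contrastive loss over a batch of paired embeddings from two modalities. It uses only the standard library for clarity; the function names and the temperature of 0.07 are assumptions, the inputs are assumed to be non-zero vectors, and the feature-distillation loss is not shown.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(a_batch, b_batch, temperature=0.07):
    """Symmetric contrastive loss: row i of a_batch is paired with row i of b_batch."""
    a = [l2_normalize(v) for v in a_batch]
    b = [l2_normalize(v) for v in b_batch]
    n = len(a)
    # Cosine similarity of every cross-modal pair, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Each item must pick out its own partner among all items of the other modality.
    a_to_b = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    b_to_a = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (a_to_b + b_to_a)


if __name__ == "__main__":
    image_emb = [[1.0, 0.0], [0.0, 1.0]]
    text_emb = [[0.9, 0.1], [0.2, 0.8]]
    print(contrastive_loss(image_emb, text_emb))
```

Minimizing this loss pulls matched pairs together and pushes mismatched pairs apart in the shared space, which is what allows embeddings from different modalities to be compared directly.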

The model demonstrates strong zero‑shot performance across vision‑language and audio benchmarks and serves as a robust backbone for downstream tasks.

Future work focuses on using large language models as multimodal engines, connecting various modalities to LLMs (e.g., Kosmos, Mini‑GPT‑4), and enabling tool‑calling for tasks such as image generation.

The article concludes by summarizing the contributions of OFA and ONE‑PEACE and outlining three main directions for upcoming research.

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
