UNITER: Unified Image‑Text Representation Learning for Vision‑Language Tasks
This article introduces UNITER, a unified image‑text representation learning framework. It covers the four large multimodal pretraining datasets, the three pretraining objectives (MLM, ITM, MRM), the model architecture, training optimizations, and evaluation on six vision‑language downstream tasks, where UNITER achieves state‑of‑the‑art results.
UNITER is a general-purpose image‑text representation learning model that is pretrained on four large multimodal datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions). By jointly processing image regions and text tokens, it learns a unified embedding that can be fine‑tuned for a variety of vision‑language (V+L) downstream tasks.
The pretraining employs three self‑supervised objectives: Masked Language Modeling (MLM), where randomly masked text tokens are predicted from the surrounding tokens and the image regions; Image‑Text Matching (ITM), which classifies whether an image‑text pair is matched or mismatched; and Masked Region Modeling (MRM), with three variants that mask image regions and predict either their visual features (MRFR, Masked Region Feature Regression), their object class labels (MRC, Masked Region Classification), or the detector's class distribution (MRC‑kl, classification with KL‑divergence).
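As a rough illustration, the token-masking step behind the MLM objective can be sketched as follows. This is a simplified BERT‑style scheme, not the authors' code: it always replaces a selected token with `[MASK]`, whereas the full recipe sometimes keeps the token or substitutes a random one.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Simplified BERT-style masking for an MLM objective (sketch).

    Returns the corrupted token sequence and the prediction targets:
    targets[i] is the original token where position i was masked,
    and None elsewhere (no loss at unmasked positions).
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)   # hide the token from the model
            targets.append(tok)            # model must recover the original
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets
```

In UNITER the prediction at each masked position is conditioned on both the remaining text tokens and all image regions, which is what makes the objective cross‑modal rather than text‑only.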
UNITER’s architecture consists of an Image Embedder that combines region and location features, a Text Embedder that adds token and positional embeddings, and a shared Transformer that fuses the two modalities into a joint representation.
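For concreteness, the location part of the Image Embedder can be sketched as below. The UNITER paper encodes each region's bounding box as a 7‑dimensional vector (normalized corners, width, height, and area); this vector and the region's visual feature are each linearly projected and summed to form the region embedding. The helper here is illustrative, not the authors' implementation.

```python
def location_feature(box, img_w, img_h):
    """7-d normalized location feature for one detected region:
    [x1, y1, x2, y2, w, h, w*h], all relative to the image size."""
    x1, y1, x2, y2 = box
    nx1, ny1 = x1 / img_w, y1 / img_h   # normalized top-left corner
    nx2, ny2 = x2 / img_w, y2 / img_h   # normalized bottom-right corner
    w, h = nx2 - nx1, ny2 - ny1         # normalized width and height
    return [nx1, ny1, nx2, ny2, w, h, w * h]
```

The text side mirrors BERT: token embeddings plus position embeddings, so the shared Transformer receives one joint sequence of region embeddings and token embeddings.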
To accelerate training, three techniques are applied: Dynamic Batching (grouping samples of similar length to reduce wasted computation on padding), Gradient Accumulation (accumulating gradients over several steps to reduce inter‑GPU communication), and Mixed‑Precision Training (using 16‑bit floats to cut memory use, which in turn allows larger batches).
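A minimal sketch of the dynamic‑batching idea (a hypothetical helper, not the authors' code): sort samples by length, then cap each batch by its total padded token count rather than by a fixed number of samples, so short sequences pack densely and long ones do not blow up the padded size.

```python
def dynamic_batches(lengths, max_tokens):
    """Group sample indices into batches whose padded size
    (batch_size * longest_sample) stays under max_tokens."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, cur = [], []
    for i in order:
        # lengths arrive in ascending order, so sample i is the longest
        # in the batch; its length sets the padded width.
        padded = (len(cur) + 1) * lengths[i]
        if cur and padded > max_tokens:
            batches.append(cur)   # flush the full batch
            cur = []
        cur.append(i)
    if cur:
        batches.append(cur)
    return batches
```

With a fixed per‑batch sample count, every batch would be padded to the longest sequence in it; grouping by length keeps the padded width close to the true lengths.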
After pretraining, the model is fine‑tuned on six V+L downstream tasks—Visual Question Answering, Visual Entailment, NLVR² (Natural Language for Visual Reasoning), Visual Commonsense Reasoning, Referring Expression Comprehension, and Image‑Text Retrieval—covering nine datasets. UNITER‑Base outperforms prior models on most datasets, and UNITER‑Large sets new state‑of‑the‑art results.
An ablation study shows that combining all four pretraining tasks (MLM + ITM + MRC‑kl + MRFR) yields the strongest performance, and that using all four datasets together further improves results, confirming the benefit of large‑scale multimodal pretraining.
The presentation concludes with a summary of findings and thanks the audience.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.