UNITER: Unified Image‑Text Representation Learning for Vision‑Language Tasks
This article introduces UNITER, a unified image‑text representation learning framework. It covers the four large multimodal pretraining datasets, the three pretraining objectives (MLM, ITM, MRM), the model architecture, training optimizations, and evaluation on six vision‑language downstream tasks, where UNITER achieves state‑of‑the‑art results.
UNITER is a general-purpose image‑text representation learning model that is pretrained on four large multimodal datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions). By jointly processing image regions and text tokens, it learns a unified embedding that can be fine‑tuned for a variety of vision‑language (V+L) downstream tasks.
The pretraining employs three self‑supervised objectives: Masked Language Modeling (MLM), where randomly masked text tokens are predicted from the surrounding tokens and the image regions; Image‑Text Matching (ITM), which classifies whether an image‑text pair is matched or mismatched; and Masked Region Modeling (MRM), with three variants that mask image regions and predict either their visual features (MRFR, Masked Region Feature Regression), their object class labels (MRC, Masked Region Classification), or the detector's class distribution (MRC‑kl, classification with KL‑divergence).
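As a rough illustration, the token-masking step behind the MLM objective can be sketched as follows. This is a simplified BERT‑style scheme, not the authors' code: it always replaces a selected token with `[MASK]`, whereas the full recipe sometimes keeps the token or substitutes a random one.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Simplified BERT-style masking for an MLM objective (sketch).

    Returns the corrupted token sequence and the prediction targets:
    targets[i] is the original token where position i was masked,
    and None elsewhere (no loss at unmasked positions).
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)   # hide the token from the model
            targets.append(tok)            # model must recover the original
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets
```

In UNITER the prediction at each masked position is conditioned on both the remaining text tokens and all image regions, which is what makes the objective cross‑modal rather than text‑only.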
UNITER’s architecture consists of an Image Embedder that combines region and location features, a Text Embedder that adds token and positional embeddings, and a shared Transformer that fuses the two modalities into a joint representation.
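For concreteness, the location part of the Image Embedder can be sketched as below. The UNITER paper encodes each region's bounding box as a 7‑dimensional vector (normalized corners, width, height, and area); this vector and the region's visual feature are each linearly projected and summed to form the region embedding. The helper here is illustrative, not the authors' implementation.

```python
def location_feature(box, img_w, img_h):
    """7-d normalized location feature for one detected region:
    [x1, y1, x2, y2, w, h, w*h], all relative to the image size."""
    x1, y1, x2, y2 = box
    nx1, ny1 = x1 / img_w, y1 / img_h   # normalized top-left corner
    nx2, ny2 = x2 / img_w, y2 / img_h   # normalized bottom-right corner
    w, h = nx2 - nx1, ny2 - ny1         # normalized width and height
    return [nx1, ny1, nx2, ny2, w, h, w * h]
```

The text side mirrors BERT: token embeddings plus position embeddings, so the shared Transformer receives one joint sequence of region embeddings and token embeddings.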
To accelerate training, three techniques are applied: Dynamic Batching (grouping samples of similar length to reduce wasted computation on padding), Gradient Accumulation (accumulating gradients over several steps to reduce inter‑GPU communication), and Mixed‑Precision Training (using 16‑bit floats to cut memory use, which in turn allows larger batches).
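A minimal sketch of the dynamic‑batching idea (a hypothetical helper, not the authors' code): sort samples by length, then cap each batch by its total padded token count rather than by a fixed number of samples, so short sequences pack densely and long ones do not blow up the padded size.

```python
def dynamic_batches(lengths, max_tokens):
    """Group sample indices into batches whose padded size
    (batch_size * longest_sample) stays under max_tokens."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, cur = [], []
    for i in order:
        # lengths arrive in ascending order, so sample i is the longest
        # in the batch; its length sets the padded width.
        padded = (len(cur) + 1) * lengths[i]
        if cur and padded > max_tokens:
            batches.append(cur)   # flush the full batch
            cur = []
        cur.append(i)
    if cur:
        batches.append(cur)
    return batches
```

With a fixed per‑batch sample count, every batch would be padded to the longest sequence in it; grouping by length keeps the padded width close to the true lengths.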
After pretraining, the model is fine‑tuned on six V+L downstream tasks—Visual Question Answering, Visual Entailment, NLVR² (Natural Language for Visual Reasoning), Visual Commonsense Reasoning, Referring Expression Comprehension, and Image‑Text Retrieval—covering nine datasets. UNITER‑Base outperforms prior models on most datasets, and UNITER‑Large sets new state‑of‑the‑art results.
An ablation study shows that combining all four pretraining tasks (MLM + ITM + MRC‑kl + MRFR) yields the strongest performance, and that using all four datasets together further improves results, confirming the benefit of large‑scale multimodal pretraining.
The presentation concludes with a summary of findings and thanks the audience.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.