How CoCa Unifies Image Captioning and Contrastive Learning in Vision-Language Models

This article examines the CoCa model, explaining how it extends CLIP with image captioning by combining contrastive and generative objectives, detailing its architecture, training tricks, and performance gains on ImageNet and zero‑shot benchmarks.

Baobao Algorithm Notes

Jun 7, 2022

The fifth article in the pre‑training model series introduces CoCa (Contrastive Captioners are Image‑Text Foundation Models), a model that builds on CLIP by adding an image‑captioning pre‑training task while retaining contrastive learning.

It first clarifies three related concepts: Vision Pretraining (pre‑training image‑only models such as ResNet, ViT, or MAE on large image datasets), Vision‑Language Pretraining (joint encoding of images and text, e.g., LXMERT), and Image‑Text Foundation Models, which encompass both and can be used as stand‑alone encoders or decoders.

CoCa inherits CLIP’s dual‑encoder contrastive loss and SimVLM’s generative decoder, but replaces SimVLM’s prefix‑based pre‑training with a simpler alignment of single‑modal text features to image features, easing the difficulty of generative training. The architecture consists of three main blocks (color‑coded in the original diagram): a yellow image encoder, a yellow single‑modal text decoder, and a blue multimodal text decoder. The single‑modal decoder encodes the text alone, while the multimodal decoder additionally cross‑attends to the image encoder’s output. Both decoders use causal masking, and the prediction targets are offset by one token, so the single‑modal decoder cannot leak future tokens to the multimodal one.
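The masking and one-step offset described above can be illustrated with a minimal sketch in plain Python. It is not the CoCa implementation: the helper names (`causal_mask`, `masked_softmax`, `shift_for_next_token`) and the toy caption are hypothetical, and a real model would apply the same mask to tensors inside each attention layer.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    out = []
    for row, allowed in zip(scores, mask):
        peak = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def shift_for_next_token(tokens):
    """Split a caption into decoder inputs and targets offset by one step."""
    return tokens[:-1], tokens[1:]

# Hypothetical toy caption; [BOS]/[EOS] are placeholder special tokens.
caption = ["[BOS]", "a", "dog", "runs", "[EOS]"]
inputs, targets = shift_for_next_token(caption)

mask = causal_mask(len(inputs))
uniform_scores = [[0.0] * len(inputs) for _ in inputs]
weights = masked_softmax(uniform_scores, mask)

for token, row in zip(inputs, weights):
    print(token, [round(w, 2) for w in row])
```

With uniform scores, each row spreads its attention evenly over the current and earlier positions only, so the representation at step t never depends on the token it is trained to predict at step t+1.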

The model is trained with both contrastive loss (aligning image and text embeddings) and generation loss (producing captions). This dual objective enables three capabilities: (1) pure image representation learning, (2) multimodal representation alignment, and (3) image caption generation.
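As a rough sketch of how the two objectives combine, the snippet below adds a symmetric contrastive (InfoNCE-style) loss to a next-token captioning loss. It uses plain Python so that it runs without dependencies. The function names, the fixed temperature of 0.07, the averaging over both contrastive directions, and the loss weights `w_con` and `w_cap` are illustrative assumptions, not values taken from the article.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    lse = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; matched pairs sit on the diagonal."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [list(col) for col in zip(*logits)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

def captioning_loss(step_logits, target_ids):
    """Mean negative log-likelihood of the target token at each decoding step."""
    nll = [-log_softmax(logits)[t] for logits, t in zip(step_logits, target_ids)]
    return sum(nll) / len(nll)

def combined_loss(image_embs, text_embs, step_logits, target_ids,
                  w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights are placeholders to tune."""
    return (w_con * contrastive_loss(image_embs, text_embs)
            + w_cap * captioning_loss(step_logits, target_ids))

# Toy batch of two image-text pairs and a 3-step caption over a 4-token vocabulary.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
uniform_logits = [[0.0] * 4 for _ in range(3)]
targets = [0, 1, 2]

loss_matched = combined_loss(image_embs, matched_text, uniform_logits, targets)
loss_swapped = combined_loss(image_embs, swapped_text, uniform_logits, targets)
print(loss_matched < loss_swapped)
```

The contrastive term drops when matching image and text embeddings line up, while the captioning term depends only on how much probability the decoder assigns to the correct next token, so the two terms supervise different parts of the model.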

Empirical results demonstrate strong performance: on ImageNet classification CoCa reaches 91% top‑1 accuracy after fine‑tuning, and in ImageNet zero‑shot evaluation it achieves over 86%, roughly 10 points higher than CLIP. The article notes that such comparisons can be controversial because the models are pre‑trained on massive datasets, and suggests alternative evaluation metrics such as Pearson correlation or mean average precision (mAP) for retrieval tasks.
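For readers unfamiliar with zero-shot evaluation, the idea can be sketched as follows: each class name is turned into a text embedding, and an image is assigned to the class whose text embedding is most similar to the image embedding. The embeddings below are hypothetical toy values standing in for encoder outputs, and the helper names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_predict(image_emb, class_text_embs):
    """Return the class whose text embedding is closest to the image embedding."""
    return max(class_text_embs,
               key=lambda name: cosine(image_emb, class_text_embs[name]))

def top1_accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose nearest class embedding matches the label."""
    hits = sum(zero_shot_predict(emb, class_text_embs) == label
               for emb, label in zip(image_embs, labels))
    return hits / len(labels)

# Hypothetical embeddings; a real evaluation would obtain them from the
# text and image encoders, with no classifier trained on the target dataset.
class_text_embs = {"cat": [1.0, 0.1], "dog": [0.1, 1.0]}
image_embs = [[0.9, 0.2], [0.2, 0.8], [0.6, 0.5]]
labels = ["cat", "dog", "dog"]

accuracy = top1_accuracy(image_embs, labels, class_text_embs)
print(f"zero-shot top-1 accuracy: {accuracy:.3f}")
```

In this toy case the third image lies closer to the "cat" embedding although it is labeled "dog", so two of the three predictions are correct.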

For readers interested in implementation details, an open‑source PyTorch implementation is available at https://github.com/lucidrains/CoCa-pytorch.
