Baobao Algorithm Notes
Jun 7, 2022 · Artificial Intelligence
How CoCa Unifies Image Captioning and Contrastive Learning in Vision-Language Models
This article examines the CoCa model, explaining how it extends CLIP with image captioning by combining contrastive and generative objectives, detailing its architecture, training tricks, and performance gains on ImageNet and zero‑shot benchmarks.
CoCaImage Captioningvision-language
0 likes · 7 min read
