Beginner’s Guide to Vision‑Language Models Day 7: How CLIP Achieves Joint Visual‑Language Understanding

This article explains CLIP’s dual‑encoder architecture—using a Vision Transformer for images and a Transformer for text—how both encoders map inputs into a shared embedding space, the role of cosine similarity, and the InfoNCE contrastive loss that drives joint visual‑language learning.


Introduction

In the previous posts we saw CLIP’s impressive zero‑shot performance. This article dives into the architecture and training mechanism that enable simultaneous visual and textual understanding.

Dual‑Encoder Design

CLIP employs two independent encoders that produce embeddings in a common semantic space.

Image encoder: converts raw pixels into image embeddings using a Vision Transformer (ViT).

Text encoder: converts natural-language descriptions into text embeddings with a Transformer architecture similar to GPT-2.

Similarity between the two embeddings is measured with cosine similarity.
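
To make this concrete, here is a minimal PyTorch sketch of the cosine-similarity computation (the tensor names and the 512-dimensional embedding size are illustrative, not taken from the original article):

```python
import torch
import torch.nn.functional as F

# Illustrative embeddings: one image vector and one text vector,
# both assumed to live in the same 512-dimensional shared space.
image_embedding = torch.randn(512)
text_embedding = torch.randn(512)

# Cosine similarity = dot product of the L2-normalised vectors.
similarity = F.cosine_similarity(image_embedding, text_embedding, dim=0)
print(similarity.item())  # a value in [-1, 1]; higher means more related
```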

Why ViT for Images?

Traditional convolutional networks are replaced by ViT because self‑attention can model relationships among image patches, providing global context and more robust representations.

Patch embedding: the image is split into non-overlapping patches; each patch is linearly projected, and positional encodings are added before the sequence is fed into the Transformer (sketched below).

Advantage: self-attention captures interactions between all patches, leading to better generalisation.
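
A minimal sketch of the patch-embedding step, assuming a 224×224 RGB input and 16×16 patches, and using the common trick of a strided convolution for the split-and-project operation (all hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768

# A strided convolution splits the image into non-overlapping 16x16
# patches and linearly projects each patch in a single operation.
patch_proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)           # (batch, channels, H, W)
patches = patch_proj(image)                   # (1, 768, 14, 14)
patches = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens

# Learnable positional encodings are added before the Transformer layers.
pos_embed = nn.Parameter(torch.zeros(1, patches.shape[1], embed_dim))
tokens = patches + pos_embed
```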

Text Encoder Details

The text encoder tokenises the input into sub‑word units, embeds them, adds positional encodings and processes the sequence with a decoder‑only Transformer variant.

Advantage: the model captures contextual relationships between words, producing semantically rich text embeddings.
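
A hedged sketch of this pipeline, assuming token IDs have already been produced by a sub-word tokeniser (the vocabulary size, the 77-token context length, and taking the final position as the sentence embedding mirror common CLIP descriptions, but the module choices here are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, context_len, embed_dim = 49408, 77, 512

token_embed = nn.Embedding(vocab_size, embed_dim)
pos_embed = nn.Parameter(torch.zeros(context_len, embed_dim))

layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
text_transformer = nn.TransformerEncoder(layer, num_layers=12)

token_ids = torch.randint(0, vocab_size, (1, context_len))  # dummy token IDs
x = token_embed(token_ids) + pos_embed                      # embed + positions

# A causal (upper-triangular) mask gives the decoder-only, GPT-style
# attention pattern: each token attends only to earlier positions.
causal_mask = torch.triu(torch.full((context_len, context_len),
                                    float("-inf")), diagonal=1)
features = text_transformer(x, mask=causal_mask)            # (1, 77, 512)

# The feature at the final (end-of-text) position serves as the
# sentence-level text embedding.
text_embedding = features[:, -1, :]                         # (1, 512)
```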

Shared Embedding Space Alignment

Both encoders map their outputs into a common fixed‑dimensional vector space. The training objective forces semantically related image‑text pairs (e.g., a cat image and the caption “a photo of a cat”) to be close, while unrelated pairs are pushed apart.
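
A minimal sketch of this projection-and-normalisation step (the linear layers stand in for CLIP's learned projection matrices; the 768/512 widths are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_proj = nn.Linear(768, 512, bias=False)  # ViT width -> shared dim
text_proj = nn.Linear(512, 512, bias=False)   # text width -> shared dim

image_features = torch.randn(8, 768)  # encoder outputs for a batch of 8
text_features = torch.randn(8, 512)

# Project into the shared space, then L2-normalise so that dot products
# between the two modalities are exactly cosine similarities.
image_emb = F.normalize(image_proj(image_features), dim=-1)
text_emb = F.normalize(text_proj(text_features), dim=-1)
```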

InfoNCE Contrastive Loss

The loss is computed over all possible image‑text pairs in a batch. First, similarity scores for every pair are obtained, then a softmax converts them into a probability distribution. The correct pair should obtain the highest probability; otherwise the model is penalised. Symmetric optimisation aligns both image→text and text→image directions simultaneously.

Compute similarity matrix (N×N) using cosine similarity.

Apply softmax over rows and columns.

The InfoNCE loss encourages diagonal elements (correct matches) to be large and off-diagonal elements to be small, as the sketch after this list shows.
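
A minimal sketch of the symmetric InfoNCE loss, assuming the L2-normalised embeddings from the previous step (the fixed default temperature here is illustrative; CLIP learns it as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_infonce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of N matched image-text pairs."""
    # N x N similarity matrix; the embeddings are already L2-normalised,
    # so the matrix product yields cosine similarities.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text: targets are the diagonal.
    targets = torch.arange(logits.shape[0], device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)      # softmax over rows
    loss_t2i = F.cross_entropy(logits.t(), targets)  # softmax over columns
    return (loss_i2t + loss_t2i) / 2
```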

Training Process

A typical training step consists of:

Forward pass: feed a batch of N image-text pairs into the two encoders.

Projection: project the resulting embeddings into the shared space and normalise them.

Similarity matrix: compute the N×N cosine similarity matrix.

Loss computation: calculate the InfoNCE loss.

Backward pass: compute gradients and update both encoders with an optimiser such as Adam.

Early stopping can be applied by monitoring validation loss; the reference implementation below shows one way to wire these steps together.

Reference Implementation

Below is a simplified PyTorch-style training loop that ties together the image_encoder (ViT), the text_encoder (Transformer), a learnable temperature parameter, the InfoNCE loss, and an early-stopping hook.
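
A hedged sketch of such a loop: image_encoder, text_encoder, train_loader, num_epochs, and evaluate are assumed stand-ins rather than definitions from the original article, clip_infonce_loss is the function sketched earlier, and all hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

# Learnable temperature, stored on a log scale as in the CLIP paper.
log_temp = torch.nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

params = (list(image_encoder.parameters())       # assumed ViT module
          + list(text_encoder.parameters())      # assumed Transformer module
          + [log_temp])
optimizer = torch.optim.Adam(params, lr=1e-4)

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(num_epochs):
    for images, token_ids in train_loader:       # batch of N matched pairs
        image_emb = F.normalize(image_encoder(images), dim=-1)
        text_emb = F.normalize(text_encoder(token_ids), dim=-1)

        loss = clip_infonce_loss(image_emb, text_emb,
                                 temperature=1 / log_temp.exp())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Early-stopping hook: stop when validation loss stops improving.
    val_loss = evaluate(image_encoder, text_encoder)  # assumed helper
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```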

Tags: contrastive learning, Transformer, PyTorch, CLIP, Vision Transformer, InfoNCE, multi-modal embedding
Written by AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
