How CLIP Uses Natural Language Supervision for Powerful Zero‑Shot Vision

This article explains CLIP’s multimodal contrastive pre‑training, its simple yet effective architecture, its training pseudocode, and how its zero‑shot capability can rival supervised ImageNet models by leveraging a 400‑million‑pair image‑text dataset and a shared semantic embedding space.

Baobao Algorithm Notes

Mar 7, 2022

Overview

CLIP (Contrastive Language‑Image Pre‑training), introduced in the paper "Learning Transferable Visual Models From Natural Language Supervision", is a multimodal pre‑training framework that jointly learns image and text representations through contrastive learning on roughly 400 million image‑text pairs. The model learns a shared embedding space without requiring task‑specific labels.

Model Architecture

The system consists of two encoders:

Image encoder: a ResNet or Vision Transformer that maps an input image I to a feature vector I_f of dimension d_i.

Text encoder: a CBOW model or a Transformer that maps a tokenized caption T to a feature vector T_f of dimension d_t.

Both feature vectors are linearly projected to a common dimension d_e using learned matrices W_i ∈ ℝ^{d_i×d_e} and W_t ∈ ℝ^{d_t×d_e}, then L2‑normalized to obtain the final embeddings I_e and T_e. Because both embeddings have unit length, their dot product is their cosine similarity. A learned temperature parameter t scales the similarity logits by a factor of exp(t).
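The projection and normalization step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the random arrays stand in for real encoder outputs, the dimension sizes are arbitrary, and the helper names l2_normalize and project are chosen here for the example.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    """Scale each vector along `axis` to unit Euclidean length."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def project(features, weight):
    """Linearly project encoder features to d_e, then L2-normalize."""
    return l2_normalize(features @ weight, axis=1)

rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 4, 2048, 512, 256      # illustrative sizes (assumption)
I_f = rng.normal(size=(n, d_i))           # stand-in for image_encoder output
T_f = rng.normal(size=(n, d_t))           # stand-in for text_encoder output
W_i = rng.normal(size=(d_i, d_e)) * 0.01  # image projection matrix
W_t = rng.normal(size=(d_t, d_e)) * 0.01  # text projection matrix

I_e = project(I_f, W_i)                   # [n, d_e], unit-length rows
T_e = project(T_f, W_t)                   # [n, d_e], unit-length rows
cosine = I_e @ T_e.T                      # [n, n] cosine similarities
```

Since every row of I_e and T_e has unit length, each entry of cosine lies in [-1, 1], which is why the temperature factor is needed to give the softmax a usable range of logits.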

Training Objective

For a minibatch of n aligned image‑text pairs, the pairwise cosine similarities form an n×n matrix. The diagonal entries correspond to correct matches. The loss is a symmetric cross‑entropy applied in both directions (image→text and text→image).
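Written out, the objective can be stated as below. The notation s_{jk} for the scaled similarity between image j and text k is introduced here for illustration; I_e^{(j)} and T_e^{(k)} denote the j-th and k-th rows of the embedding matrices, and the scaling by exp(t) follows the article's pseudocode.

```latex
s_{jk} = e^{t} \, \big\langle I_e^{(j)}, T_e^{(k)} \big\rangle, \qquad j, k = 1, \dots, n

\mathcal{L}_{\text{image}\to\text{text}}
  = -\frac{1}{n} \sum_{k=1}^{n} \log \frac{\exp(s_{kk})}{\sum_{j=1}^{n} \exp(s_{kj})}

\mathcal{L}_{\text{text}\to\text{image}}
  = -\frac{1}{n} \sum_{k=1}^{n} \log \frac{\exp(s_{kk})}{\sum_{j=1}^{n} \exp(s_{jk})}

\mathcal{L} = \tfrac{1}{2} \left( \mathcal{L}_{\text{image}\to\text{text}}
  + \mathcal{L}_{\text{text}\to\text{image}} \right)
```

The first term normalizes over each row of the similarity matrix (one image against all texts), the second over each column (one text against all images).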

# image_encoder – ResNet or Vision Transformer
# text_encoder – CBOW or Text Transformer
# I[n, h, w, c] – minibatch of images
# T[n, l]       – minibatch of texts
# W_i[d_i, d_e] – image projection matrix
# W_t[d_t, d_e] – text projection matrix
# t             – temperature (learned)

I_f = image_encoder(I)               # [n, d_i]
T_f = text_encoder(T)                # [n, d_t]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
logits = np.dot(I_e, T_e.T) * np.exp(t)   # [n, n]
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t) / 2
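The pseudocode above leaves the encoders and cross_entropy_loss undefined. A self‑contained NumPy sketch of just the loss computation might look like the following; the function names (log_softmax, clip_loss, unit_rows), the random embeddings, and the temperature value are assumptions made for the demo rather than details from the paper.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(I_e, T_e, t):
    """Symmetric cross-entropy over the n x n similarity matrix.

    I_e, T_e: L2-normalized embeddings of shape [n, d_e]; row k of each
              belongs to the same image-text pair.
    t:        log-temperature scalar; logits are scaled by exp(t).
    """
    n = I_e.shape[0]
    logits = (I_e @ T_e.T) * np.exp(t)   # [n, n]
    diag = np.arange(n)                  # correct matches sit on the diagonal
    # image -> text: each row is a distribution over the n captions
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # text -> image: each column is a distribution over the n images
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

def unit_rows(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d_e = 8, 32                                   # arbitrary demo sizes
T_e = unit_rows(rng.normal(size=(n, d_e)))
t = np.log(10.0)                                 # arbitrary demo temperature

aligned = clip_loss(T_e, T_e, t)                 # every image matches its caption
shuffled = clip_loss(np.roll(T_e, 1, axis=0), T_e, t)  # pairs deliberately misaligned
```

In this toy setup the aligned batch yields a lower loss than the shuffled one, which is the behavior the objective is designed to reward.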

Zero‑Shot Classification

After training, CLIP can classify images without any fine‑tuning. For each target class, a textual prompt (e.g., "a photo of a cat") is encoded, producing a set of class embeddings. An input image is encoded and its similarity to every class embedding is computed; the class with the highest similarity is selected. With this zero‑shot protocol, the paper reports that CLIP matches the accuracy of the original supervised ResNet‑50 on ImageNet without using any of its labeled training examples.
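The classification procedure reduces to an argmax over cosine similarities. The sketch below illustrates it with random vectors in place of real encoder outputs; zero_shot_classify, the class list, and the synthetic "image near the dog prompt" are hypothetical and exist only to make the example runnable.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors along `axis` to unit length."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, class_embs, class_names):
    """Return the class whose prompt embedding is most similar to the image.

    image_emb:  [d_e] embedding of a single image.
    class_embs: [num_classes, d_e] embeddings of one prompt per class.
    """
    sims = l2_normalize(class_embs) @ l2_normalize(image_emb)  # cosine similarities
    best = int(np.argmax(sims))
    return class_names[best], sims

class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {name}" for name in class_names]  # would be fed to the text encoder

rng = np.random.default_rng(0)
class_embs = rng.normal(size=(len(prompts), 64))       # stand-in for encoded prompts
image_emb = class_embs[1] + 0.1 * rng.normal(size=64)  # synthetic image close to the "dog" prompt

label, sims = zero_shot_classify(image_emb, class_embs, class_names)
```

Adding a new class only requires encoding one more prompt; no weights are updated, which is what makes the protocol zero‑shot.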

Key Advantages

Contrastive learning leverages abundant, freely available web image‑text pairs.

The multimodal training yields a unified semantic space that supports both retrieval and classification.

The massive 400 million‑pair dataset provides broad visual and linguistic coverage, although the dataset itself has not been publicly released.
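To illustrate the retrieval use of the shared space mentioned above, a text query can be ranked against a gallery of image embeddings with a single matrix product. This is a sketch under stated assumptions: the gallery and query are random stand‑ins for real embeddings, and top_k_images is a name invented for this example.

```python
import numpy as np

def top_k_images(text_emb, image_embs, k=3):
    """Indices of the k images most similar to a text query, best first.

    Assumes all embeddings are already L2-normalized, so the dot product
    equals cosine similarity.
    """
    sims = image_embs @ text_emb
    order = np.argsort(-sims)[:k]
    return order, sims[order]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 64))                     # stand-in for 100 image embeddings
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

query = gallery[42] + 0.05 * rng.normal(size=64)         # synthetic text query near image 42
query /= np.linalg.norm(query)

idx, scores = top_k_images(query, gallery, k=3)
```

The same embeddings serve classification and retrieval; only the side being ranked changes.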

For a complete technical description, see the original paper: https://arxiv.org/abs/2103.00020
