Beginner’s Guide to CLIP Inference: Step‑by‑Step with Hugging Face

This tutorial walks through loading the openai/clip‑vit‑base‑patch32 model with Hugging Face, preprocessing images and text, encoding them into a shared embedding space, computing cosine similarity for zero‑shot image‑text matching, and visualizing the results, all with concrete code examples.


Load CLIP model

Use the Hugging Face Transformers library to load the pre-trained model openai/clip-vit-base-patch32. The architecture pairs a ViT-B/32 image encoder (a Vision Transformer that splits each image into 32×32-pixel patches) with a GPT-style Transformer text encoder. This variant is lightweight enough to run on a GPU with 4 GB of memory while retaining strong zero-shot performance.

CLIPProcessor integrates a tokenizer and an image pre‑processor; it automatically resizes and normalizes raw images and tokenizes text. CLIPModel receives the processed inputs and outputs vectors that map both modalities into a shared embedding space.
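A minimal loading sketch (the checkpoint downloads automatically on first use; the class names below are the standard Hugging Face Transformers API):

```python
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)          # image encoder + text encoder
processor = CLIPProcessor.from_pretrained(model_id)  # tokenizer + image pre-processor
model.eval()  # inference only, no gradient updates needed
```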

Preprocess image

Load an example image from a URL or a local file and pass it to the CLIPProcessor. The processor handles the required resizing and normalization internally.
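A sketch of this step, assuming the COCO demo image used throughout the Transformers documentation; any local file opened with PIL works the same way:

```python
import requests
from PIL import Image

# Demo image (two cats on a couch); replace with Image.open("your_photo.jpg") for a local file.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor resizes, center-crops, and normalizes the image internally.
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```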

Define text prompts

text_labels = ["a photo of a dog", "a photo of a cat", "a photo of a horse"]

Encode and compute similarity

Run the processor on the image and the list of text prompts to build a single batch of model inputs, then pass the batch through CLIPModel to obtain image and text embeddings. Because CLIP does not produce class probabilities directly, compute cosine similarity between the image embedding and each text embedding; the highest similarity score indicates the caption that best matches the image. CLIPModel exposes these scaled cosine similarities as logits_per_image, so applying a softmax over them yields relative matching probabilities.
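A sketch of the matching step, reusing the model, processor, image, and text_labels defined above:

```python
import torch

# Build one batch containing the image and all three candidate captions.
inputs = processor(text=text_labels, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image
# embedding and each text embedding (shape: [num_images, num_texts]).
logits = outputs.logits_per_image
probs = logits.softmax(dim=-1)  # turn similarities into relative probabilities

for label, p in zip(text_labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```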

Visualization

Plot the similarity scores as a bar chart; taller bars represent stronger semantic alignment between the image and the corresponding text.
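One way to draw the chart with matplotlib, reusing probs from the previous step:

```python
import matplotlib.pyplot as plt

scores = probs[0].tolist()
plt.figure(figsize=(6, 3))
plt.bar(text_labels, scores)
plt.ylabel("Softmax similarity")
plt.title("Image-text matching scores")
plt.tight_layout()
plt.show()
```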

Batch processing

Extend the same pipeline to handle multiple images and multiple captions, demonstrating CLIP’s ability to scale to batch‑level inference.
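A batched variant of the same pipeline; the image file names here are placeholders for your own files:

```python
# Hypothetical local files; substitute your own paths or URLs.
image_paths = ["dog.jpg", "cat.jpg", "horse.jpg"]
images = [Image.open(p) for p in image_paths]

# The processor accepts lists of images and texts in a single call.
inputs = processor(text=text_labels, images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Rows are images, columns are captions.
probs = outputs.logits_per_image.softmax(dim=-1)
for path, idx in zip(image_paths, probs.argmax(dim=-1).tolist()):
    print(f"{path} -> {text_labels[idx]}")
```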

Code repository: https://github.com/deepalim100/CLIP-playground/blob/main/clip-inference.ipynb


Tags: Python, CLIP, zero-shot, cosine similarity, Hugging Face, image-text similarity
Written by AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
