How a Simplified Transformer Enables Lightweight CLIP Training on a Single RTX3090

This paper presents SiCLIP, a framework that combines a simplified Transformer architecture, weight sharing, multi‑stage knowledge distillation, and a novel pair‑matching loss on synthetic captions to train a competitive CLIP model with only one RTX3090 GPU and 1 TB of storage. The authors report a state‑of‑the‑art trade‑off among data size, parameter count, and accuracy.

AIWalker

Jan 10, 2025

Introduction

Contrastive Language‑Image Pre‑training (CLIP) delivers strong zero‑shot performance, but its large parameter count and data requirements make training impractical on consumer‑grade hardware. The authors target two challenges: (1) reducing the number of trainable parameters while preserving the model's knowledge, and (2) augmenting a small dataset to improve convergence.

They propose SiCLIP, which trains a CLIP‑like model on a single Nvidia RTX3090 (24 GB VRAM) with only 1 TB of storage.

Related Work

Prior efforts such as MobileCLIP, TinyCLIP, and various knowledge‑distillation (KD) and weight‑inheritance (WI) methods aim to shrink CLIP. Data‑augmentation techniques, including synthetic caption generation, have also been explored to improve dataset quality.

Methods

The SiCLIP pipeline consists of four components:

Simplified Model Structure: The original MobileCLIP‑S0 architecture is rebuilt using SAS‑P blocks, which replace standard Pre‑LN blocks and eliminate shortcut connections. Weight sharing across SAS‑P blocks reduces the image encoder’s parameters by ~14% compared with MobileCLIP‑S0 and ~11% relative to OpenAI‑B/16.
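To make the effect of weight sharing concrete, the sketch below counts parameters for a stack of blocks with and without sharing. The block layout in `block_params`, the width of 256, and the grouping scheme `share_group` are illustrative assumptions, not the SiCLIP configuration, so the numbers are not meant to reproduce the ~14% figure.

```python
def block_params(dim: int, mlp_ratio: int = 4) -> int:
    """Rough parameter count for one hypothetical block (biases ignored).

    Assumption: two dim x dim attention projections plus a two-layer MLP
    with hidden width mlp_ratio * dim.
    """
    attention = 2 * dim * dim
    mlp = 2 * dim * (mlp_ratio * dim)
    return attention + mlp


def stack_params(num_blocks: int, dim: int, share_group: int = 1) -> int:
    """Parameters of a stack where every `share_group` consecutive
    blocks reuse a single set of weights."""
    if share_group < 1:
        raise ValueError("share_group must be at least 1")
    unique_blocks = -(-num_blocks // share_group)  # ceiling division
    return unique_blocks * block_params(dim)


if __name__ == "__main__":
    unshared = stack_params(num_blocks=4, dim=256)
    shared = stack_params(num_blocks=4, dim=256, share_group=2)
    print(f"unshared: {unshared}, shared: {shared}")
    print(f"reduction: {1 - shared / unshared:.0%}")
```

Only the blocks that share weights shrink in this count. Modules outside the shared stack are unaffected, so a whole‑encoder reduction would be smaller than the per‑stack figure printed here.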

Weight Inheritance (WI): Unchanged modules inherit their pre‑trained MobileCLIP‑S0 weights and serve as a backbone. Only the new SAS‑P blocks are trained on the small dataset, which cuts gradient memory and allows larger batch sizes.

Multi‑Stage Knowledge Distillation (WIKD): Distillation is applied in three stages (single‑modal feature space, contrastive relation space, and interactive contrastive relation space), with MobileCLIP‑S0 as the teacher. The final distillation loss combines feature, contrastive, and interactive terms with a learnable temperature τ.
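The three distillation terms can be sketched in plain Python as follows. The article names the terms but not their exact form, so the choices here are assumptions: mean squared error for the feature term, a KL divergence between teacher and student similarity distributions for the contrastive relation term, and a contrastive loss between student image embeddings and teacher text embeddings for the interactive term. The temperature `tau` is a fixed argument rather than a learnable parameter, and the `weights` tuple is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def similarity(images, texts, tau):
    """Cosine-similarity logits divided by tau; row i belongs to image i."""
    images = [normalize(v) for v in images]
    texts = [normalize(v) for v in texts]
    return [[dot(i, t) / tau for t in texts] for i in images]


def log_softmax(row):
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def feature_loss(student, teacher):
    """Assumed feature term: mean squared error between embeddings."""
    sq = [(s - t) ** 2 for sv, tv in zip(student, teacher) for s, t in zip(sv, tv)]
    return sum(sq) / len(sq)


def relation_loss(student_logits, teacher_logits):
    """Assumed relation term: mean KL(teacher || student) over rows."""
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p, log_q = log_softmax(t_row), log_softmax(s_row)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(student_logits)


def contrastive_loss(logits):
    """Cross-entropy with the matching pair on the diagonal."""
    diag = [log_softmax(row)[i] for i, row in enumerate(logits)]
    return -sum(diag) / len(diag)


def distillation_loss(s_img, s_txt, t_img, t_txt, tau=0.07, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three assumed terms."""
    feat = feature_loss(s_img, t_img) + feature_loss(s_txt, t_txt)
    rel = relation_loss(similarity(s_img, s_txt, tau), similarity(t_img, t_txt, tau))
    inter = contrastive_loss(similarity(s_img, t_txt, tau))
    w_feat, w_rel, w_inter = weights
    return w_feat * feat + w_rel * rel + w_inter * inter
```

When the student reproduces the teacher's embeddings exactly, the feature and relation terms vanish and only the interactive contrastive term remains, which is small if the teacher's own image and text embeddings are well aligned.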

Pair‑Matching (PM) Loss & Synthetic Dataset: Each image receives multiple synthetic captions generated by the CoCa model, forming the CC12M‑SYN dataset. A binary matching task predicts whether an image‑text pair matches, using logits derived from positive and sampled negative pairs.
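A binary matching objective of this kind can be sketched as a binary cross‑entropy over one positive and one sampled negative pair per image. This is a minimal illustration under assumptions: the article does not specify how the logits are computed or how negatives are sampled, so the uniform sampler `sample_negative` and the square `logits` matrix (one caption per image in the batch) are hypothetical.

```python
import math
import random


def bce_with_logit(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def sample_negative(index, batch_size, rng):
    """Pick a caption index different from `index`, uniformly at random."""
    j = rng.randrange(batch_size - 1)
    return j if j < index else j + 1


def pair_matching_loss(logits, rng):
    """logits[i][j] scores image i against caption j; the diagonal holds true pairs."""
    n = len(logits)
    if n < 2:
        raise ValueError("need at least two pairs to sample a negative")
    total = 0.0
    for i in range(n):
        total += bce_with_logit(logits[i][i], 1.0)  # positive pair, label 1
        j = sample_negative(i, n, rng)
        total += bce_with_logit(logits[i][j], 0.0)  # sampled negative, label 0
    return total / (2 * n)


if __name__ == "__main__":
    rng = random.Random(0)
    confident = [[8.0, -8.0], [-8.0, 8.0]]
    confused = [[-8.0, 8.0], [8.0, -8.0]]
    print(pair_matching_loss(confident, rng), pair_matching_loss(confused, rng))
```

The loss is near zero when matching pairs score high and sampled mismatches score low, and it grows roughly linearly with the logit magnitude when the scores are reversed.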

Experiments

Implementation Details: Training runs for 32 epochs on a single RTX3090 with AdamW, batch size 1536, weight decay 0.1, learning rate 0.001, and λ₁=4000, λ₂=λ₃=1, λ₄=0.1. Warm‑up is applied for the first 10k iterations.
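The reported hyperparameters and the warm‑up can be written down as a small configuration plus a schedule function. The values in `TRAIN_CONFIG` come from the paragraph above; the linear shape of the warm‑up and the constant rate afterwards are assumptions, since the article only states that warm‑up covers the first 10k iterations.

```python
# Hyperparameters as reported in the article.
TRAIN_CONFIG = {
    "epochs": 32,
    "optimizer": "AdamW",
    "batch_size": 1536,
    "weight_decay": 0.1,
    "base_lr": 1e-3,
    "warmup_steps": 10_000,
}


def learning_rate(step: int, base_lr: float = 1e-3, warmup_steps: int = 10_000) -> float:
    """Learning rate at a zero-based optimizer step.

    Assumption: linear ramp to base_lr during warm-up, then constant.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


if __name__ == "__main__":
    for step in (0, 4_999, 9_999, 50_000):
        lr = learning_rate(step, TRAIN_CONFIG["base_lr"], TRAIN_CONFIG["warmup_steps"])
        print(step, lr)
```

A decaying schedule after warm‑up (cosine, for example) would be an equally plausible reading; swapping it in only changes the final `return`.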

Zero‑Shot Retrieval: On MS‑COCO, SiCLIP surpasses all models trained on ≤20 M samples. On Flickr30k, it matches TinyCLIP’s performance while using ~3% of the training samples and 14% fewer image‑encoder parameters, and outperforms larger models such as DataComp‑B/32 and OpenAI‑X.

Zero‑Shot Classification: Across ImageNet‑1k, ImageNet‑V2, ImageNet‑R, and ImageNet‑S, SiCLIP achieves higher top‑1 accuracy than other models trained on similarly sized datasets, approaching the state‑of‑the‑art DataComp‑B/16 despite using far less data.

Inference Speed: On an Intel Xeon Silver‑4314 CPU, SiCLIP processes 39.5 images/s for a 1000‑image batch, slightly faster than MobileCLIP‑S0’s 38.2 images/s, demonstrating the efficiency of SAS‑P blocks.

Ablation Studies:

Training on CC12M‑SYN reduces loss faster than on raw CC12M and yields better zero‑shot performance, confirming the benefit of synthetic captions.

Comparisons of baseline, WI‑only, WIKD‑only, and combined WIKD + PM configurations show that WI improves classification (+13.0 acc1) and retrieval (+15.9 R@1); WIKD adds further gains (+25.4 acc1); the full SiCLIP (WIKD + PM) attains the highest scores, validating both components.

Conclusion

SiCLIP demonstrates that a carefully simplified Transformer, combined with weight sharing, multi‑stage distillation, and a pair‑matching loss on an augmented synthetic dataset, enables competitive CLIP training on consumer‑level hardware. The approach reduces model size, speeds inference, and can be adapted to other domains.
