Is CLIP Obsolete? LeCun and Xie's New Multimodal Model Beats Language Supervision

A recent study by LeCun, Xie, and collaborators shows that large‑scale visual self‑supervised learning (Web‑SSL) can match or surpass CLIP on diverse VQA tasks, even without any language supervision, by scaling model size and data volume.


Background

Multimodal vision‑language models typically rely on language‑supervised contrastive training (e.g., CLIP) using billions of image‑text pairs. Self‑supervised learning (SSL) learns from images only. This work asks whether language supervision is required for strong multimodal visual representations.

Method: Web‑SSL (Web‑DINO)

A family of vision‑only SSL models is trained on up to 2 billion web images (MC‑2B) without captions. The training follows the DINOv2 recipe on ViT backbones, scaling model parameters from 1 B to 7 B while keeping the data corpus fixed. For a controlled comparison, CLIP models of identical architecture and data scale are trained with language supervision on the same image set.
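The DINOv2 recipe at the heart of Web-DINO is self-distillation: a student network learns to match the output distribution of an EMA teacher across augmented views, with no labels or captions involved. A minimal NumPy sketch of the core DINO loss (heavily simplified: one view pair, a fixed center, and none of the multi-crop, iBOT masked-image, or regularization terms the full recipe uses):

```python
import numpy as np

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between a sharpened teacher distribution and the student.

    Simplified single-pair sketch of the DINO self-distillation objective;
    the real DINOv2 recipe adds multi-crop views, masked-image modeling
    (iBOT), and further regularizers.
    """
    # Teacher: centered, sharpened with a low temperature, treated as fixed.
    t = np.exp((teacher_logits - center) / tau_t)
    t /= t.sum(axis=-1, keepdims=True)
    # Student: numerically stable log-softmax at a higher temperature.
    s = student_logits / tau_s
    s = s - s.max(axis=-1, keepdims=True)
    log_s = s - np.log(np.exp(s).sum(axis=-1, keepdims=True))
    return float(-(t * log_s).sum(axis=-1).mean())
```

Minimizing this pushes the student's distribution toward the teacher's sharpened targets; the centering term keeps the teacher from collapsing to a single output.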

Evaluation Protocol

The primary benchmark is the Cambrian‑1 VQA suite, which contains 16 tasks across four categories: General, Knowledge, OCR & Chart, and Vision‑Centric. Classic vision benchmarks (ImageNet‑1k classification and ADE20k semantic segmentation) are also evaluated. Inference resolution experiments use three image sizes: 224 px, 378 px, and 518 px.
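Results on such a suite are usually reported as per-category means plus an overall average of those means. A small sketch of that aggregation (the task names and scores here are illustrative placeholders, not the actual Cambrian-1 task list):

```python
# Hypothetical per-task accuracies grouped into the four Cambrian-1 categories.
scores = {
    "General":        {"task_a": 71.2, "task_b": 64.5},
    "Knowledge":      {"task_c": 58.3, "task_d": 49.9},
    "OCR & Chart":    {"task_e": 62.0, "task_f": 47.1},
    "Vision-Centric": {"task_g": 55.4, "task_h": 60.2},
}

def category_means(scores):
    """Mean score per category, plus the average of the category means."""
    means = {cat: sum(t.values()) / len(t) for cat, t in scores.items()}
    means["Average"] = sum(means.values()) / len(means)
    return means
```

Averaging the category means (rather than all tasks directly) keeps a category with many tasks from dominating the headline number.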

Scaling Model Capacity

Web‑DINO models trained on 2 B images with parameter counts 1 B, 3 B, 5 B, and 7 B show near‑log‑linear improvement on OCR & Chart and Vision‑Centric VQA, while CLIP performance saturates after ≈3 B parameters.

At 5 B–7 B parameters, Web‑DINO matches or exceeds CLIP on average VQA and surpasses it on OCR & Chart (text‑heavy) tasks.

CLIP performance plateaus across all VQA categories beyond 3 B parameters, indicating limited scaling benefit.

Scaling Data Quantity

Fixing the model at 7 B parameters and varying the number of seen images from 1 B to 8 B reveals that General and Knowledge VQA improve up to ≈2–4 B images and then plateau, whereas OCR & Chart VQA continues to improve all the way to 8 B images.

Across all data scales, Web‑DINO outperforms same‑size CLIP on average VQA, with the gap widening at larger data volumes.
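The "near-log-linear" claim can be checked by regressing score against the log of images seen; a positive slope with a tight fit at the largest data scale indicates scaling has not yet saturated. A sketch with invented numbers (the scores below are illustrative, not taken from the paper):

```python
import numpy as np

# Illustrative points: billions of images seen vs. OCR & Chart VQA score.
# These values are made up for the example, not the paper's measurements.
images_b = np.array([1.0, 2.0, 4.0, 8.0])
vqa_score = np.array([52.0, 55.1, 57.9, 60.8])

# Fit score ~ slope * log2(images) + intercept (log-linear scaling law).
slope, intercept = np.polyfit(np.log2(images_b), vqa_score, deg=1)
```

Under a log-linear law, each doubling of the data adds a roughly constant number of points; `slope` is that per-doubling gain.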

Resolution and Fine‑tuning

After pre‑training, each model is fine‑tuned for 20 k steps at three inference resolutions (224 → 378 → 518 px). Higher resolution consistently raises VQA scores, especially for OCR & Chart, narrowing the gap with high‑resolution CLIP variants.
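Why resolution matters so much for text-heavy tasks is visible in the token arithmetic: a ViT with patch size 14 cuts a 224 px image into a 16×16 grid but a 518 px image into a 37×37 grid, so fine print and chart labels receive far more tokens, at a quadratic cost in self-attention. A quick sketch (patch size 14 assumed; the helper name is ours):

```python
def patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    grid = image_size // patch_size
    return grid * grid

for res in (224, 378, 518):
    n = patch_tokens(res)
    # Self-attention cost grows roughly with the square of the token count.
    print(f"{res} px -> {n} tokens")
```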

On classic vision tasks, resolution increase yields modest gains; at 518 px Web‑DINO approaches or exceeds SigLIP performance on ImageNet‑1k.

Comparison with Existing Vision Encoders

Table 3 (referenced in the paper) shows Web‑DINO surpassing MetaCLIP, SigLIP, and DINOv2 on both VQA and traditional vision benchmarks, despite training on images alone with roughly five times less data. The 7 B Web‑DINO model achieves higher top‑1 accuracy on ImageNet‑1k and higher mIoU on ADE20k than the strongest language‑supervised baselines.

Key Insights

Visual SSL can achieve parity with language‑supervised contrastive pre‑training on multimodal tasks when model and data are scaled.

Scaling benefits are asymmetric: OCR & Chart VQA continues to improve with more data, while General/Knowledge VQA saturates earlier.

Model capacity beyond 7 B parameters remains promising, as SSL shows no saturation within the observed range.

Higher inference resolution is an effective, low‑cost way to boost performance on text‑heavy VQA.

Future Directions

The authors plan to open‑source the Web‑SSL checkpoints to enable community research on language‑free multimodal training, and to explore larger models (>7 B parameters), richer data compositions (e.g., more text‑rich images), and further high‑resolution adaptation.

Reference

Paper: https://arxiv.org/pdf/2504.01017

Figure captions (from the article's images):

Result table showing that visual SSL, when scaled in model and data size, matches or exceeds language‑supervised models across all evaluation domains, including OCR and chart tasks
Web‑DINO model series (1 B–7 B parameters) trained solely on web images
Scaling behavior of Web‑DINO vs CLIP: SSL continues to improve with larger models while CLIP saturates
Performance vs number of training images: OCR & Chart VQA keeps improving up to 8 B images
Web‑DINO ViT‑7B matches CLIP on VQA without language supervision and surpasses it on classic vision benchmarks
Tags: multimodal, model scaling, CLIP, VQA, visual self-supervised learning, Web-SSL
Written by AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
