Is CLIP Obsolete? LeCun and Xie's New Multimodal Model Beats Language Supervision
A recent study by LeCun, Xie, and collaborators shows that large‑scale visual self‑supervised learning (Web‑SSL) can match or surpass CLIP on diverse VQA tasks, even without any language supervision, by scaling model size and data volume.
Background
Multimodal vision‑language models typically rely on language‑supervised contrastive training (e.g., CLIP) using billions of image‑text pairs. Self‑supervised learning (SSL) learns from images only. This work asks whether language supervision is required for strong multimodal visual representations.
Method: Web‑SSL (Web‑DINO)
A family of vision‑only SSL models is trained on up to 2 billion web images (MC‑2B) without captions. The training follows the DINOv2 recipe on ViT backbones, scaling model parameters from 1 B to 7 B while keeping the data corpus fixed. For a controlled comparison, CLIP models of identical architecture and data scale are trained with language supervision on the same image set.
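The training objective itself is not reproduced here; as a rough, hedged sketch of the DINO-style self-distillation idea that the DINOv2 recipe builds on (not the authors' exact implementation), the snippet below trains a student ViT to match the sharpened output distribution of an EMA teacher across two augmented views of the same caption-free image. The student/teacher modules, temperatures, and momentum are placeholders, and details such as output centering and multi-crop augmentation are omitted.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def update_teacher(student, teacher, momentum=0.996):
        # EMA update: the teacher slowly tracks the student's weights.
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

    def dino_loss(student, teacher, view_a, view_b,
                  student_temp=0.1, teacher_temp=0.04):
        # Teacher targets: sharpened softmax over projected features, no gradient.
        with torch.no_grad():
            t_a = F.softmax(teacher(view_a) / teacher_temp, dim=-1)
            t_b = F.softmax(teacher(view_b) / teacher_temp, dim=-1)
        # Student log-probabilities for each view.
        s_a = F.log_softmax(student(view_a) / student_temp, dim=-1)
        s_b = F.log_softmax(student(view_b) / student_temp, dim=-1)
        # Cross-view cross-entropy: each view predicts the teacher's other view.
        return 0.5 * (-(t_a * s_b).sum(-1).mean() - (t_b * s_a).sum(-1).mean())

    # Illustrative training step (teacher starts as a frozen copy of the student):
    #   loss = dino_loss(student, teacher, augment(images), augment(images))
    #   loss.backward(); optimizer.step(); optimizer.zero_grad()
    #   update_teacher(student, teacher)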
Evaluation Protocol
The primary benchmark is the Cambrian‑1 VQA suite, which contains 16 tasks across four categories: General, Knowledge, OCR & Chart, and Vision‑Centric. Classic vision benchmarks (ImageNet‑1k classification and ADE20k semantic segmentation) are also evaluated. Inference resolution experiments use three image sizes: 224 px, 378 px, and 518 px.
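The article reports category-level and overall averages rather than the 16 individual task scores. As a hedged illustration of how such a suite is typically aggregated, the snippet below macro-averages per-task scores within each of the four categories and then across categories; the task names and numbers are placeholders, not results from the paper.

    from statistics import mean

    # Placeholder per-task scores (illustrative only, not numbers from the paper).
    scores = {
        "general_task_1": 61.2, "general_task_2": 58.0,
        "knowledge_task_1": 52.3,
        "ocr_chart_task_1": 44.5, "ocr_chart_task_2": 47.1,
        "vision_centric_task_1": 55.8,
    }

    # Illustrative grouping into the four Cambrian-1 categories.
    categories = {
        "General":        ["general_task_1", "general_task_2"],
        "Knowledge":      ["knowledge_task_1"],
        "OCR & Chart":    ["ocr_chart_task_1", "ocr_chart_task_2"],
        "Vision-Centric": ["vision_centric_task_1"],
    }

    category_avg = {c: mean(scores[t] for t in tasks) for c, tasks in categories.items()}
    overall_avg = mean(category_avg.values())  # macro-average across categories
    print(category_avg, round(overall_avg, 1))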
Scaling Model Capacity
Web‑DINO models trained on 2 B images with parameter counts 1 B, 3 B, 5 B, and 7 B show near‑log‑linear improvement on OCR & Chart and Vision‑Centric VQA, while CLIP performance saturates after ≈3 B parameters.
At 5 B–7 B parameters, Web‑DINO matches or exceeds CLIP on average VQA and surpasses it on OCR & Chart (text‑heavy) tasks.
CLIP performance plateaus across all VQA categories beyond 3 B parameters, indicating limited scaling benefit.
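"Near-log-linear" here means the VQA score grows roughly in proportion to the logarithm of the parameter count. A quick way to check such a trend on a handful of (model size, score) points is a one-dimensional least-squares fit against log(params); the numbers below are hypothetical placeholders, not values from the paper.

    import numpy as np

    # Hypothetical (parameters in billions, average VQA score) points.
    params = np.array([1.0, 3.0, 5.0, 7.0])
    score = np.array([40.0, 43.1, 44.6, 45.8])

    # Fit score = a * log(params) + b; small residuals indicate a log-linear trend.
    a, b = np.polyfit(np.log(params), score, deg=1)
    residuals = score - (a * np.log(params) + b)
    print(f"slope per e-fold of parameters: {a:.2f}, "
          f"max residual: {np.abs(residuals).max():.2f}")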
Scaling Data Quantity
Fixing the model at 7 B parameters and varying the number of seen images from 1 B to 8 B reveals that General and Knowledge VQA improve up to ≈2–4 B images and then plateau, whereas OCR & Chart VQA continues to improve up to 8 B images.
Across all data scales, Web‑DINO outperforms same‑size CLIP on average VQA, with the gap widening at larger data volumes.
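Since the corpus itself stays at 2 B images, "seen images" is a compute budget rather than a dataset size; an 8 B budget revisits each image roughly four times. A small bookkeeping sketch (the batch size is an assumed placeholder, not the paper's value):

    def training_budget(images_seen, corpus_size=2_000_000_000, batch_size=4096):
        # Optimizer steps needed to consume the budget, and the equivalent
        # number of epochs over the fixed MC-2B corpus.
        steps = images_seen // batch_size
        epochs = images_seen / corpus_size
        return steps, epochs

    for budget in (1_000_000_000, 2_000_000_000, 4_000_000_000, 8_000_000_000):
        steps, epochs = training_budget(budget)
        print(f"{budget / 1e9:.0f}B images seen -> {steps:,} steps, {epochs:.1f} epochs")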
Resolution and Fine‑tuning
After pre‑training, each model is fine‑tuned for 20 k steps at three inference resolutions (224 → 378 → 518 px). Higher resolution consistently raises VQA scores, especially for OCR & Chart, narrowing the gap with high‑resolution CLIP variants.
On classic vision tasks, resolution increase yields modest gains; at 518 px Web‑DINO approaches or exceeds SigLIP performance on ImageNet‑1k.
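The article does not describe how an encoder pretrained at 224 px handles 378 px or 518 px inputs. A common mechanism in DINOv2-style ViTs, sketched here as an assumption rather than the authors' stated procedure, is to bicubically interpolate the patch position embeddings to the larger grid before fine-tuning (note that 224, 378, and 518 are all multiples of the 14-pixel patch size):

    import torch
    import torch.nn.functional as F

    def resize_pos_embed(pos_embed, old_res, new_res, patch=14):
        # pos_embed: [1, 1 + N, D] with a leading CLS token and
        # N = (old_res // patch) ** 2 patch positions.
        # Returns [1, 1 + M, D] for the new resolution's patch grid.
        cls_tok, grid = pos_embed[:, :1], pos_embed[:, 1:]
        old_side, new_side = old_res // patch, new_res // patch
        dim = grid.shape[-1]
        grid = grid.reshape(1, old_side, old_side, dim).permute(0, 3, 1, 2)
        grid = F.interpolate(grid, size=(new_side, new_side),
                             mode="bicubic", align_corners=False)
        grid = grid.permute(0, 2, 3, 1).reshape(1, new_side * new_side, dim)
        return torch.cat([cls_tok, grid], dim=1)

    # Example: adapt 224 px embeddings (16x16 patches) to 518 px (37x37 patches).
    pe_224 = torch.randn(1, 1 + 16 * 16, 1024)
    pe_518 = resize_pos_embed(pe_224, old_res=224, new_res=518)
    print(pe_518.shape)  # torch.Size([1, 1370, 1024])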
Comparison with Existing Vision Encoders
Table 3 (referenced in the paper) shows Web‑DINO surpassing MetaCLIP, SigLIP, and DINOv2 on both VQA and traditional vision benchmarks, despite training on roughly five times fewer images and without any captions. The 7 B Web‑DINO model achieves higher top‑1 accuracy on ImageNet‑1k and higher mIoU on ADE20k than the strongest language‑supervised baselines.
Key Insights
Visual SSL can achieve parity with language‑supervised contrastive pre‑training on multimodal tasks when model and data are scaled.
Scaling benefits are asymmetric: OCR & Chart VQA continues to improve with more data, while General/Knowledge VQA saturates earlier.
Model capacity beyond 7 B parameters remains promising, as SSL shows no saturation within the observed range.
Higher inference resolution is an effective, low‑cost way to boost performance on text‑heavy VQA.
Future Directions
The authors plan to open‑source the Web‑SSL checkpoints to enable community research on language‑free multimodal training, and to explore larger models (beyond 7 B parameters), richer data compositions (e.g., more text‑rich images), and further high‑resolution adaptation.
Reference
Paper: https://arxiv.org/pdf/2504.01017