Boost Black-Box VLMs Without Training: Class-Aware Prompt Reweighting (CARPRT)

The article analyzes the prompt‑sensitivity problem of zero‑shot classification in vision‑language models, critiques class‑agnostic prompt weighting, and presents CARPRT, a training‑free, black‑box‑compatible method that reweights prompts per class using similarity scores and pseudo‑labels and achieves consistent gains across datasets and model architectures.

Black-Box Optimization · Class-Aware Modeling · Prompt Reweighting

11 min read

HyperAI Super Neural

Mar 12, 2026 · Artificial Intelligence

Stanford’s Merlin: Single‑GPU 3D Abdominal CT Vision‑Language Model Leads 752 Tasks

Stanford researchers introduced Merlin, the first native 3D abdominal CT vision‑language foundation model, trained on a single NVIDIA A6000 GPU using a 25,494‑scan dataset. Across 752 benchmark tasks, including zero‑shot classification, phenotype prediction, cross‑modal retrieval, disease forecasting, report generation, and 3D segmentation, it outperformed existing baselines.

3D CT · Disease Prediction · Multi-Task Learning

18 min read

xkx's Tech General Store

Jan 29, 2026 · Artificial Intelligence

Understanding CLIP: The Image‑Text Translator Behind Text‑to‑Image Models

This article explains CLIP’s dual‑encoder architecture, contrastive training, and zero‑shot inference, then demonstrates its use through image‑text matching and CIFAR‑10 classification experiments with code examples, highlighting strengths and limitations such as resolution mismatch.

CLIP · Image-Text Matching · PyTorch

11 min read

AI Algorithm Path

Jun 29, 2025 · Artificial Intelligence

Understanding CLIP: Theory, Architecture, and Zero‑Shot Vision

CLIP (Contrastive Language‑Image Pre‑training) is an OpenAI model that learns visual concepts from 400 million image‑text pairs using a dual‑encoder architecture, enabling zero‑shot classification, flexible text‑driven search, and cross‑modal reasoning. The article examines its strengths, limitations, and emerging applications in detail.

CLIP · Contrastive Language-Image Pretraining · Dual Encoder

15 min read

Network Intelligence Research Center (NIRC)

May 14, 2025 · Artificial Intelligence

Hands‑On CLIP: Implementing Multimodal Vision‑Language Understanding

This article introduces OpenAI’s CLIP multimodal model, explains its architecture and contrastive training, details hardware and installation steps, and demonstrates a hands‑on zero‑shot image classification workflow that achieves 97% confidence on a cat image without any task‑specific fine‑tuning.

CLIP · Python · Vision-Language

6 min read
