Stanford’s Merlin: Single‑GPU 3D Abdominal CT Vision‑Language Model Leads Across 752 Tasks
Stanford researchers introduced Merlin, the first native 3D abdominal CT vision‑language foundation model, trained on a single NVIDIA A6000 GPU with a 25,494‑scan dataset. Across 752 benchmark tasks, including zero‑shot classification, phenotype prediction, cross‑modal retrieval, disease forecasting, report generation, and 3D segmentation, Merlin outperformed existing baselines.
Computed Tomography (CT) is a widely used imaging modality, with abdominal CT accounting for about a quarter of the roughly 300 million CT exams performed worldwide each year. Interpreting a single abdominal CT scan typically takes a radiologist 20 minutes, creating a bottleneck as demand grows and radiology staffing shortages worsen, with projected shortfalls exceeding 19,000 physicians in some regions by 2036.
Vision‑language models (VLMs) such as CLIP have shown that aligning visual and textual embeddings enables zero‑shot learning and, when combined with large language models, can adapt to radiology tasks. Existing VLM research (e.g., BiomedCLIP, LLaVA‑Rad, Med‑PaLM M) focuses on 2‑D images, leaving a gap for native 3‑D abdominal CT analysis and for publicly available training and evaluation datasets.
Filling the VLM Training and Evaluation Data Gap
Stanford’s team assembled a large‑scale dataset of 25,494 paired abdominal CT scans and radiology reports, sourced from real hospitals. The dataset includes 10,628,509 axial slices, the "findings" sections of reports (10,051,571 tokens), and structured EHR diagnostic codes (954,013 ICD‑9 entries covering 5,686 unique codes; 2,041,280 ICD‑10 entries covering 10,867 unique codes). Data were split 60 %/20 %/20 % for training, validation, and testing, ensuring no patient appears in multiple splits. Three external validation sets (6,997; 25,986; and 4,872 abdominal CT scans plus 6,243 chest CT scans) were also used.
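For readers building a similar pipeline, a patient‑level split is enforced by grouping on a patient identifier rather than splitting scans directly, so that all scans from one patient land in exactly one split. A minimal Python sketch of this idea follows; the column name and the scikit‑learn GroupShuffleSplit usage are illustrative assumptions, not details from the paper:

```python
# Patient-level 60/20/20 split: every scan from a given patient
# lands in exactly one of train/val/test. Column names are illustrative.
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(df, group_col="patient_id", seed=42):
    # First carve off 60% of patients for training.
    gss = GroupShuffleSplit(n_splits=1, train_size=0.6, random_state=seed)
    train_idx, rest_idx = next(gss.split(df, groups=df[group_col]))
    rest = df.iloc[rest_idx]
    # Split the remaining 40% of patients evenly into val/test.
    gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    val_idx, test_idx = next(gss2.split(rest, groups=rest[group_col]))
    return df.iloc[train_idx], rest.iloc[val_idx], rest.iloc[test_idx]
```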
Multi‑Task Learning and Staged Training Strategy
Merlin employs a dual‑encoder architecture: an I3D‑ResNet152 image encoder (inflated from 2‑D weights) and a Clinical Longformer text encoder supporting 4,096‑token contexts. Training uses two loss functions: binary cross‑entropy for phenotype classification and InfoNCE for contrastive alignment of scan and report embeddings. Both encoders use gradient checkpointing and FP16 mixed precision. The AdamW optimizer was applied with an initial learning rate of 1 × 10⁻⁵ (β = (0.9, 0.999)) and cosine decay over 300 epochs; a batch size of 18 fit on a single 48 GB NVIDIA A6000 GPU.
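The two objectives combine naturally into a single multi‑task loss. A minimal PyTorch sketch of such a combination is below; the function names, tensor shapes, temperature value, and loss weight are assumptions for illustration, not the paper's reference code:

```python
import torch
import torch.nn.functional as F

def merlin_style_losses(img_emb, txt_emb, phenotype_logits, phenotype_labels,
                        temperature=0.07, phenotype_weight=1.0):
    """Sketch of the two training objectives described above.

    img_emb / txt_emb: (B, D) embeddings from the image and text encoders.
    phenotype_logits:  (B, C) per-scan logits for EHR-derived phenotypes.
    All names and the temperature value are illustrative, not from the paper.
    """
    # Symmetric InfoNCE: matching scan/report pairs lie on the diagonal.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2
    # Multi-label BCE over structured EHR phenotype codes.
    phenotype = F.binary_cross_entropy_with_logits(
        phenotype_logits, phenotype_labels.float())
    return contrastive + phenotype_weight * phenotype
```

The contrastive term pulls matching scan–report pairs together, while the BCE term keeps the image encoder predictive of structured EHR phenotypes.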
Training proceeds in two stages: Stage 1 pre‑trains the image encoder on EHR diagnostic codes; Stage 2 fine‑tunes with radiology reports while retaining Stage 1 knowledge via a reduced weight on the phenotype loss. Stage 1 uses AdamW with a learning rate of 1 × 10⁻⁴ and exponential decay (γ = 0.99); Stage 2 shares the multi‑task hyper‑parameters above.
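Assuming standard PyTorch schedulers, the two stages' optimizer settings could be wired up as below. Only the quoted learning rates, betas, decay factor, and epoch count come from the article; the rest is an illustrative reconstruction:

```python
import torch

def make_optimizer_and_scheduler(model, stage, epochs=300):
    # Stage-specific settings follow the hyper-parameters quoted above;
    # how they are wired together here is an illustrative reconstruction.
    if stage == 1:
        # Stage 1: image-encoder pre-training on EHR diagnostic codes.
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
        sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)
    else:
        # Stage 2: multi-task fine-tuning with radiology reports.
        opt = torch.optim.AdamW(model.parameters(), lr=1e-5, betas=(0.9, 0.999))
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched
```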
Comprehensive Evaluation Across 752 Tasks
Merlin was evaluated on six task families covering 752 fine‑grained tasks: zero‑shot classification (31 tasks), phenotype classification (692 tasks), zero‑shot cross‑modal retrieval (23 tasks), 5‑year disease prediction (6 tasks), radiology report generation, and 3‑D semantic segmentation.
In zero‑shot classification on internal and external test sets, Merlin achieved an internal F1 of 0.741 (95 % CI 0.727‑0.755) and an external average F1 of 0.647 (95 % CI 0.607‑0.678), significantly outperforming 2‑D OpenCLIP with top‑k (k = 1) slice pooling and fine‑tuned 2‑D BiomedCLIP (P < 0.001). Ablation showed that report segmentation contributed a 7.9‑point F1 gain (P < 0.01).
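Zero‑shot classification in this setting follows the usual CLIP recipe: encode a positive and a negative text prompt per finding and pick whichever lies closer to the scan embedding. A sketch under assumed interfaces (the encoder signatures and prompt templates are hypothetical; the paper's exact prompts may differ):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, volume, finding):
    """CLIP-style zero-shot classification of one finding on one CT volume.

    `image_encoder`, `text_encoder`, and the prompt templates are
    placeholders; the paper's exact prompts may differ.
    """
    prompts = [f"No evidence of {finding}.",
               f"Findings consistent with {finding}."]
    img = F.normalize(image_encoder(volume.unsqueeze(0)), dim=-1)  # (1, D)
    txt = F.normalize(text_encoder(prompts), dim=-1)               # (2, D)
    scores = (img @ txt.t()).squeeze(0)                            # (2,)
    return scores.argmax().item()  # 0 = absent, 1 = present
```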
For phenotype classification, Merlin attained a macro‑AUROC of 0.812 (95 % CI 0.808‑0.816), with 258 phenotypes above 0.85 AUROC and 102 above 0.90. It excelled at multi‑organ disease detection, especially for the liver, kidney, ureter, and gastrointestinal systems.
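Macro‑AUROC here is the unweighted mean of per‑phenotype AUROCs. A small sketch using scikit‑learn; the degenerate‑column handling is an assumption, since rare phenotypes may be absent from a given split:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def macro_auroc(y_true, y_score):
    """Macro-average AUROC over phenotypes, skipping degenerate columns.

    y_true:  (N, C) binary phenotype labels.
    y_score: (N, C) predicted scores.
    """
    aurocs = []
    for c in range(y_true.shape[1]):
        # AUROC is undefined when a phenotype has only one class in the split.
        if len(np.unique(y_true[:, c])) == 2:
            aurocs.append(roc_auc_score(y_true[:, c], y_score[:, c]))
    return float(np.mean(aurocs))
```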
In zero‑shot cross‑modal retrieval, Merlin outperformed OpenCLIP and BiomedCLIP on both image‑to‑finding and finding‑to‑image queries, benefiting from Clinical Longformer's longer context (4,096 tokens vs. 77 for OpenCLIP and 256 for BiomedCLIP).
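With a dual encoder, retrieval reduces to nearest‑neighbor search in the shared embedding space; the direction (image‑to‑finding or finding‑to‑image) only changes which side is the query. An illustrative sketch with assumed shapes:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_findings(img_emb, finding_embs, k=5):
    """Image-to-finding retrieval: rank candidate report findings by cosine
    similarity to a query scan embedding. Shapes and names are illustrative.
    """
    img = F.normalize(img_emb, dim=-1)        # (D,) query scan embedding
    txt = F.normalize(finding_embs, dim=-1)   # (N, D) candidate findings
    scores = txt @ img                        # (N,) cosine similarities
    return scores.topk(k).indices             # indices of top-k findings
```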
For 5‑year disease prediction, fine‑tuned Merlin reached an AUROC of 0.757 (95 % CI 0.743‑0.772) with 100 % of downstream labels, 7 % higher than an ImageNet‑pretrained I3D baseline; with only 10 % of labels, AUROC remained 0.708, still 4.4 % above the baseline.
In radiology report generation, Merlin surpassed the RadFM baseline across RadGraph‑F1, BERTScore, ROUGE‑2, and BLEU, producing highly accurate and anatomically coherent reports, though occasional conservative omissions were noted.
For 3‑D semantic segmentation, Merlin trained on only 10 % of the data achieved a macro‑average Dice score 4.7 % higher than nnU‑Net's and led on 12 of 20 organs, with a notable 41 % improvement in prostate segmentation; with full training data, nnU‑Net marginally outperformed Merlin by 0.006 Dice.
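For reference, the Dice score quoted above measures volumetric overlap between predicted and ground‑truth organ masks; a minimal implementation (the smoothing constant is a common convention, not from the paper):

```python
import torch

def dice_score(pred, target, eps=1e-6):
    """Dice coefficient for one organ's binary mask (3-D volumes).

    pred, target: boolean tensors of shape (D, H, W). Macro-average Dice
    is the mean of this score over all organs.
    """
    intersection = (pred & target).sum().float()
    return float((2 * intersection + eps) /
                 (pred.sum() + target.sum() + eps))
```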
External validation on 44,098 scans from multiple sites showed stable performance despite distribution shifts, with Merlin outperforming baselines even on chest‑CT tasks.
Deep Mining of Large‑Scale Multimodal Medical Data
The success of Merlin underscores the potential of VLMs to integrate imaging, structured EHR, and free‑text reports, enabling efficient disease detection, prognosis, and report automation, and moving radiology toward data‑driven decision support.