Meta’s Open‑Source MILS Enables LLMs to See and Hear Without Training – SOTA on Images, Video, and Audio
The paper introduces MILS, a training‑free multimodal iterative LLM solver that lets large language models perceive and generate across image, video, and audio domains, achieving new state‑of‑the‑art results without any task‑specific data or fine‑tuning.
Overview
MILS (Multimodal Iterative LLM Solver) is a training‑free framework in which a large language model (LLM) acts as a GENERATOR that proposes candidate solutions, while a pretrained multimodal model (e.g., CLIP) serves as a SCORER that evaluates each candidate and feeds its score back to the generator. The loop repeats until convergence or a preset iteration limit, requiring only the test‑time inputs.
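The generate‑score loop can be sketched in a few lines of Python. This is a minimal stand‑in, not the paper's implementation: `generate` and `score` are placeholders for the LLM generator and the multimodal scorer, and the candidates here are plain strings.

```python
def mils_loop(generate, score, init_pool, top_k=5, max_iters=10):
    """Training-free generate-score loop in the spirit of MILS.

    `generate` maps a list of (candidate, score) feedback pairs to a new
    list of candidate strings; `score` maps a candidate to a scalar.
    Both are stand-ins for the paper's LLM generator and multimodal scorer.
    """
    candidates = list(init_pool)
    best = None
    for _ in range(max_iters):
        # Score every candidate and keep the top-K as feedback.
        scored = sorted(((score(c), c) for c in candidates), reverse=True)
        top = scored[:top_k]
        if best is None or top[0][0] > best[0]:
            best = top[0]
        # Feed the best-scoring candidates back to the generator.
        candidates = generate([(c, s) for s, c in top])
    return best[1], best[0]
```

Because only scalar scores flow back, the loop is gradient‑free: any black‑box generator and scorer pair can be plugged in.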
Problem Addressed
Multimodal tasks—image, video, and audio description, generation, editing, and cross‑modal reasoning—are currently handled by task‑specific models trained on curated datasets, which limits generalization to new tasks or modalities.
Proposed Solution
MILS eliminates the need for any task‑specific training by iteratively refining LLM‑generated candidates using a gradient‑free optimization loop driven by multimodal scorers. The method works for any modality where a suitable scorer exists.
Key Technical Components
Generator: usually an LLM (e.g., Llama 3.1 8B; Dubey et al., 2024) that receives the task description and optional scorer feedback, then outputs a set of candidate solutions.
Scorer: a pretrained multimodal model such as CLIP, SigLIP, ViCLIP, ImageBind, or PickScore that computes a scalar similarity or preference score for each candidate.
Iterative Feedback: the top‑K scored candidates are fed back to the generator for the next round, optionally seeded by an initial candidate pool.
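For CLIP‑style scorers, the score reduces to a cosine similarity between the input's embedding and each candidate's text embedding. A minimal NumPy sketch of scoring and top‑K selection, with arrays standing in for real CLIP/SigLIP embeddings:

```python
import numpy as np

def cosine_scores(image_emb, text_embs):
    """Cosine similarity between one image embedding (D,) and
    a batch of candidate text embeddings (N, D)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img  # one similarity per candidate

def top_k(candidates, scores, k):
    """Keep the k best-scoring candidates to feed back to the generator."""
    order = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in order]
```

Swapping the scorer (SigLIP for images, ViCLIP for video, ImageBind for audio) changes only where the embeddings come from; the selection step stays identical.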
Experiments
Image Captioning: using Llama 3.1 8B as generator and SigLIP as scorer, MILS starts from 30,000 prompts, runs 10 optimization rounds, and selects the top 50 candidates each round. Evaluated on MS‑COCO (5,000 images) with BLEU, METEOR, CIDEr, and SPICE, MILS outperforms ZeroCap and MeaCap, achieving higher METEOR and SPICE scores despite having no caption‑training data.
Video Captioning: the same generator is paired with ViCLIP (ViT‑L/14) as scorer on the MSR‑VTT test set (2,990 videos). Compared against a model trained on HowTo100M (Nagrani et al., 2022) and a VideoCC3M‑trained variant, MILS attains superior CIDEr and METEOR scores, demonstrating zero‑shot transfer.
Audio Captioning: using ImageBind as scorer and 50,000 audio prompts, MILS is evaluated on the Clotho dataset. It surpasses the zero‑shot baseline ZerAuCaps on METEOR and SPICE, confirming cross‑modal generalization.
High‑Quality Image Generation : MILS rewrites text prompts for diffusion models (LDM and FLUX.1) using the generator, while PickScore evaluates image‑text alignment. Human evaluation via AMT (JUICE framework) shows a clear preference for MILS‑enhanced images in both visual quality and textual fidelity.
Style Transfer & Cross‑Modal Computation : By feeding test images (or audio) to the generator as additional context, MILS performs zero‑shot style transfer using Gram‑matrix loss and even combines modalities (e.g., audio + image → text → image) via ImageBind embeddings.
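The Gram‑matrix loss mentioned above compares second‑order statistics of feature maps, which capture texture ("style") while discarding spatial layout. A minimal NumPy sketch, assuming channels‑first `(C, H, W)` feature maps (the normalization constant is a common convention, not necessarily the paper's exact choice):

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map: channel-channel
    correlations that encode texture/style, not spatial layout."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return (f @ f.T) / (c * h * w)

def style_loss(feat_gen, feat_style):
    """Mean squared difference between the Gram matrices of the
    generated and the style-reference feature maps."""
    return float(np.mean((gram_matrix(feat_gen) - gram_matrix(feat_style)) ** 2))
```

In practice the features would come from a pretrained vision backbone; here the loss serves as the scalar feedback that the MILS loop minimizes when rewriting prompts for style transfer.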
Ablation Studies
Impact of initial candidate pool size: larger pools yield better final performance, indicating the importance of diverse starting points.
Generator and scorer scale: larger LLMs and CLIP variants improve captioning metrics, with LLM size showing the strongest gains.
Optimization steps: performance improves up to ~10–20 steps before plateauing; scorer loss correlates tightly with downstream metrics.
Conclusion & Future Work
MILS demonstrates that a simple generate‑score loop can give LLMs zero‑shot perception (“see”) and audio understanding (“hear”) across modalities without any task‑specific data. Limitations include dependence on the generator’s ability to produce diverse candidates, scorer accuracy, and optimization speed. Future directions involve faster gradient‑free optimizers, larger multimodal backbones, and extending to 3D or spatial tasks.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.