Meta’s Open‑Source MILS Enables LLMs to See and Hear Without Training – SOTA on Images, Video, and Audio

The paper introduces MILS, a training‑free multimodal iterative LLM solver that lets large language models perceive and generate across image, video, and audio domains, achieving new state‑of‑the‑art results without any task‑specific data or fine‑tuning.

AIWalker

Overview

MILS (Multimodal Iterative LLM Solver) is a training‑free framework in which a large language model (LLM) acts as a GENERATOR that proposes candidate solutions, while a pretrained multimodal model (e.g., CLIP) serves as a SCORER that evaluates the candidates and feeds the scores back. The loop repeats until convergence or a preset iteration limit and requires nothing beyond the test‑time input.
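In code, the loop is only a few lines. The sketch below is a minimal illustration of the generate‑score‑feedback cycle, not the released implementation; `generate_candidates` (an LLM call) and `score` (a CLIP‑style scorer) are hypothetical callables.

```python
# Minimal sketch of the MILS generate-score loop.
# `generate_candidates` and `score` are hypothetical stand-ins for an LLM
# call and a multimodal scorer; this is not the released implementation.

def mils_optimize(task_prompt, test_input, generate_candidates, score,
                  init_pool, top_k=50, max_steps=10):
    """Gradient-free test-time optimization: the LLM proposes candidates,
    a multimodal scorer ranks them, and the best are fed back as context."""
    scored = [(score(c, test_input), c) for c in init_pool]
    for _ in range(max_steps):
        scored.sort(key=lambda sc: sc[0], reverse=True)
        feedback = scored[:top_k]  # top-K candidates with their scores
        new_candidates = generate_candidates(task_prompt, feedback)
        scored = feedback + [(score(c, test_input), c) for c in new_candidates]
    return max(scored, key=lambda sc: sc[0])[1]  # best candidate found
```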

Problem Addressed

Multimodal tasks such as image, video, and audio description, generation, editing, and cross‑modal reasoning typically require task‑specific models trained on curated datasets, which limits generalization to new tasks and modalities.

Proposed Solution

MILS eliminates the need for any task‑specific training by iteratively refining LLM‑generated candidates using a gradient‑free optimization loop driven by multimodal scorers. The method works for any modality where a suitable scorer exists.

Key Technical Components

Generator: typically an LLM, e.g., Llama 3.1 8B (Dubey et al., 2024), that receives the task description and optional scorer feedback, then outputs a set of candidate solutions.

Scorer: a pretrained multimodal model such as CLIP, SigLIP, ViCLIP, ImageBind, or PickScore that computes a scalar similarity or preference score for each candidate; see the scoring sketch after this list.

Iterative Feedback: the top‑K scored candidates are fed back to the generator for the next round, optionally seeded by an initial candidate pool.
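As a concrete scorer, the snippet below ranks candidate captions against an image with the Hugging Face CLIP API. The checkpoint choice and the helper name are illustrative assumptions; the paper swaps in different scorers per task.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_scores(image: Image.Image, candidates: list[str]) -> torch.Tensor:
    """Return one image-text similarity score per candidate caption."""
    inputs = processor(text=candidates, images=image,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image[0]  # shape: (num_candidates,)
```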

Experiments

Image Captioning: with Llama 3.1 8B as generator and SigLIP as scorer, MILS starts from an initial pool of 30,000 prompts, runs 10 optimization rounds, and keeps the top 50 candidates each round. Evaluated on MS‑COCO (5,000 test images) with BLEU, METEOR, CIDEr, and SPICE, MILS outperforms ZeroCap and MeaCap, achieving higher METEOR and SPICE scores despite never seeing caption training data.
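A SigLIP scorer looks nearly identical to the CLIP one above; the main difference is that its pairwise logits pass through a sigmoid rather than a softmax, so each candidate gets an independent match probability. The checkpoint below is one public option, not necessarily the paper's exact choice.

```python
import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-base-patch16-224").eval()
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def siglip_scores(image, candidates):
    # SigLIP expects fixed-length text input, hence padding="max_length".
    inputs = processor(text=candidates, images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Independent per-caption probabilities, not a softmax distribution.
    return torch.sigmoid(out.logits_per_image[0])
```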

Video Captioning: the same generator is paired with ViCLIP (ViT‑L/14) as scorer on the MSR‑VTT test set (2,990 videos). Compared against models trained on HowTo100M (Nagrani et al., 2022) and on VideoCC3M, MILS attains superior CIDEr and METEOR scores, demonstrating zero‑shot transfer.

Audio Captioning: using ImageBind as scorer and an initial pool of 50,000 audio prompts, MILS is evaluated on the Clotho dataset. It surpasses the zero‑shot baseline ZerAuCap on METEOR and SPICE, confirming cross‑modal generalization.
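For audio, an ImageBind‑style scorer embeds the clip and each candidate caption into a shared space and compares them by cosine similarity. The sketch follows the public facebookresearch/ImageBind API; treat the exact loader names as assumptions rather than a verified recipe.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

def audio_text_scores(audio_path: str, candidates: list[str]) -> torch.Tensor:
    """Cosine similarity between one audio clip and each candidate caption."""
    inputs = {
        ModalityType.AUDIO: data.load_and_transform_audio_data([audio_path], device),
        ModalityType.TEXT: data.load_and_transform_text(candidates, device),
    }
    with torch.no_grad():
        emb = model(inputs)
    a = torch.nn.functional.normalize(emb[ModalityType.AUDIO], dim=-1)
    t = torch.nn.functional.normalize(emb[ModalityType.TEXT], dim=-1)
    return (a @ t.T)[0]  # one similarity score per caption
```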

High‑Quality Image Generation: MILS uses the generator to rewrite text prompts for diffusion models (LDM and FLUX.1), while PickScore evaluates image‑text alignment. Human evaluation on Amazon Mechanical Turk (using the JUICE protocol) shows a clear preference for MILS‑enhanced images in both visual quality and textual fidelity.
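Prompt rewriting reuses the same loop, except that rendered images are scored instead of text. Below is a hedged sketch of PickScore‑based ranking, adapted from the public yuvalkirstain/PickScore_v1 model card; checkpoint names follow that card but are unverified here.

```python
import torch
from transformers import AutoModel, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval().to(device)

def pick_scores(prompt: str, images: list) -> torch.Tensor:
    """Score each rendered image for alignment with the rewritten prompt."""
    img_in = processor(images=images, return_tensors="pt").to(device)
    txt_in = processor(text=prompt, padding=True, truncation=True,
                       max_length=77, return_tensors="pt").to(device)
    with torch.no_grad():
        img = torch.nn.functional.normalize(model.get_image_features(**img_in), dim=-1)
        txt = torch.nn.functional.normalize(model.get_text_features(**txt_in), dim=-1)
        return model.logit_scale.exp() * (txt @ img.T)[0]
```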

Style Transfer & Cross‑Modal Computation : By feeding test images (or audio) to the generator as additional context, MILS performs zero‑shot style transfer using Gram‑matrix loss and even combines modalities (e.g., audio + image → text → image) via ImageBind embeddings.

Ablation Studies

Impact of initial candidate pool size: larger pools yield better final performance, indicating the importance of diverse starting points.

Generator and scorer scale: larger LLMs and CLIP variants improve captioning metrics, with LLM size showing the strongest gains.

Optimization steps: performance improves up to ~10–20 steps before plateauing; scorer loss correlates tightly with downstream metrics.

Conclusion & Future Work

MILS demonstrates that a simple generate‑score loop can give LLMs zero‑shot perception (“see”) and audio understanding (“hear”) across modalities without any task‑specific data. Limitations include dependence on the generator’s ability to produce diverse candidates, scorer accuracy, and optimization speed. Future directions involve faster gradient‑free optimizers, larger multimodal backbones, and extending to 3D or spatial tasks.

Paper PDF
GitHub repository
Tags: LLM, multimodal, AI research, zero-shot, training-free, MILS