Understanding Multimodal Large Language Models: Recent Advances and Comparative Analysis
This article surveys the latest multimodal large language model research, dissecting the design, training strategies, and performance trade‑offs of models such as Llama 3.2, Molmo, NVLM, Qwen2‑VL, Pixtral, MM1.5, Emu3, and Janus, and highlights the challenges of fair cross‑model evaluation.
The article focuses on the newest research progress in multimodal large language models (LLMs), summarizing representative papers rather than providing a comprehensive historical review. It aims to distill technical innovations and core application scenarios, and concludes with a cross‑model comparison.
Llama 3.2
Meta AI released the paper "The Llama 3 Herd of Models" on July 31 2024. The multimodal Llama 3.2 models come in 11 B- and 90 B-parameter versions that support image-text interaction via a cross-attention mechanism. Unlike typical pipelines that freeze the image encoder during pre-training, Llama 3.2 updates the image encoder while keeping the language model frozen, so the multimodal models can replace the corresponding pure-text Llama 3.1 models without breaking existing text-only pipelines. Training proceeds in stages: start from the Llama 3.1 pure-text model, add an image encoder and projection layer, pre-train on image-text data, then perform instruction fine-tuning and preference optimization. The visual encoder is a ViT-H/14 (~630 M parameters) pretrained on 2.5 B image-text pairs at 224×224 resolution, with each image divided into a 16×16 grid of 14×14-pixel patches. Cross-attention layers are inserted after every fourth transformer block, adding ~3 B parameters to the 8 B model (yielding the 11 B variant) and ~20 B parameters to the 70 B model (yielding the 90 B variant).
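To make the cross-attention idea concrete, here is a minimal sketch: frozen decoder blocks stand in for the language model, and trainable cross-attention adapters (queries from the text stream, keys/values from image features) are interleaved between them. All module names, dimensions, and the gating scheme are illustrative assumptions, not Meta's implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the cross-attention approach: frozen "LLM" blocks with
# trainable cross-attention adapters interleaved between them. Sizes and the
# insertion interval are placeholders, not Llama 3.2's actual configuration.

class CrossAttentionAdapter(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # gated residual, starts as a no-op

    def forward(self, text_h: torch.Tensor, image_h: torch.Tensor) -> torch.Tensor:
        # Queries come from the text stream; keys/values from image features.
        attended, _ = self.attn(self.norm(text_h), image_h, image_h)
        return text_h + torch.tanh(self.gate) * attended

class ToyMultimodalDecoder(nn.Module):
    def __init__(self, d_model: int = 256, n_layers: int = 8, every: int = 4):
        super().__init__()
        # Encoder layers used as simple stand-ins for the decoder blocks.
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
             for _ in range(n_layers)]
        )
        # One cross-attention adapter after every `every`-th block.
        self.adapters = nn.ModuleDict(
            {str(i): CrossAttentionAdapter(d_model)
             for i in range(n_layers) if (i + 1) % every == 0}
        )
        # Freeze the "language model" blocks; only the adapters stay trainable.
        for p in self.blocks.parameters():
            p.requires_grad = False

    def forward(self, text_h: torch.Tensor, image_h: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            text_h = block(text_h)
            if str(i) in self.adapters:
                text_h = self.adapters[str(i)](text_h, image_h)
        return text_h

# Toy usage: batch of 2, 16 text positions, 9 image patches, hidden size 256.
model = ToyMultimodalDecoder()
out = model(torch.randn(2, 16, 256), torch.randn(2, 9, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```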
Molmo and PixMo
The September 25 2024 paper "Molmo and PixMo" introduces an open‑source multimodal model and its accompanying image dataset. Molmo uses a CLIP visual encoder as the image backbone and a unified projection layer to align visual features with the LLM. Unlike many prior works, Molmo updates all parameters jointly (LLM, projection, and visual encoder). The base LLM can be selected from OLMo‑7B‑1024, OLMoE‑1B‑7B, Qwen2‑7B, or Qwen2‑72B. Training follows three stages: (1) train only the projection layer while freezing the LLM and visual encoder, (2) unfreeze the visual encoder and train it, and (3) unfreeze the entire model for end‑to‑end fine‑tuning.
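The staged recipe above amounts to toggling which parameter groups are trainable at each stage. Below is a small, hypothetical sketch of that idea with a vision-encoder stand-in, an MLP connector, and an LLM stub; none of the names or dimensions come from the Molmo codebase.

```python
import torch.nn as nn

# Hypothetical sketch of stage-wise freezing for an encoder + connector + LLM
# stack. Module names and sizes are placeholders, not Molmo's actual code.

vision_encoder = nn.Sequential(nn.Linear(768, 768), nn.GELU())   # stand-in for a CLIP ViT
connector = nn.Sequential(nn.Linear(768, 1024), nn.GELU(),
                          nn.Linear(1024, 1024))                 # projection to the LLM dim
llm = nn.Sequential(nn.Linear(1024, 1024), nn.GELU())            # stand-in for the LLM

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage: int) -> None:
    if stage == 1:      # stage 1: train only the connector
        set_trainable(vision_encoder, False)
        set_trainable(llm, False)
        set_trainable(connector, True)
    elif stage == 2:    # stage 2: additionally unfreeze the visual encoder
        set_trainable(vision_encoder, True)
        set_trainable(llm, False)
        set_trainable(connector, True)
    else:               # stage 3: full end-to-end fine-tuning
        for m in (vision_encoder, connector, llm):
            set_trainable(m, True)

configure_stage(1)
trainable = sum(p.numel() for m in (vision_encoder, connector, llm)
                for p in m.parameters() if p.requires_grad)
print(f"stage 1 trainable params: {trainable}")
```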
NVLM
NVIDIA’s September 17 2024 paper "NVLM: Open Frontier‑Class Multimodal LLMs" investigates two core architectures: (A) a unified‑embedding decoder‑only design (NVLM‑D) and (B) a cross‑attention design (NVLM‑X), plus a hybrid variant (NVLM‑H) that combines both. Findings include: NVLM‑X offers significantly better computational efficiency for high‑resolution images; NVLM‑D achieves higher OCR accuracy; NVLM‑H successfully merges the strengths of both. The model uses Qwen2‑72B‑Instruct as the LLM backbone, a frozen InternViT‑6B visual encoder, and a multi‑layer perceptron projector instead of a single linear layer.
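The choice of an MLP projector over a single linear layer is easy to picture in code. The sketch below shows both options mapping vision-encoder features into the LLM embedding space; the dimensions are placeholders, not the real InternViT-6B / Qwen2-72B sizes.

```python
import torch
import torch.nn as nn

# Illustrative comparison of a linear projector vs. an MLP projector that maps
# vision features into the LLM embedding space. Dimensions are placeholders.

vision_dim, llm_dim = 1024, 2048

linear_projector = nn.Linear(vision_dim, llm_dim)

mlp_projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

patch_features = torch.randn(2, 256, vision_dim)   # (batch, patches, vision_dim)
print(linear_projector(patch_features).shape)      # torch.Size([2, 256, 2048])
print(mlp_projector(patch_features).shape)         # torch.Size([2, 256, 2048])
```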
Qwen2‑VL
Building on the Qwen2 series of LLMs (up to Qwen2‑72B), Qwen2‑VL introduces a "naive dynamic resolution" mechanism. The visual encoder is a modified ViT that removes absolute positional encodings and adds 2D‑RoPE. A 675 M‑parameter visual encoder is paired with LLM backbones of various sizes, enabling direct processing of images at their native resolution without down‑sampling to a fixed size.
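A back-of-the-envelope sketch of what dynamic resolution implies in practice: the number of visual tokens grows with the input resolution, and each patch carries a 2D (row, column) position rather than a flattened absolute index, which is what 2D-RoPE operates on. The patch size below is an illustrative assumption, not Qwen2-VL's exact configuration.

```python
# Rough sketch of dynamic-resolution tokenization: an image at its native
# resolution is cut into a variable-sized patch grid, and every patch gets a
# 2D (row, col) position id suitable for 2D rotary embeddings. The patch size
# is an illustrative assumption.

def visual_tokens(height: int, width: int, patch: int = 14):
    rows, cols = height // patch, width // patch
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    return rows * cols, positions

for h, w in [(224, 224), (448, 672), (1024, 768)]:
    n, pos = visual_tokens(h, w)
    print(f"{h}x{w} image -> {n} patch tokens, first/last positions "
          f"{pos[0]} / {pos[-1]}")
```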
Pixtral 12B
Mistral AI’s September 17 2024 release "Pixtral 12B" is the company’s first multimodal model. It trains a 400 M‑parameter image encoder from scratch and pairs it with the 12 B‑parameter Mistral NeMo LLM. The architecture follows the unified‑embedding decoder approach (method A) and natively supports variable‑size images.
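To make the unified-embedding decoder (method A) concrete, the sketch below projects image patch embeddings into the text embedding space and simply concatenates them with text token embeddings before the decoder. Everything here is a toy stand-in, not Pixtral's actual code.

```python
import torch
import torch.nn as nn

# Toy illustration of method A (unified-embedding decoder): projected image
# patch embeddings and text token embeddings are concatenated into a single
# sequence that the decoder processes like ordinary tokens. Sizes are made up.

d_model, vocab = 512, 1000
text_embed = nn.Embedding(vocab, d_model)
image_proj = nn.Linear(768, d_model)              # vision features -> token space
decoder_block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

text_ids = torch.randint(0, vocab, (1, 12))       # 12 text tokens
patch_feats = torch.randn(1, 64, 768)             # 64 patches from a variable-size image

sequence = torch.cat([image_proj(patch_feats), text_embed(text_ids)], dim=1)
print(decoder_block(sequence).shape)              # torch.Size([1, 76, 512])
```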
MM1.5
The September 30 2024 paper "MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine‑tuning" follows the unified‑embedding decoder approach, provides practical fine‑tuning recipes, and introduces a mixture‑of‑experts (MoE) multimodal variant. Model scales range from 1 B to 30 B parameters.
Emu3
Emu3, described in the September 27 2024 paper "Emu3: Next‑Token Prediction is All You Need", demonstrates that image generation can be handled by a decoder‑only transformer trained purely on next‑token prediction. The model is trained from scratch, uses Direct Preference Optimization (DPO) for alignment, and incorporates a visual tokenizer inspired by SBER‑MoVQGAN. Its LLM backbone architecture is based on Llama 2.
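At a high level, next-token image generation reduces to sampling discrete codebook indices autoregressively and then mapping them back to a latent grid that the visual tokenizer's decoder turns into pixels. The snippet below is a schematic of that loop with a dummy model and codebook, not Emu3's actual tokenizer or sampling code.

```python
import torch

# Schematic of "images as next-token prediction": sample codebook indices one
# at a time from a (dummy) autoregressive model, then gather the corresponding
# latent vectors into a grid that a VQ decoder would turn into pixels.
# All components here are stand-ins, not Emu3's actual modules.

codebook_size, grid = 1024, 8               # 8x8 latent grid -> 64 visual tokens
codebook = torch.randn(codebook_size, 16)   # each code is a 16-dim latent vector

def dummy_next_token_logits(prefix: torch.Tensor) -> torch.Tensor:
    # Stand-in for the transformer decoder's logits over the visual vocabulary.
    return torch.randn(codebook_size)

tokens = []
for _ in range(grid * grid):
    logits = dummy_next_token_logits(torch.tensor(tokens, dtype=torch.long))
    probs = torch.softmax(logits, dim=-1)
    tokens.append(torch.multinomial(probs, 1).item())

latents = codebook[torch.tensor(tokens)].reshape(grid, grid, -1)
print(latents.shape)  # torch.Size([8, 8, 16]); a VQ decoder would map this to pixels
```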
Janus
The October 17 2024 paper "Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation" proposes a framework that separates the visual encoding paths for understanding (high‑level semantic features) and generation (fine‑grained local detail). Janus adopts a SigLIP visual encoder for understanding, a VQ tokenizer for generation, and DeepSeek‑LLM (1.3 B) as the language backbone. Training proceeds in three stages: (1) train only the projection and image‑output layers, (2) unfreeze the LLM and text output layers for joint pre‑training, and (3) unfreeze the entire model, including the visual encoder, for supervised fine‑tuning.
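The decoupling idea can be summarized as routing: understanding inputs pass through a semantic image encoder into the LLM's embedding space, while generation targets are represented as discrete VQ codes predicted by the same LLM. The sketch below wires up that routing with placeholder modules; it is not Janus's implementation.

```python
import torch
import torch.nn as nn

# Sketch of decoupled visual encoding: one path embeds images for
# understanding (semantic features -> LLM space), a separate path represents
# images as discrete VQ codes for generation. All modules are placeholders.

d_llm, codebook_size = 512, 2048

understanding_encoder = nn.Linear(768, d_llm)       # stand-in for SigLIP + adaptor
generation_head = nn.Linear(d_llm, codebook_size)   # predicts VQ token ids for images

def encode_for_understanding(patch_feats: torch.Tensor) -> torch.Tensor:
    # Continuous image embeddings appended to the text sequence.
    return understanding_encoder(patch_feats)

def predict_image_token(llm_hidden: torch.Tensor) -> torch.Tensor:
    # During generation the LLM emits logits over the VQ codebook instead.
    return generation_head(llm_hidden).argmax(dim=-1)

img_tokens = encode_for_understanding(torch.randn(1, 64, 768))
next_vq_id = predict_image_token(torch.randn(1, d_llm))
print(img_tokens.shape, next_vq_id.shape)  # torch.Size([1, 64, 512]) torch.Size([1])
```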
Conclusion
The author notes two major obstacles to fair benchmarking: (1) data contamination—public benchmarks may contain training data, making LLM vs. multimodal LLM comparisons unreliable; and (2) vast architectural heterogeneity—differences in encoders, decoders, and multimodal interfaces hinder direct comparisons. NVIDIA’s multi‑variant NVLM study is highlighted as a rare example of a systematic cross‑architecture analysis. The overall conclusion is that multimodal LLMs can be constructed successfully through a variety of design choices, and the article provides a consolidated view of the core components and their variants.
