How LoRA Enables Multimodal Capabilities in Large Language Models

This article compares two ways to add vision to large language models—training a native multimodal model from scratch or attaching a visual module to a pretrained LLM—then details the VoRA approach that uses LoRA adapters to inject visual knowledge without extra inference cost.

Introduction

When large language models (LLMs) entered the consumer market, the desire to go beyond pure language modeling grew. Vision was the first modality to be tackled, giving rise to a wave of vision-language models (VLMs). Two main routes exist: training a native multimodal model (NMM) from scratch, or attaching a visual module to a pretrained LLM.

Native Multimodal Models (Early Fusion)

Early-fusion models share a unified discrete token space across modalities, so text and image tokens are processed by the same backbone from the first layer. By this definition, VisualBERT (2019), Flamingo (2022), and PaLI (2022) are excluded, since they feed continuous features from a separate vision encoder into a language model rather than sharing one discrete token vocabulary. Under these criteria, Meta's Chameleon (2024) is the first true native multimodal model, and it influenced later systems such as Llama 4 and Gemini 2.5.

Chameleon inherits the Llama-2 backbone, including its SwiGLU activation and RoPE positional encoding. Its key change targets training stability: because softmax is invariant to shifting all logits by the same constant, attention logits can drift upward without bound. The paper mitigates this logit-drift issue primarily with query-key normalization (QK-Norm), applying layer normalization to the query and key vectors before attention.
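
A minimal PyTorch sketch of QK-Norm inside a self-attention block; the module layout, dimensions, and per-head LayerNorm placement are illustrative assumptions, not Chameleon's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with query-key normalization (QK-Norm).

    Normalizing q and k before the dot product bounds the logit
    magnitudes, countering the drift that softmax's shift invariance
    otherwise permits.
    """

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # LayerNorm over each head's channel dimension (assumed placement).
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim).
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)  # the QK-Norm step
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, t, d))
```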

Pretrained LLM + Visual Module

The more common path leverages a pretrained LLM and adds a visual encoder, as exemplified by the LLaVA paper. An image is processed by a frozen visual encoder (e.g., CLIP or ViT), projected into the token embedding space via a trainable matrix, and then fed to the language model. During training the projection matrix and the LLM parameters are updated, while the visual encoder remains frozen.
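
To make the wiring concrete, here is a minimal PyTorch sketch of this recipe. The encoder interface, dimensions, and module names are assumptions for illustration, not LLaVA's actual implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Frozen vision encoder + trainable projection, LLaVA-style.

    `vision_encoder` stands in for a pretrained CLIP/ViT returning
    patch features of shape (batch, n_patches, vision_dim); the linear
    projection maps them into the LLM's token-embedding space so image
    tokens can be concatenated with text-token embeddings.
    """

    def __init__(self, vision_encoder: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        for p in self.vision_encoder.parameters():
            p.requires_grad = False                    # encoder stays frozen
        self.projection = nn.Linear(vision_dim, llm_dim)  # trainable connector

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patch_feats = self.vision_encoder(images)  # (B, P, vision_dim)
        image_embeds = self.projection(patch_feats)    # (B, P, llm_dim)
        # Prepend image tokens to the text sequence fed to the LLM.
        return torch.cat([image_embeds, text_embeds], dim=1)
```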

Limitations of this pipeline include the fixed input resolution of most visual encoders, and a serial workflow in which the language model must wait for the visual encoder to finish, reducing efficiency.

Vision‑as‑LoRA (VoRA)

VoRA proposes to eliminate external visual models and preserve the pretrained LLM’s knowledge. Only a LoRA adapter for visual context and an image‑embedding layer are fine‑tuned, making the LoRA adapter the model’s “visual parameters”.

During pretraining, all linear layers (including the QKV projections and the FFN) in the first N LLM blocks receive LoRA modules. After pretraining, the LoRA weights are merged into the corresponding LLM weights, incurring no extra inference cost.
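
The merge works because the adapter only ever adds a low-rank term: the adapted layer computes Wx + (α/r)·BAx, so after training the update can be folded into the weight as W' = W + (α/r)·BA. A hedged PyTorch sketch of this mechanism (rank, scaling, and initialization are illustrative defaults, not VoRA's exact settings):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pretrained weights stay frozen
        self.scale = alpha / rank
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        # B is zero-initialized, so the update starts at zero.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold the update into the base weight: W' = W + (alpha/r) * B @ A."""
        self.base.weight += self.scale * (self.B @ self.A)
        return self.base   # a plain nn.Linear: zero extra inference cost
```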

The paper also distills knowledge from a pretrained ViT: the LLM's visual hidden states are aligned with the ViT's hidden states, which accelerates training while confining the trainable parameters to the LoRA adapters rather than a full projection layer. The training objective combines a distillation loss (cosine similarity between projected LLM features and ViT embeddings) with a language-modeling loss (cross-entropy).
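
A sketch of how such a combined objective can be computed; the projection head, the patch-wise alignment, and the loss weighting are assumptions for illustration rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def vora_style_loss(
    lm_logits: torch.Tensor,          # (B, T, vocab) LLM output logits
    target_ids: torch.Tensor,         # (B, T) next-token labels
    llm_visual_hidden: torch.Tensor,  # (B, P, llm_dim) hidden states at image positions
    vit_features: torch.Tensor,       # (B, P, vit_dim) frozen teacher ViT features
    proj: torch.nn.Module,            # small head mapping llm_dim -> vit_dim (assumed)
    distill_weight: float = 1.0,      # placeholder weighting
) -> torch.Tensor:
    # Standard language-modeling cross-entropy.
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), target_ids.flatten())
    # Distillation: 1 - cosine similarity between projected student
    # features and teacher features, averaged over patches.
    student = proj(llm_visual_hidden)
    distill_loss = (1 - F.cosine_similarity(student, vit_features, dim=-1)).mean()
    return lm_loss + distill_weight * distill_loss
```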

Although still early, VoRA shows promise for building multimodal models beyond vision, such as audio, video, or 3D, by decoupling modality‑specific parameters and reducing training time.

References

PaLI (arXiv 2209.06794)

LLaVA (arXiv 2304.08485)

Chameleon (arXiv 2405.09818)

VoRA (arXiv 2503.20680)

Flamingo (arXiv 2204.14198)

PaliGemma (arXiv 2407.07726)
