Understanding Multimodal Large Language Models: Part 1
This article explains the fundamentals of multimodal large language models: their definition, typical applications, and two main architectural approaches (unified embedding decoder and cross‑modal attention). It also breaks down the components of each approach, gives a PyTorch implementation of image‑patch projection, covers training considerations, and closes with the trade‑offs between the two methods.
Introduction
The article aims to explain how Multimodal Large Language Models (Multimodal LLMs) work and will later review more than ten recent papers in the field, comparing their technical paths.
What Is a Multimodal LLM?
A Multimodal LLM accepts inputs in different modalities (audio, text, images, and video) and produces text as output. This article focuses on handling images together with text.
Typical Applications
The most intuitive use case is image captioning: a user provides an image and the model returns a textual description. Other examples include extracting structured information from PDF tables and converting it to LaTeX or Markdown.
Construction Schemes
Two main model‑structure schemes are presented:
Scheme A: Unified Embedding Decoder Architecture
Scheme B: Cross‑Modal Cross‑Attention Architecture
Scheme A – Unified Embedding Decoder
This approach uses an unmodified decoder‑only LLM (e.g., GPT‑2, Phi‑3, Gemma, or Llama 3.2). Images are converted into tokens whose embeddings have the same dimension as text token embeddings, allowing the model to concatenate image and text tokens and process them jointly.
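The concatenation step can be sketched as follows. This is a minimal illustration, not a full model: the two tensors are random stand-ins for the outputs of an image projector and a text embedding layer, and the sizes (9 image tokens, 5 text tokens, 768 dimensions) are assumptions chosen only for the example.

```python
import torch

torch.manual_seed(0)

embedding_dim = 768    # assumed LLM embedding size (illustrative)
num_image_tokens = 9   # e.g., one token per image patch
num_text_tokens = 5    # hypothetical prompt length

# Random stand-ins for the projected image patches and the embedded text tokens
image_embeddings = torch.rand(1, num_image_tokens, embedding_dim)
text_embeddings = torch.rand(1, num_text_tokens, embedding_dim)

# Because both share the same embedding dimension, they can be joined
# along the sequence dimension and fed to the decoder as one sequence.
decoder_input = torch.cat([image_embeddings, text_embeddings], dim=1)
print(decoder_input.shape)  # torch.Size([1, 14, 768])
```

The decoder then sees a single sequence of 14 embeddings and needs no architectural changes.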
Standard text processing involves tokenisation (often byte‑pair encoding) followed by an embedding layer that maps each token ID to an embedding vector.
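A minimal sketch of this text path is shown here. The whitespace tokenizer and four-word vocabulary are hypothetical stand-ins for a real byte‑pair‑encoding tokenizer, used only to keep the example self-contained.

```python
import torch

torch.manual_seed(0)

# Toy vocabulary and whitespace "tokenizer" (assumption: a real model
# would use a trained BPE tokenizer with a much larger vocabulary).
vocab = {"<unk>": 0, "describe": 1, "this": 2, "image": 3}

def encode(text):
    """Map each lowercase word to its token ID, or to <unk> if unknown."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

# The embedding layer is a lookup table: one learnable vector per token ID
embedding_layer = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=768)

token_ids = torch.tensor([encode("Describe this image")])  # shape: (1, 3)
text_embeddings = embedding_layer(token_ids)

print(token_ids)              # tensor([[1, 2, 3]])
print(text_embeddings.shape)  # torch.Size([1, 3, 768])
```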
The image encoder first splits an image into patches, then encodes each patch with a pretrained Vision Transformer (ViT). The resulting patch embeddings are flattened and passed through a linear projection layer to match the LLM’s embedding dimension.
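The patch-splitting step alone can be sketched as below. This covers only the splitting, not the Vision Transformer itself; the helper name `split_into_patches` and the 48x48 image size are assumptions made for illustration. The output layout (batch, num_patches, channels, height, width) matches the input expected by the patch projection layer in this section.

```python
import torch

def split_into_patches(images, patch_size):
    """Split (batch, channels, H, W) images into non-overlapping square patches.

    Returns a tensor of shape (batch, num_patches, channels, patch_size, patch_size).
    """
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    grid_h, grid_w = h // patch_size, w // patch_size
    # Separate each spatial axis into (grid index, offset within patch)
    x = images.reshape(b, c, grid_h, patch_size, grid_w, patch_size)
    # Bring the two grid axes forward: (b, grid_h, grid_w, c, p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)
    # Merge the grid axes into a single patch axis (row-major order)
    return x.reshape(b, grid_h * grid_w, c, patch_size, patch_size)

images = torch.rand(1, 3, 48, 48)  # one hypothetical 48x48 RGB image
patches = split_into_patches(images, patch_size=16)
print(patches.shape)  # torch.Size([1, 9, 3, 16, 16])
```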
For example, a linear projection layer can map a flattened 256‑dimensional patch vector into a 768‑dimensional embedding space.
PyTorch implementation of the patch projection layer:
import torch


class PatchProjectionLayer(torch.nn.Module):
    def __init__(self, patch_size, num_channels, embedding_dim):
        super().__init__()
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.embedding_dim = embedding_dim
        # Input features per flattened patch: patch_size * patch_size * num_channels
        self.projection = torch.nn.Linear(patch_size * patch_size * num_channels, embedding_dim)

    def forward(self, x):
        batch_size, num_patches, channels, height, width = x.size()
        x = x.view(batch_size, num_patches, -1)  # Flatten each patch
        x = self.projection(x)  # Project each flattened patch
        return x


# Example Usage:
batch_size = 1
num_patches = 9  # Total patches per image
patch_size = 16  # 16x16 pixels per patch (16 * 16 * 3 = 768 values once flattened)
num_channels = 3  # RGB image
embedding_dim = 768  # Size of the embedding vector
projection_layer = PatchProjectionLayer(patch_size, num_channels, embedding_dim)
patches = torch.rand(batch_size, num_patches, num_channels, patch_size, patch_size)
projected_embeddings = projection_layer(patches)
print(projected_embeddings.shape)
# This prints
# torch.Size([1, 9, 768])
Scheme B – Cross‑Modal Cross‑Attention
This method also uses a pretrained image encoder (e.g., CLIP or OpenCLIP) but integrates image and text features via a cross‑attention layer inside the Transformer. Instead of concatenating image tokens, the model learns to align image patch keys/values with text queries dynamically.
The cross‑attention mechanism follows the original Transformer design ("Attention Is All You Need", 2017). Queries come from the decoder, while keys and values come from the encoder (here, the image encoder). The two input sequences can have different lengths, but their embedding dimensions must match.
When both inputs are the same sequence (rather than merely having the same length), cross‑attention reduces to self‑attention.
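A minimal single-head sketch of this mechanism follows, assuming text embeddings supply the queries and image embeddings supply the keys and values. The class name `CrossAttention`, the absence of masking and multiple heads, and all sizes are simplifications chosen for illustration rather than the design of any particular model.

```python
import torch

class CrossAttention(torch.nn.Module):
    """Single-head scaled dot-product cross-attention (illustrative sketch)."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = torch.nn.Linear(d_in, d_out, bias=False)
        self.W_key = torch.nn.Linear(d_in, d_out, bias=False)
        self.W_value = torch.nn.Linear(d_in, d_out, bias=False)

    def forward(self, x_text, x_image):
        queries = self.W_query(x_text)   # (batch, num_text, d_out)
        keys = self.W_key(x_image)       # (batch, num_image, d_out)
        values = self.W_value(x_image)   # (batch, num_image, d_out)
        # One score per (text token, image token) pair
        scores = queries @ keys.transpose(-2, -1)  # (batch, num_text, num_image)
        weights = torch.softmax(scores / keys.shape[-1] ** 0.5, dim=-1)
        return weights @ values          # (batch, num_text, d_out)

torch.manual_seed(0)
cross_attn = CrossAttention(d_in=768, d_out=768)
x_text = torch.rand(1, 5, 768)   # 5 text tokens
x_image = torch.rand(1, 9, 768)  # 9 image tokens: a different length is fine
context = cross_attn(x_text, x_image)
print(context.shape)  # torch.Size([1, 5, 768])
```

The output has one vector per text token, so the sequence length seen by the rest of the decoder is unchanged. Passing the same tensor for both arguments, as in `cross_attn(x_text, x_text)`, gives ordinary (unmasked) self-attention.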
Training Process
Multimodal LLM training mirrors text‑only LLM training and has two stages: pre‑training and instruction fine‑tuning. Training typically starts from a pretrained text‑only LLM (e.g., GPT, LLaMA) rather than from scratch. During pre‑training, the image encoder (often CLIP) is kept frozen and only the linear projector is trained. During instruction fine‑tuning, the LLM may be unfrozen for full parameter updates. In the cross‑attention scheme, the cross‑attention layers remain trainable throughout.
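The freezing schedule can be sketched in PyTorch by toggling `requires_grad` on each component. The three `Linear` modules here are tiny hypothetical stand-ins for a real image encoder, projector, and LLM, and the helper names `set_trainable` and `count_trainable` are assumptions introduced for this example.

```python
import torch

# Tiny stand-ins for the real components (assumption: illustration only)
image_encoder = torch.nn.Linear(768, 512)
projector = torch.nn.Linear(512, 768)
llm = torch.nn.Linear(768, 768)

def set_trainable(module, trainable):
    """Freeze or unfreeze every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def count_trainable(*modules):
    """Count parameters that would receive gradient updates."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Stage 1, pre-training: only the projector is updated
set_trainable(image_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)
pretrain_trainable = count_trainable(image_encoder, projector, llm)

# Stage 2, instruction fine-tuning: the LLM may be unfrozen as well
set_trainable(llm, True)
finetune_trainable = count_trainable(image_encoder, projector, llm)

print(pretrain_trainable, finetune_trainable)
```

In an actual training loop, only the parameters with `requires_grad=True` would be passed to the optimizer at each stage.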
Method Comparison and Trade‑offs
Which method is better? The answer depends on specific trade‑offs.
Scheme A is easier to implement because it does not modify the original decoder architecture. Scheme B is often more computationally efficient because it does not lengthen the input sequence with image tokens; the image information enters through the cross‑attention layers instead. Scheme B also preserves the original LLM’s text‑only generation quality when the LLM parameters stay frozen during training.
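The sequence-length argument can be made concrete with a rough back-of-the-envelope count of attention score entries per layer. This is a simplified estimate under stated assumptions: it ignores causal masking, key/value caching, the image encoder's own cost, and how many layers actually contain cross-attention. The token counts (64 text tokens, 576 image tokens) are hypothetical.

```python
def attention_score_entries(num_text_tokens, num_image_tokens):
    """Rough per-layer count of query-key score entries for each scheme."""
    # Scheme A: self-attention over the concatenated image + text sequence
    total = num_text_tokens + num_image_tokens
    unified = total * total
    # Scheme B: self-attention over text only, plus text-to-image cross-attention
    cross = num_text_tokens * num_text_tokens + num_text_tokens * num_image_tokens
    return unified, cross

unified, cross = attention_score_entries(num_text_tokens=64, num_image_tokens=576)
print(unified, cross)  # 409600 40960
```

Under these assumptions the unified sequence needs about ten times as many score entries, because the image tokens attend to each other as well as to the text.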
Conclusion
The article provides a foundational overview of Multimodal LLMs, covering definitions, applications, two architectural families, component details, training stages, and practical trade‑offs. Future installments will analyze recent research papers to illustrate concrete applications of these methods.
AI Algorithm Path
A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
