Understanding Multimodal Large Language Models: Part 1

This article explains the fundamentals of multimodal large language models, covering their definition, typical applications, two main architectural approaches—unified embedding decoder and cross‑modal attention—along with detailed component breakdowns, a PyTorch implementation of image‑patch projection, and training considerations, ending with a discussion of trade‑offs between the methods.

AI Algorithm Path

Mar 19, 2025

Introduction

The article aims to explain how Multimodal Large Language Models (Multimodal LLMs) work and will later review more than ten recent papers in the field, comparing their technical paths.

What Is a Multimodal LLM?

A Multimodal LLM accepts inputs in several modalities (audio, text, images, and video) and produces text as output. This article focuses on handling images and text together.

Typical Applications

The most intuitive use case is image captioning: a user provides an image and the model returns a textual description. Other examples include extracting structured information from PDF tables and converting it to LaTeX or Markdown.

Construction Schemes

Two main model‑structure schemes are presented:

Scheme A: Unified Embedding Decoder Architecture

Scheme B: Cross‑Modal Cross‑Attention Architecture

Scheme A – Unified Embedding Decoder

This approach uses an unmodified decoder‑only LLM (e.g., GPT‑2, Phi‑3, Gemma, or Llama 3.2). Images are converted into tokens whose embeddings have the same dimension as text token embeddings, allowing the model to concatenate image and text tokens and process them jointly.

Standard text processing involves tokenisation (often byte-pair encoding) followed by an embedding layer that maps each token ID to a vector.
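The text path can be sketched in a few lines. This is a minimal illustration, not a real tokenizer: the toy vocabulary, the whitespace-based `tokenize` helper, and the embedding size of 768 are all assumptions made for the example, and a real LLM would use a trained byte-pair-encoding tokenizer instead.

```python
import torch

# Hypothetical toy vocabulary; a real LLM would use a trained BPE tokenizer.
vocab = {"<unk>": 0, "a": 1, "photo": 2, "of": 3, "cat": 4}

def tokenize(text):
    """Map whitespace-separated words to token IDs (a stand-in for BPE)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

torch.manual_seed(0)
embedding_dim = 768  # assumed to match the LLM's hidden size
embedding_layer = torch.nn.Embedding(len(vocab), embedding_dim)

token_ids = torch.tensor([tokenize("A photo of a cat")])  # shape: (1, 5)
text_embeddings = embedding_layer(token_ids)              # shape: (1, 5, 768)
print(text_embeddings.shape)  # torch.Size([1, 5, 768])
```

Each token ID simply selects one row of the embedding matrix, so the output has one 768-dimensional vector per token.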

The image encoder first splits an image into patches, then encodes each patch with a pretrained Vision Transformer (ViT). The resulting patch embeddings are flattened and passed through a linear projection layer to match the LLM’s embedding dimension.
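The patch-splitting step alone can be sketched as follows. The helper name `split_into_patches` and the 48x48 input size are assumptions chosen for illustration; the sketch covers only the splitting, not the pretrained ViT that encodes the patches.

```python
import torch

def split_into_patches(images, patch_size):
    """Split (B, C, H, W) images into non-overlapping square patches.

    Returns a tensor of shape (B, num_patches, C, patch_size, patch_size),
    with patches ordered row by row across the image.
    """
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Unfold height, then width: (B, C, H/p, W/p, p, p)
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # Move the patch-grid dimensions ahead of the channels, then merge them
    patches = patches.permute(0, 2, 3, 1, 4, 5)
    return patches.reshape(b, -1, c, patch_size, patch_size)

images = torch.rand(1, 3, 48, 48)  # one RGB image, 48x48 pixels
patches = split_into_patches(images, patch_size=16)
print(patches.shape)  # torch.Size([1, 9, 3, 16, 16])
```

A 48x48 image with 16x16 patches yields a 3x3 grid, that is, 9 patches per image.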

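A linear projection is a single fully connected layer. For example, it can map a flattened 256-dimensional patch vector into a 768-dimensional embedding space. In the code below, each 16x16 RGB patch flattens to 16 x 16 x 3 = 768 values, which are projected to a 768-dimensional embedding.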

PyTorch implementation of the patch projection layer:

import torch


class PatchProjectionLayer(torch.nn.Module):
    """Flatten each image patch and project it to the embedding dimension."""

    def __init__(self, patch_size, num_channels, embedding_dim):
        super().__init__()
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.embedding_dim = embedding_dim
        # Input size is the number of values in one flattened patch
        self.projection = torch.nn.Linear(
            patch_size * patch_size * num_channels, embedding_dim
        )

    def forward(self, x):
        # x: (batch_size, num_patches, channels, height, width)
        # flatten() also handles non-contiguous inputs, unlike view()
        x = x.flatten(start_dim=2)  # Flatten each patch
        x = self.projection(x)  # Project each flattened patch
        return x


# Example usage:
batch_size = 1
num_patches = 9  # Total patches per image
patch_size = 16  # 16x16 pixels per patch
num_channels = 3  # RGB image
embedding_dim = 768  # Size of the embedding vector

projection_layer = PatchProjectionLayer(patch_size, num_channels, embedding_dim)
patches = torch.rand(batch_size, num_patches, num_channels, patch_size, patch_size)
projected_embeddings = projection_layer(patches)
print(projected_embeddings.shape)
# This prints:
# torch.Size([1, 9, 768])
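
Once image patches and text tokens share an embedding dimension, Scheme A joins them along the sequence dimension before they enter the decoder. The sketch below uses random tensors as stand-ins for the projected patch embeddings and the text token embeddings; the sequence lengths (9 and 5) and the image-first ordering are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
embedding_dim = 768

# Random stand-ins for the two embedding sequences (assumed shapes)
image_embeddings = torch.rand(1, 9, embedding_dim)  # 9 projected image patches
text_embeddings = torch.rand(1, 5, embedding_dim)   # 5 text tokens

# Scheme A: concatenate along the sequence dimension, image tokens first here
decoder_input = torch.cat([image_embeddings, text_embeddings], dim=1)
print(decoder_input.shape)  # torch.Size([1, 14, 768])
```

The decoder then processes all 14 positions with its ordinary self-attention, which is why the LLM itself needs no architectural changes.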

Scheme B – Cross‑Modal Cross‑Attention

This method also uses a pretrained image encoder (e.g., CLIP or OpenCLIP) but integrates image and text features via a cross‑attention layer inside the Transformer. Instead of concatenating image tokens, the model learns to align image patch keys/values with text queries dynamically.

The cross‑attention mechanism follows the original Transformer design ("Attention Is All You Need", 2017). Queries come from the decoder, while keys and values come from the encoder (here, the image encoder). The two input sequences can have different lengths, but their embedding dimensions must match.

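Cross-attention reduces to self-attention when both inputs are the same sequence, that is, when the queries, keys, and values are all derived from one input. Equal sequence lengths alone are not enough: as long as the queries come from the text and the keys and values come from the image, the operation is still cross-attention.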
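
A minimal single-head sketch of this mechanism is shown below. The class name `CrossAttention`, the model dimension of 64, and the sequence lengths are assumptions for illustration; production models use multi-head attention with additional projections and normalization.

```python
import torch

class CrossAttention(torch.nn.Module):
    """Single-head cross-attention: queries from text, keys/values from image."""

    def __init__(self, d_model):
        super().__init__()
        self.W_q = torch.nn.Linear(d_model, d_model, bias=False)
        self.W_k = torch.nn.Linear(d_model, d_model, bias=False)
        self.W_v = torch.nn.Linear(d_model, d_model, bias=False)

    def forward(self, x_text, x_image):
        q = self.W_q(x_text)   # (B, T_text, d)
        k = self.W_k(x_image)  # (B, T_img, d)
        v = self.W_v(x_image)  # (B, T_img, d)
        # Scaled dot-product scores between every text and image position
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # (B, T_text, T_img)
        weights = torch.softmax(scores, dim=-1)
        return weights @ v     # (B, T_text, d)

torch.manual_seed(0)
attn = CrossAttention(d_model=64)
x_text = torch.rand(1, 5, 64)   # 5 text tokens
x_image = torch.rand(1, 9, 64)  # 9 image tokens (a different length is fine)
out = attn(x_text, x_image)
print(out.shape)  # torch.Size([1, 5, 64])
```

The output has one vector per text position regardless of how many image tokens there are. Calling `attn(x_text, x_text)` would make the same module compute (unmasked) self-attention.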

Training Process

Multimodal LLM training mirrors text-only LLM training and has two stages: pre-training and instruction fine-tuning. Rather than starting from scratch, it typically begins with a pretrained text-only LLM (e.g., GPT, LLaMA) as the base. During pre-training, the image encoder (often CLIP) and the LLM are usually kept frozen, and only the linear projector is trained. During instruction fine-tuning, the LLM may be unfrozen for full parameter updates. In the cross-attention scheme, the cross-attention layers remain trainable throughout both stages.
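
In PyTorch, freezing and unfreezing components comes down to toggling `requires_grad` on their parameters. The sketch below uses tiny `Linear` layers as placeholders for the image encoder, projector, and LLM; the helper names `set_trainable` and `count_trainable` and all layer sizes are assumptions for illustration.

```python
import torch

# Tiny placeholder modules; real systems would load pretrained weights
image_encoder = torch.nn.Linear(32, 16)  # stands in for e.g. a CLIP encoder
projector = torch.nn.Linear(16, 8)       # the linear projector
llm = torch.nn.Linear(8, 8)              # stands in for the decoder LLM

def set_trainable(module, trainable):
    """Enable or disable gradient updates for all parameters of a module."""
    for p in module.parameters():
        p.requires_grad_(trainable)

def count_trainable(*modules):
    """Count parameters that would receive gradient updates."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Stage 1 (pre-training): only the projector is updated
set_trainable(image_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)
stage1 = count_trainable(image_encoder, projector, llm)

# Stage 2 (instruction fine-tuning): the LLM may be unfrozen as well
set_trainable(llm, True)
stage2 = count_trainable(image_encoder, projector, llm)

print(stage1, stage2)  # 136 208
```

Only parameters with `requires_grad=True` should be passed to the optimizer, so the frozen components stay unchanged during the corresponding stage.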

Method Comparison and Trade‑offs

Which method is better? The answer depends on specific trade‑offs.

Scheme A is easier to implement because it does not modify the original decoder architecture. Scheme B is generally considered more computationally efficient because it does not lengthen the input sequence with image tokens; the image features enter later, through the cross-attention layers. In addition, Scheme B preserves the original LLM's text-only performance if the LLM parameters stay frozen during training.
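
A rough back-of-the-envelope count illustrates the efficiency argument. The function below counts query-key score entries per attention head in one layer under simplifying assumptions: causal masking is ignored, Scheme B is assumed to add one cross-attention step alongside self-attention in that layer, and the token counts (128 text, 576 image) are hypothetical.

```python
def attention_score_entries(n_text, n_image):
    """Count query-key score entries per attention head in one layer."""
    # Scheme A: image tokens are part of the sequence seen by self-attention
    unified = (n_text + n_image) ** 2
    # Scheme B: self-attention over text only, plus text-to-image cross-attention
    cross = n_text ** 2 + n_text * n_image
    return unified, cross

unified, cross = attention_score_entries(n_text=128, n_image=576)
print(unified, cross)  # 495616 90112
```

Under these assumptions, the cost in Scheme A grows quadratically with the number of image tokens, while in Scheme B it grows only linearly. Real-world costs also depend on how many layers carry cross-attention and on the image encoder itself.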

Conclusion

The article provides a foundational overview of Multimodal LLMs, covering definitions, applications, two architectural families, component details, training stages, and practical trade‑offs. Future installments will analyze recent research papers to illustrate concrete applications of these methods.
