Inside Meta’s PerceptionLM: A Deep Dive into Open‑Source Vision‑Language Models

This article provides a detailed analysis of Meta’s PerceptionLM (PLM), an open‑source perception language model built on Llama 3, describing its vision encoder, projector, dynamic tiling, three‑stage training pipeline, model variants, and competitive performance on image and video benchmarks.


Architecture Design

The backbone is a pretrained large language model (LLM), Llama 3. Because LLMs cannot ingest raw visual data, two additional components are added:

Perception Encoder (PE) – a vision encoder, described in a separate paper (see References), that extracts visual embeddings from images and videos.

Projector – a two‑layer MLP that maps visual embeddings into the LLM’s embedding space for cross‑modal alignment.

Text is tokenized by Llama 3’s tokenizer. After projection, visual embeddings are concatenated with token embeddings and fed jointly to the LLM, enabling simultaneous multimodal reasoning.
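
As a rough illustration, the sketch below shows how a two‑layer MLP projector and the visual/text concatenation could look in PyTorch. The dimensions, module names, and activation are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Two-layer MLP mapping vision-encoder embeddings into the LLM's
    embedding space (dimensions below are illustrative, not the real config)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_embeds: torch.Tensor) -> torch.Tensor:
        # visual_embeds: (batch, num_visual_tokens, vision_dim)
        return self.mlp(visual_embeds)

# Projected visual tokens are concatenated with text token embeddings
# and fed to the LLM as a single sequence.
batch, n_vis, n_txt = 2, 256, 32
visual_embeds = torch.randn(batch, n_vis, 1024)
text_embeds = torch.randn(batch, n_txt, 4096)   # from Llama 3's embedding table

projector = Projector()
multimodal_input = torch.cat([projector(visual_embeds), text_embeds], dim=1)
print(multimodal_input.shape)  # torch.Size([2, 288, 4096])
```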

Model Variants

Base models: Llama 3.2 backbones with 1 B and 3 B parameters, each paired with a 300 M‑parameter vision encoder (PE‑L).

Enhanced model: an 8 B‑parameter Llama 3.1 backbone paired with a 1.9 B‑parameter vision encoder (PE‑G).
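
For quick reference, the variants above can be summarized as a small configuration mapping in code; the dictionary keys and labels are informal shorthand taken from this article, not identifiers from the official repository.

```python
# Informal summary of the PLM variants described above.
PLM_VARIANTS = {
    "PLM-1B": {"llm": "Llama 3.2 1B", "vision_encoder": "PE-L (~300M params)"},
    "PLM-3B": {"llm": "Llama 3.2 3B", "vision_encoder": "PE-L (~300M params)"},
    "PLM-8B": {"llm": "Llama 3.1 8B", "vision_encoder": "PE-G (~1.9B params)"},
}
```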

High‑Resolution Support

When an input exceeds the encoder’s resolution limit, dynamic tiling splits the image into multiple tiles. Each tile is then down‑sampled by average pooling (each 2×2 block is replaced by its mean). For video, 32 frames are extracted and the same pooling is applied to each frame.
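
A simplified sketch of tiling plus 2×2 average pooling is shown below. The tile size, padding strategy, and grid layout are assumptions for illustration; the paper’s actual dynamic‑tiling policy (e.g. aspect‑ratio‑aware grid selection) is more involved.

```python
import torch
import torch.nn.functional as F

def dynamic_tile(image: torch.Tensor, tile_size: int = 448) -> torch.Tensor:
    """Split a (C, H, W) image into non-overlapping tile_size x tile_size tiles.
    Simplified illustration: pads to a multiple of tile_size instead of
    selecting an aspect-ratio-aware grid."""
    c, h, w = image.shape
    pad_h = (-h) % tile_size
    pad_w = (-w) % tile_size
    image = F.pad(image, (0, pad_w, 0, pad_h))
    tiles = image.unfold(1, tile_size, tile_size).unfold(2, tile_size, tile_size)
    # (C, nH, nW, tile, tile) -> (nH * nW, C, tile, tile)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, tile_size, tile_size)

def pool_2x2(x: torch.Tensor) -> torch.Tensor:
    """Down-sample by replacing each 2x2 block with its mean (average pooling)."""
    return F.avg_pool2d(x, kernel_size=2, stride=2)

high_res = torch.randn(3, 1000, 1400)   # an image larger than the encoder limit
tiles = dynamic_tile(high_res)          # 3 x 4 grid -> 12 tiles of 448 x 448
pooled = pool_2x2(tiles)                # each tile reduced to 224 x 224
print(tiles.shape, pooled.shape)
```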

Training Process

Stage 1 – Warm‑up (synthetic image data)

Only the projector is trained; the vision encoder and LLM remain frozen. Training uses one million small synthetic images paired with generated captions, optimized with a standard next‑token prediction objective on the caption text.
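
A minimal sketch of this Stage 1 setup, assuming a model object with `vision_encoder`, `projector`, and `llm` submodules (hypothetical attribute names, not the repository’s API), could look like this:

```python
import torch

def configure_stage1(model):
    """Freeze the vision encoder and LLM; train only the projector."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.llm.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    # Only projector weights reach the optimizer; the learning rate is a placeholder.
    return torch.optim.AdamW(model.projector.parameters(), lr=1e-4)
```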

Stage 2 – Mid‑training (large‑scale image & video data)

The vision encoder, LLM, and projector are unfrozen and co‑trained on a 64.7 M‑sample dataset containing images and videos. Images may be split into up to 16 tiles; videos are sampled at 16 frames. Each tile or frame undergoes the same average‑pooling down‑sampling. Supervision consists of synthetic captions and synthetic question‑answer pairs generated by a pure‑text LLM.
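
Continuing the same hypothetical model object from the Stage 1 sketch, Stage 2 simply unfreezes everything and co‑trains it. The per‑component learning rates below are placeholders, not values from the paper.

```python
import torch

def configure_stage2(model):
    """Unfreeze all three components and co-train them with one optimizer."""
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.AdamW([
        {"params": model.vision_encoder.parameters(), "lr": 2e-5},
        {"params": model.projector.parameters(),      "lr": 1e-4},
        {"params": model.llm.parameters(),            "lr": 2e-5},
    ])
```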

Stage 3 – Supervised Fine‑Tuning (human‑annotated data)

The final stage uses 14 M high‑resolution, human‑annotated samples that include challenging video QA tasks. Dynamic tiling is extended to up to 36 tiles for images and 32 frames for videos. This phase applies supervised fine‑tuning, in which the model learns to generate answers to the provided questions.
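
A common way to implement this kind of answer‑only supervision is to mask the question tokens out of the next‑token loss. The sketch below illustrates that pattern as an assumption; the article does not specify the exact loss masking used.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Next-token cross-entropy computed over answer tokens only.
    logits: (seq, vocab) for one sample; input_ids: (seq,) token ids;
    prompt_len: number of question/prompt tokens excluded from the loss."""
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:-1]
    shift_labels = input_ids[1:].clone()
    # Ignore loss on the prompt (question) portion.
    shift_labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```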

Performance Evaluation

PLM is compared against both closed‑source and open‑source baselines on image and video benchmarks.

Image tasks: PLM achieves competitive scores across captioning, hard perception, and related categories, often surpassing the latest open‑source models of comparable size.

Video tasks: PLM shows comparable or superior performance on video QA benchmarks, with several metrics exceeding the best known results. Improvements are especially notable on hard QA, although error rates remain high.

References

Paper: https://arxiv.org/abs/2504.13180
Perception Encoder: https://arxiv.org/abs/2504.13181
Code repository: https://github.com/facebookresearch/perception_models

Tags: Open‑Source AI · Llama 3 · Vision‑Language Model · Multimodal Training · Dynamic Tiling · PerceptionLM
Written by

AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
