Inside Meta’s PerceptionLM: A Deep Dive into Open‑Source Vision‑Language Models

This article provides a detailed analysis of Meta’s PerceptionLM (PLM), an open‑source perception language model built on Llama 3, describing its vision encoder, projector, dynamic tiling, three‑stage training pipeline, model variants, and competitive performance on image and video benchmarks.


Architecture Design

The backbone is a pretrained large language model (LLM), Llama 3. Because LLMs cannot ingest raw visual data, two additional components are added:

Perception Encoder (PE) – a vision encoder, described in a separate paper (see References), that extracts visual embeddings from images and videos.

Projector – a two‑layer MLP that maps visual embeddings into the LLM’s embedding space for cross‑modal alignment.

Text is tokenized by Llama 3’s tokenizer. After projection, visual embeddings are concatenated with token embeddings and fed jointly to the LLM, enabling simultaneous multimodal reasoning.
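
As a rough illustration, the sketch below shows how a two‑layer MLP projector and the visual/text concatenation could look in PyTorch. The dimensions, module names, and activation are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Two-layer MLP mapping vision-encoder embeddings into the LLM's
    embedding space (dimensions below are illustrative, not the real config)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_embeds: torch.Tensor) -> torch.Tensor:
        # visual_embeds: (batch, num_visual_tokens, vision_dim)
        return self.mlp(visual_embeds)

# Projected visual tokens are concatenated with text token embeddings
# and fed to the LLM as a single sequence.
batch, n_vis, n_txt = 2, 256, 32
visual_embeds = torch.randn(batch, n_vis, 1024)
text_embeds = torch.randn(batch, n_txt, 4096)   # from Llama 3's embedding table

projector = Projector()
multimodal_input = torch.cat([projector(visual_embeds), text_embeds], dim=1)
print(multimodal_input.shape)  # torch.Size([2, 288, 4096])
```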

Model Variants

Base models: Llama 3.2 backbones with 1 B and 3 B parameters, each paired with a 300 M‑parameter vision encoder (PE‑L).

Enhanced model: an 8 B‑parameter Llama 3.1 backbone paired with a 1.9 B‑parameter vision encoder (PE‑G).
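
For quick reference, the variants above can be summarized as a small configuration mapping in code; the dictionary keys and labels are informal shorthand taken from this article, not identifiers from the official repository.

```python
# Informal summary of the PLM variants described above.
PLM_VARIANTS = {
    "PLM-1B": {"llm": "Llama 3.2 1B", "vision_encoder": "PE-L (~300M params)"},
    "PLM-3B": {"llm": "Llama 3.2 3B", "vision_encoder": "PE-L (~300M params)"},
    "PLM-8B": {"llm": "Llama 3.1 8B", "vision_encoder": "PE-G (~1.9B params)"},
}
```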

High‑Resolution Support

When an input exceeds the encoder’s resolution limit, dynamic tiling splits the image into multiple tiles. Each tile is then down‑sampled by average pooling (each 2×2 block is replaced by its mean). For video, 32 frames are extracted and the same pooling is applied to each frame.
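
A simplified sketch of tiling plus 2×2 average pooling is shown below. The tile size, padding strategy, and grid layout are assumptions for illustration; the paper’s actual dynamic‑tiling policy (e.g. aspect‑ratio‑aware grid selection) is more involved.

```python
import torch
import torch.nn.functional as F

def dynamic_tile(image: torch.Tensor, tile_size: int = 448) -> torch.Tensor:
    """Split a (C, H, W) image into non-overlapping tile_size x tile_size tiles.
    Simplified illustration: pads to a multiple of tile_size instead of
    selecting an aspect-ratio-aware grid."""
    c, h, w = image.shape
    pad_h = (-h) % tile_size
    pad_w = (-w) % tile_size
    image = F.pad(image, (0, pad_w, 0, pad_h))
    tiles = image.unfold(1, tile_size, tile_size).unfold(2, tile_size, tile_size)
    # (C, nH, nW, tile, tile) -> (nH * nW, C, tile, tile)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, tile_size, tile_size)

def pool_2x2(x: torch.Tensor) -> torch.Tensor:
    """Down-sample by replacing each 2x2 block with its mean (average pooling)."""
    return F.avg_pool2d(x, kernel_size=2, stride=2)

high_res = torch.randn(3, 1000, 1400)   # an image larger than the encoder limit
tiles = dynamic_tile(high_res)          # 3 x 4 grid -> 12 tiles of 448 x 448
pooled = pool_2x2(tiles)                # each tile reduced to 224 x 224
print(tiles.shape, pooled.shape)
```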

Training Process

Stage 1 – Warm‑up (synthetic image data)

Only the projector is trained; the vision encoder and LLM remain frozen. Training uses one million small synthetic images paired with generated captions, optimized with a standard next‑token prediction objective on the caption text.
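
A minimal sketch of this Stage 1 setup, assuming a model object with `vision_encoder`, `projector`, and `llm` submodules (hypothetical attribute names, not the repository’s API), could look like this:

```python
import torch

def configure_stage1(model):
    """Freeze the vision encoder and LLM; train only the projector."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.llm.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    # Only projector weights reach the optimizer; the learning rate is a placeholder.
    return torch.optim.AdamW(model.projector.parameters(), lr=1e-4)
```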

Stage 2 – Mid‑training (large‑scale image & video data)

The vision encoder, LLM, and projector are unfrozen and co‑trained on a 64.7 M‑sample dataset containing images and videos. Images may be split into up to 16 tiles; videos are sampled at 16 frames. Each tile or frame undergoes the same average‑pooling down‑sampling. Supervision consists of synthetic captions and synthetic question‑answer pairs generated by a pure‑text LLM.
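
Continuing the same hypothetical model object from the Stage 1 sketch, Stage 2 simply unfreezes everything and co‑trains it. The per‑component learning rates below are placeholders, not values from the paper.

```python
import torch

def configure_stage2(model):
    """Unfreeze all three components and co-train them with one optimizer."""
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.AdamW([
        {"params": model.vision_encoder.parameters(), "lr": 2e-5},
        {"params": model.projector.parameters(),      "lr": 1e-4},
        {"params": model.llm.parameters(),            "lr": 2e-5},
    ])
```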

Stage 3 – Supervised Fine‑Tuning (human‑annotated data)

The final stage uses 14 M high‑resolution, human‑annotated samples that include challenging video QA tasks. Dynamic tiling is extended to up to 36 tiles for images and 32 frames for videos. This phase applies supervised fine‑tuning, in which the model learns to generate answers to the provided questions.
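
A common way to implement this kind of answer‑only supervision is to mask the question tokens out of the next‑token loss. The sketch below illustrates that pattern as an assumption; the article does not specify the exact loss masking used.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Next-token cross-entropy computed over answer tokens only.
    logits: (seq, vocab) for one sample; input_ids: (seq,) token ids;
    prompt_len: number of question/prompt tokens excluded from the loss."""
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:-1]
    shift_labels = input_ids[1:].clone()
    # Ignore loss on the prompt (question) portion.
    shift_labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```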

Performance Evaluation

PLM is compared against both closed‑source and open‑source baselines on image and video benchmarks.

Image tasks: PLM achieves competitive scores across captioning, hard perception, and related categories, often surpassing the latest open‑source models of comparable size.

Video tasks: PLM shows comparable or superior performance on video QA benchmarks, with several metrics exceeding the best known results. Improvements are especially notable on hard QA, although error rates remain high.

References

Paper: https://arxiv.org/abs/2504.13180
Perception Encoder: https://arxiv.org/abs/2504.13181
Code repository: https://github.com/facebookresearch/perception_models

Tags: Open‑Source AI · Llama 3 · Vision‑Language Model · Multimodal Training · Dynamic Tiling · PerceptionLM
Written by

AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
