PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers
This article introduces PreFLMR, an open‑source, general‑purpose pre‑trained multimodal retriever that uses fine‑grained late interaction to strengthen retrieval‑augmented generation for knowledge‑intensive visual tasks. It also covers the accompanying M2KR benchmark, the four‑stage training pipeline, and strong experimental results across multiple tasks.
PreFLMR is a universal pre‑trained multimodal knowledge retriever designed for building multimodal Retrieval‑Augmented Generation (RAG) applications. It builds on the Fine‑grained Late‑interaction Multi‑modal Retriever (FLMR) presented at NeurIPS 2023, adding model improvements and large‑scale pre‑training on the M2KR dataset.
Background: While large multimodal models such as GPT‑4‑Vision and Gemini excel at general image‑text understanding, they still struggle with knowledge‑intensive queries. RAG offers a solution by first retrieving relevant documents from specialized corpora and then feeding them to a large model for answer generation. The effectiveness of this pipeline hinges on the retriever’s ability to recall the right knowledge.
PreFLMR, released by the Cambridge AI Lab, addresses this need with three main advantages:
It supports multiple sub‑tasks (text‑text, image‑text, knowledge retrieval) as a single pre‑trained model, achieving strong performance after modest fine‑tuning on domain‑specific data.
It overcomes the limitations of dense passage retrieval (DPR) by using token‑level representations instead of a single vector, preserving fine‑grained information crucial for multimodal queries.
It can follow user instructions (e.g., “extract documents relevant to the question” or “extract documents related to the image”) to retrieve appropriate evidence for downstream models.
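The contrast with DPR's single-vector matching can be made concrete with a toy MaxSim scorer (a minimal NumPy sketch over random embeddings, not the FLMR implementation):

```python
import numpy as np

def late_interaction_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """MaxSim late-interaction relevance: each query-token embedding takes its
    maximum similarity over all document-token embeddings; the maxima are summed.
    Both inputs are (num_tokens, dim) and assumed L2-normalized."""
    sim = query_tokens @ doc_tokens.T        # (Nq, Nd) token-pair similarities
    return float(sim.max(axis=1).sum())      # best document match per query token, summed

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = l2norm(rng.normal(size=(2, 4)))                      # 2 query tokens, dim 4
doc_a = np.vstack([query, l2norm(rng.normal(size=(3, 4)))])  # contains the query tokens verbatim
doc_b = l2norm(rng.normal(size=(5, 4)))                      # unrelated random tokens

# doc_a embeds the query tokens exactly, so it must score higher than doc_b
assert late_interaction_score(query, doc_a) > late_interaction_score(query, doc_b)
```

A DPR-style retriever would instead compress each side into one vector and take a single dot product; token-level scoring is precisely what avoids that compression and preserves fine-grained matches.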
The authors open‑source three model sizes: PreFLMR_ViT‑B (207 M parameters), PreFLMR_ViT‑L (422 M), and PreFLMR_ViT‑G (2 B), allowing users to choose based on resource constraints.
Additional Contributions:
A large‑scale multimodal retrieval benchmark called Multi‑task Multi‑modal Knowledge Retrieval (M2KR), which aggregates ten public datasets into a unified question‑document format, totaling over one million retrieval pairs.
An extensive analysis of image and text encoder scaling, showing that increasing visual encoder size yields larger gains than scaling the text encoder for late‑interaction models.
M2KR Dataset: The benchmark unifies tasks such as image captioning and multimodal dialogue into a consistent retrieval format, providing diverse evaluation scenarios.
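As an illustration of this unification (field names here are assumed for exposition; the official M2KR schema may differ), casting heterogeneous tasks into one question-document retrieval format might look like:

```python
# Hedged sketch: illustrative field names for the "instruction + question ->
# gold document" framing; not the official M2KR schema.
def to_retrieval_pair(instruction, question, document, image_path=None):
    return {
        "instruction": instruction,
        "question": question,        # may be empty for caption-style tasks
        "image_path": image_path,    # None for text-only sub-tasks
        "pos_document": document,    # gold passage/caption to retrieve
    }

# An image-captioning sample becomes image -> caption retrieval.
caption_pair = to_retrieval_pair(
    "Extract documents related to the image.", "",
    "A brown dog runs across a snowy field.", "images/000123.jpg")

# A knowledge-based VQA sample becomes (image, question) -> passage retrieval.
vqa_pair = to_retrieval_pair(
    "Extract documents relevant to the question.",
    "When was this landmark built?",
    "The tower was completed in 1889.", "images/000456.jpg")

assert caption_pair["question"] == ""
```

The instruction field is what lets a single retriever switch between sub-tasks, matching the instruction-following behavior described above.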
The PreFLMR training pipeline consists of four stages:
Text encoder pre‑training on MS MARCO for dense text‑text retrieval.
Image‑text projection pre‑training on M2KR while freezing other components.
Continual pre‑training on a high‑quality visual‑question‑answering task (E‑VQA) to enhance fine‑grained knowledge retrieval.
General retrieval training on the full M2KR dataset, freezing the image encoder and jointly fine‑tuning query and document text encoders.
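The stage-wise freezing in the pipeline above can be sketched with a toy model (module names are illustrative stand-ins, not the actual PreFLMR classes):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Toggle gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Hypothetical retriever holding the three components the stages refer to.
class Retriever(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.image_encoder = nn.Linear(dim, dim)  # stands in for the ViT
        self.text_encoder = nn.Linear(dim, dim)   # stands in for the query/document text encoders
        self.projection = nn.Linear(dim, dim)     # image-text mapping network

model = Retriever()

# Stage 2: train only the image-text projection; freeze everything else.
set_trainable(model, False)
set_trainable(model.projection, True)

# Stage 4: freeze the image encoder; fine-tune the text side (and projection).
set_trainable(model, True)
set_trainable(model.image_encoder, False)

assert all(not p.requires_grad for p in model.image_encoder.parameters())
```

Freezing via `requires_grad` keeps the optimizer from touching the frozen weights, so each stage only adapts the components named in the paper's recipe.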
Experiments show that the best PreFLMR model (ViT‑G + ColBERT‑base) with ~2 B parameters outperforms baselines on seven M2KR sub‑tasks. Scaling the visual encoder yields larger performance gains than scaling the text encoder, and a simple cross‑attention layer suffices for image‑text projection.
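A minimal version of such a cross-attention projection, with learned query vectors attending over visual tokens (dimensions and names are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ImageTextProjection(nn.Module):
    """Sketch: map ViT patch features into the text-embedding space with a
    single cross-attention layer. Learned queries attend over visual tokens."""
    def __init__(self, vis_dim=16, txt_dim=8, n_queries=4, n_heads=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, txt_dim))
        self.kv_proj = nn.Linear(vis_dim, txt_dim)  # bring visual features to txt_dim
        self.attn = nn.MultiheadAttention(txt_dim, n_heads, batch_first=True)

    def forward(self, vis_tokens):                  # (B, n_patches, vis_dim)
        kv = self.kv_proj(vis_tokens)               # (B, n_patches, txt_dim)
        q = self.queries.expand(vis_tokens.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)               # (B, n_queries, txt_dim)
        return out

proj = ImageTextProjection()
visual = torch.randn(2, 10, 16)                     # batch of 2 images, 10 patches each
tokens = proj(visual)
assert tokens.shape == (2, 4, 8)                    # fixed-length visual tokens in text space
```

The output tokens can then participate in late-interaction scoring alongside text tokens, which is why such a lightweight projection is enough.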
In knowledge‑intensive visual QA tasks, integrating PreFLMR into RAG pipelines improves results dramatically (e.g., 94 % gain on Infoseek and 275 % on E‑VQA), allowing smaller models such as BLIP‑2 to surpass much larger systems like PALI‑X and PaLM‑Bison+Lens.
Conclusion: PreFLMR is the first open‑source, general‑purpose late‑interaction multimodal retriever, demonstrating strong performance after massive pre‑training on M2KR. The dataset, model weights, and code are publicly available.
Resources:
FLMR paper (NeurIPS 2023): https://proceedings.neurips.cc/paper/2023/hash/47393e8594c82ce8fd83adc672cf9872-Abstract-Conference.html
Code repository: https://github.com/LinWeizheDragon/Retrieval-Augmented-Visual-Question-Answering
English blog post: https://www.jinghong-chen.net/preflmr-sota-open-sourced-multi/
FLMR overview: https://www.jinghong-chen.net/fined-grained-late-interaction-multimodal-retrieval-flmr/