How Retrv-R1 Redefines Universal Multimodal Retrieval with Reasoning‑Driven MLLM

Retrv‑R1, a reasoning‑driven multimodal large language model (MLLM) framework, tackles the precision‑efficiency dilemma of universal multimodal retrieval by introducing a coarse‑to‑fine two‑stage pipeline, an information‑compression module, a detail‑inspection mechanism, and a three‑stage training strategy, achieving state‑of‑the‑art accuracy, efficiency, and generalization across benchmarks.

Tencent Advertising Technology

Background and Motivation

Universal multimodal retrieval, which must handle text, images, and other modalities simultaneously, has long suffered from a precision‑efficiency trade‑off. Traditional embedding‑similarity methods lack accuracy, while converting retrieval into a pure MLLM QA task sacrifices reasoning ability, especially in complex scenarios.

Limitations of Prior Approaches

Embedding‑based retrieval suffers from low robustness and errors in challenging cases. Directly applying MLLM‑based QA lacks explicit reasoning, leading to hallucinations and poor generalization. Although DeepSeek‑R1 demonstrated that reinforcement learning (RL) can boost LLM reasoning, naïvely extending it to multimodal retrieval causes token explosion, unstable convergence, and inaccurate reasoning steps.

Retrv‑R1: A Reasoning‑Driven MLLM Framework

Retrv‑R1 is the first R1‑style multimodal LLM specifically designed for universal retrieval. Its core innovations are:

Two‑Stage Retrieval (Coarse‑to‑Fine): A fast coarse filter selects the top‑K candidates using embedding similarity, followed by a fine‑grained reasoning stage in which an optimized MLLM generates step‑by‑step reasoning to pick the final answer.
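The coarse‑to‑fine idea can be pictured with a minimal sketch (function names are hypothetical; in Retrv‑R1 the reranker role is played by the reasoning MLLM):

```python
import numpy as np

def coarse_filter(query_emb, candidate_embs, k):
    """Stage 1: rank candidates by cosine similarity to the query, keep top-K."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)[:k]

def two_stage_retrieve(query_emb, candidate_embs, k, rerank_fn):
    """Stage 2: hand the top-K shortlist to a reasoning reranker, which
    returns the position of its final pick within the shortlist."""
    topk = coarse_filter(query_emb, candidate_embs, k)
    best_local = rerank_fn(topk)  # stand-in for the MLLM's reasoned choice
    return topk[best_local]
```

The cheap embedding pass keeps the expensive reasoning pass bounded: the MLLM only ever sees K candidates, however large the gallery is.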

Information Compression Module (ICM): Compresses each candidate into two tokens: a content token (summarizing the candidate itself) and a relation token (capturing its relevance to the query). Self‑alignment pre‑training ensures the compressed tokens retain critical retrieval information, reducing token consumption by >7×.
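As a toy NumPy stand‑in for this compression (the real ICM is a learned neural module; the pooling weights here are placeholders), each candidate's token sequence can be pooled into exactly two vectors, one query‑independent and one query‑conditioned:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def compress_candidate(cand_tokens, query_tokens, w_content):
    """Toy ICM: pool a candidate's (n, d) token matrix into two d-dim tokens.
    w_content is a stand-in for a learned pooling query vector."""
    # content token: self-summary of the candidate, independent of the query
    attn_c = softmax(cand_tokens @ w_content)
    content = attn_c @ cand_tokens
    # relation token: candidate tokens weighted by best similarity to any query token
    attn_r = softmax((cand_tokens @ query_tokens.T).max(axis=1))
    relation = attn_r @ cand_tokens
    return np.stack([content, relation])  # shape (2, d): two tokens per candidate
```

With, say, 15 tokens per candidate compressed to 2, the context the reranking MLLM must read shrinks by the >7× factor the article cites.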

Detail Inspection Mechanism (DIM): During CoT reasoning, the model inserts special tokens <inspection-index-start> and <inspection-index-end> to trigger full‑token retrieval for high‑difficulty candidates, enabling selective detailed checks.
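A rough sketch of how such markers could be consumed (the exact token syntax and control flow are assumptions, not the paper's implementation):

```python
import re

# Assumed marker format: the model emits <inspection-K-start> ... <inspection-K-end>
# around reasoning about candidate K that warrants a full-token look.
INSPECT = re.compile(r"<inspection-(\d+)-start>.*?<inspection-\1-end>", re.S)

def candidates_to_inspect(cot_text):
    """Collect the candidate indices the model flagged for detailed inspection."""
    return {int(m.group(1)) for m in INSPECT.finditer(cot_text)}

def build_second_pass_context(compressed, full_tokens, flagged):
    """Swap the 2-token summary for the full token sequence, but only for
    flagged candidates; the rest stay compressed."""
    return [full_tokens[i] if i in flagged else compressed[i]
            for i in range(len(compressed))]
```

The point of the design is that the expensive full‑token context is paid only for the few candidates the model itself judges hard, not for all K.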

Three‑Stage Training Strategy

ICM Pre‑training: Train ICM on the M‑BEIR dataset while freezing the MLLM, optimizing only the compression module.

SFT Activation: Synthesize four‑step CoT data (ideal answer, quick negative filtering, fine‑grained verification, final answer) using Qwen2.5‑VL‑72B and fine‑tune the model to follow the reasoning format.
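The four‑step format can be pictured with a toy template (the wording and field names are hypothetical; the paper's actual prompts to Qwen2.5‑VL‑72B are not reproduced here):

```python
# Illustrative four-step CoT target for SFT data synthesis.
COT_TEMPLATE = (
    "Step 1 (ideal answer): {ideal}\n"
    "Step 2 (quick negative filtering): {filtering}\n"
    "Step 3 (fine-grained verification): {verification}\n"
    "Step 4 (final answer): candidate {final}\n"
)

def render_cot(ideal, filtering, verification, final):
    """Fill the template to produce one SFT training target string."""
    return COT_TEMPLATE.format(ideal=ideal, filtering=filtering,
                               verification=verification, final=final)
```

Fine‑tuning on targets in a fixed format like this is what "activates" the reasoning structure before RL takes over.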

RL Fine‑tuning: Use GRPO to optimize a composite reward that balances accuracy, token efficiency, and format correctness, with a curriculum schedule in which early stages prioritize accuracy and later stages increase the efficiency weight.
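A minimal sketch of such a composite reward with a curriculum schedule (all weights and the linear schedule are illustrative placeholders, not the paper's values):

```python
def composite_reward(correct, tokens_used, token_budget, format_ok, step, total_steps):
    """Toy composite reward: accuracy + token efficiency + format correctness,
    with a curriculum that shifts weight from accuracy toward efficiency."""
    progress = step / total_steps
    w_acc = 1.0 - 0.3 * progress          # accuracy dominates early training
    w_eff = 0.1 + 0.4 * progress          # efficiency weight grows over training
    r_acc = 1.0 if correct else 0.0
    r_eff = max(0.0, 1.0 - tokens_used / token_budget)  # fewer tokens, more reward
    r_fmt = 0.1 if format_ok else -0.1     # small bonus/penalty for the CoT format
    return w_acc * r_acc + w_eff * r_eff + r_fmt
```

Under this shape of schedule, a verbose but correct rollout scores well early on, while late in training the same rollout is penalized relative to a terse correct one, which matches the inspection‑frequency curve described below.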

Experimental Evaluation

Experiments were conducted on 16 × A100 GPUs, comparing Retrv‑R1 against general MLLMs (BLIP‑2, Qwen2.5‑VL), R1‑style models (Vision‑R1, VLM‑R1), and retrieval‑specific MLLMs (MM‑Embed, LamRA). Results show:

Superior accuracy on all benchmark tasks; even the 3B‑parameter Retrv‑R1‑3B outperforms larger 7B baselines.

Significant efficiency gains: lower inference time (ITR) and GPU memory usage (GMR) across varying K values.

Robust generalization to unseen datasets, held‑out tasks, and multimodal recommendation scenarios.

State‑of‑the‑art performance in Retrieval‑Augmented Generation (RAG) tasks, improving both retrieval and visual‑question‑answering metrics.

RL Fine‑tuning Insights

During RL fine‑tuning, the frequency of DIM checks first rises then falls, indicating the model initially emphasizes accuracy via extensive inspection and later shifts toward efficiency as the reward weight changes. The model also learns flexible reasoning patterns, such as self‑reflection to rescue false negatives and explicit prompts when no correct candidate is found.

Conclusion and Future Work

Retrv‑R1 demonstrates that a reasoning‑driven MLLM framework can simultaneously achieve high precision and efficiency for universal multimodal retrieval, with strong generalization to diverse tasks including RAG. Future directions include extending the compression paradigm to longer contexts and multi‑turn interactive retrieval.

References

[1] L. Zhu et al., "Retrv‑R1: A Reasoning‑Driven MLLM Framework for Universal and Efficient Multimodal Retrieval," arXiv:2510.02745, 2025.

[2] Y. Liu et al., "LamRA: Large multimodal model as your advanced retrieval assistant," CVPR, 2025.

[3] S. Bai et al., "Qwen2.5‑VL technical report," arXiv:2502.13923, 2025.

[4] A. Baldrati et al., "Zero‑shot composed image retrieval with textual inversion," ICCV, 2023.

[5] S. Chun et al., "Probabilistic embeddings for cross‑modal retrieval," CVPR, 2021.

[6] Z. Fu et al., "Linguistic‑aware patch slimming for fine‑grained cross‑modal alignment," CVPR, 2024.

[7] D. Guo et al., "DeepSeek‑R1: Incentivizing reasoning capability in LLMs via RL," arXiv:2501.12948, 2025.

[8] D. Ji et al., "Raven: Robust advertisement video violation temporal grounding via reinforcement reasoning," ACL Industry Track, 2025.

[9] D. Ji et al., "Discrete latent perspective learning for segmentation and detection," arXiv:2406.10475, 2024.

[10] D. Ji et al., "Tree‑of‑table: Unleashing the power of LLMs for large‑scale table understanding," arXiv:2411.08516, 2024.

[11] H. Tan et al., "Reason‑RFT: Reinforcement fine‑tuning for visual reasoning," arXiv:2503.20752, 2025.

[12] W. Huang et al., "Vision‑R1: Incentivizing reasoning capability in multimodal LLMs," arXiv:2503.06749, 2025.

[13] S. Zhang et al., "LLaVA‑Mini: Efficient image and video multimodal models with one vision token," arXiv:2501.03895, 2025.

[14] J. Li et al., "BLIP‑2: Bootstrapping language‑image pre‑training with frozen image encoders and LLMs," ICML, 2023.

[15] H. Shen et al., "VLM‑R1: A stable and generalizable R1‑style large vision‑language model," arXiv:2504.07615, 2025.

[16] S. Lin et al., "MM‑Embed: Universal multimodal retrieval with multimodal LLMs," ICLR, 2025.

Figure 2-1: Retrv‑R1 core idea diagram
Figure 3-1: Performance comparison
Figure 3-5: RL fine‑tuning analysis
Figure 4: Concluding illustration