How BGE’s New Code and Multimodal Vector Models Set New Retrieval Benchmarks
The article introduces three BGE vector models (BGE-Code-v1, BGE-VL-v1.5, and BGE-VL-Screenshot), detailing their architectures, open-source resources, benchmark results on CoIR, CodeRAG-Bench, MMEB, and MVRB, and their impact on code and multimodal retrieval research.
BGE-Code-v1: A Next‑Generation Code Embedding Model
Built on the Qwen2.5-Coder-1.5B base, BGE-Code-v1 targets code-centric retrieval tasks while retaining strong multilingual text understanding. It is trained on the CoIR corpus and large-scale synthetic code-text pairs, with curriculum learning that incorporates retrieval and STS data from BGE-gemma2-multilingual to preserve general text capability.
Model URL: https://huggingface.co/BAAI/bge-code-v1
Project repository: https://github.com/FlagOpen/FlagEmbedding/tree/master/research/BGE_Coder
Paper: https://arxiv.org/abs/2505.12697

On the CoIR benchmark (covering 14 programming languages and 8 sub-tasks) and CodeRAG-Bench, BGE-Code-v1 outperforms commercial and open-source rivals from Google, Voyage AI, Salesforce, and Jina, achieving state-of-the-art (SOTA) scores.
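As a usage illustration, the sketch below loads the checkpoint through Hugging Face transformers and scores a natural-language query against a code snippet. Last-token pooling, right padding, the 4096-token limit, and the absence of an instruction prefix are assumptions based on common practice for decoder-based embedders; the official FlagEmbedding wrapper may apply its own prompt template and pooling.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch only: pooling and prompt handling are assumptions, not the official recipe.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-code-v1")
tokenizer.padding_side = "right"  # assumed, so last-token indexing below is valid
model = AutoModel.from_pretrained("BAAI/bge-code-v1")
model.eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=4096, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state              # (B, T, H)
    # Last-token pooling: take the final non-padding position of each sequence.
    last = batch["attention_mask"].sum(dim=1) - 1              # (B,)
    pooled = hidden[torch.arange(hidden.size(0)), last]        # (B, H)
    return torch.nn.functional.normalize(pooled, dim=-1)

query = embed(["Find the function that parses a URL query string"])
code = embed(["def parse_qs(qs): return dict(p.split('=') for p in qs.split('&'))"])
print((query @ code.T).item())  # cosine similarity as relevance score
```

Because the vectors are L2-normalized, the dot product above is directly usable as a ranking score across a corpus of code snippets.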
BGE-VL-v1.5: General Multimodal Retrieval Model
Derived from LLaVA‑1.6 (7.57 B parameters), BGE‑VL‑v1.5 enhances image‑text understanding and retrieval capability. Training combines 3 M image‑caption pairs from MegaPairs with an additional 1 M natural and synthetic samples covering image captioning, visual QA, and classification.
Model URL: https://huggingface.co/BAAI/BGE-VL-v1.5-zs
Project repository: https://github.com/FlagOpen/FlagEmbedding/tree/master/research/BGE_VL
Paper: https://arxiv.org/abs/2412.14475

Zero-shot evaluation on the MMEB benchmark shows BGE-VL-v1.5-zs achieving the best zero-shot performance, while the fine-tuned BGE-VL-v1.5-MMEB reaches a score of 72.16, topping the leaderboard across image retrieval, multimodal matching, and cross-modal recommendation tasks.
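To make the composed retrieval setting concrete, here is a minimal ranking sketch over precomputed embeddings. The query and gallery vectors are random placeholders standing in for BGE-VL-v1.5 outputs (e.g., an image plus an editing instruction on the query side); only the normalize-and-rank logic is shown.

```python
import numpy as np

def rank_candidates(query_vec: np.ndarray, cand_vecs: np.ndarray, k: int = 5):
    """Return top-k candidate indices and scores by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    scores = c @ q                         # cosine similarity per candidate
    order = np.argsort(-scores)[:k]        # highest-scoring candidates first
    return order, scores[order]

# Placeholder embeddings: one composed query against a 1,000-image gallery.
rng = np.random.default_rng(0)
query = rng.normal(size=768)
gallery = rng.normal(size=(1000, 768))
top_idx, top_scores = rank_candidates(query, gallery)
print(top_idx, top_scores)
```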
BGE-VL‑Screenshot: Visual Document Embedding Model
Based on Qwen2.5‑VL‑3B‑Instruct, this model is trained on seven data sources (news, e‑commerce, papers, documents, project pages, etc.), amassing over 13 M screenshots and 7 M annotated screenshot‑QA pairs. It targets Vis‑IR scenarios where images, text, and graphical elements coexist.
Model URL: https://huggingface.co/BAAI/BGE-VL-Screenshot
Project repository: https://github.com/FlagOpen/FlagEmbedding/tree/master/research/BGE_VL_Screenshot
Paper: https://arxiv.org/abs/2502.11431

On the newly introduced MVRB benchmark (20 datasets covering screenshot retrieval, composite screenshot retrieval, screenshot QA, and open-category classification), BGE-VL-Screenshot attains an overall score of 60.61, setting a new SOTA and demonstrating strong multilingual performance after limited query-to-screenshot training.
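Benchmarks like MVRB ultimately reduce to ranking screenshot embeddings against query embeddings. As a reference point, here is a minimal recall@k computation over a query-by-screenshot similarity matrix; the scores and gold labels below are synthetic placeholders, not MVRB data or its official metric implementation.

```python
import numpy as np

def recall_at_k(scores: np.ndarray, gold: np.ndarray, k: int = 10) -> float:
    """scores: (num_queries, num_screenshots) similarity matrix.
    gold: index of the single relevant screenshot for each query."""
    topk = np.argsort(-scores, axis=1)[:, :k]        # top-k candidates per query
    hits = (topk == gold[:, None]).any(axis=1)       # was the gold item retrieved?
    return float(hits.mean())

# Synthetic placeholder data for illustration only.
rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 5000))
gold = rng.integers(0, 5000, size=100)
print(f"Recall@10: {recall_at_k(scores, gold):.3f}")
```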
Overall Impact and Availability
All three models are fully open-source. The BGE series has collectively amassed over 600 million downloads, was the first Chinese model family to top Hugging Face's leaderboard, and was the platform's most downloaded model of 2023. The models are widely adopted in Retrieval-Augmented Generation (RAG), neural search, and other AI-driven retrieval pipelines.