Conan-Embedding-V2: A 1.4B LLM‑Based Multilingual Embedding Model Achieving SOTA on MTEB
Conan‑Embedding‑V2 is a newly trained 1.4B‑parameter LLM‑based embedding model with a custom tokenizer, a 32k‑token context window, a SoftMask training strategy, cross‑lingual retrieval data, and dynamic hard‑negative mining. It delivers state‑of‑the‑art multilingual embeddings, surpassing larger models on both the English and Chinese MTEB benchmarks while remaining compact and fast.
Embedding models are a crucial component of Retrieval‑Augmented Generation (RAG). In August 2024 the team released Conan‑Embedding‑V1, which achieved SOTA on the CMTEB leaderboard and was open‑sourced on HuggingFace. Building on that success, the team now introduces Conan‑Embedding‑V2. It is built on a newly trained 1.4B‑parameter large language model (Conan‑1.4B) and reaches SOTA performance on both the Chinese and English MTEB benchmarks, surpassing larger models from NVIDIA and Qwen.
Background : The V1 model was based on a generic bidirectional BERT backbone. V2, however, trains an original tokenizer and model architecture from scratch, enabling multilingual (Chinese‑English and many other languages) embedding capabilities and cross‑language retrieval.
New Features :
Language support: from Chinese‑only SOTA to Chinese‑English SOTA and broader multilingual ability.
Cross‑language retrieval: Chinese queries can retrieve English documents and vice versa.
Context length: increased from 512 tokens to 32 k tokens.
Base model: switched from a pretrained BERT to a custom‑trained Conan‑1.4B LLM.
Model Architecture : Conan‑1.4B consists of 8 attention layers, a hidden size of 3584, and a maximum context of 32 k tokens, totaling 1.4 B parameters while providing a high‑dimensional embedding space.
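The stated configuration can be sanity‑checked with standard transformer parameter arithmetic. The sketch below assumes a conventional dense transformer layer (~12·h² parameters per layer) and an illustrative vocabulary size, since the actual tokenizer size is not given here.

```python
# Rough parameter estimate for a Conan-1.4B-style transformer
# (8 layers, hidden size 3584). The vocabulary size below is an
# assumption for illustration; the real tokenizer size is not stated.
hidden = 3584
layers = 8
vocab = 50_000  # assumed

# Standard transformer layer: ~4*h^2 for attention (Q, K, V, O)
# plus ~8*h^2 for a 4h feed-forward block -> ~12*h^2 per layer.
per_layer = 12 * hidden ** 2
embedding = vocab * hidden

total = layers * per_layer + embedding
print(f"~{total / 1e9:.2f}B parameters")  # → ~1.41B parameters
```

Under these assumptions the layer stack alone accounts for roughly 1.23B parameters, with the embedding table supplying the rest, which is consistent with the reported 1.4B total.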
Training Process is divided into four stages:
LLM pre‑training : Approximately 3 T tokens of generic data are used, with additional pair‑wise data for embedding alignment. Standard data filtering from InternLM2 is applied.
Weak‑supervision embedding training : The same query‑positive pair data as LLM supervised fine‑tuning (SFT) is used, but with a different format and loss. Queries are formed from instructions and inputs, while the positive paragraph is the output. Data quality is ensured by scoring with gte‑Qwen2‑7B‑instruct and discarding samples below 0.4.
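The quality filter described above can be sketched as a simple threshold over pair scores. The scorer below is a toy stand‑in; the paper uses gte‑Qwen2‑7B‑instruct to produce the actual scores.

```python
# Sketch of the weak-supervision quality filter: keep only
# query-positive pairs whose quality score is >= 0.4. The scorer
# here is a stand-in; the paper scores with gte-Qwen2-7B-instruct.
from typing import Callable

def filter_pairs(pairs, score_pair: Callable[[str, str], float],
                 threshold: float = 0.4):
    """Return pairs whose (query, positive) score clears the threshold."""
    return [(q, p) for q, p in pairs if score_pair(q, p) >= threshold]

# Toy scorer for demonstration: normalized word overlap.
def toy_score(query: str, passage: str) -> float:
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

pairs = [("what is RAG", "RAG is retrieval augmented generation"),
         ("capital of France", "bananas are yellow")]
kept = filter_pairs(pairs, toy_score)  # keeps only the first pair
```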
Supervised embedding training : Task‑specific fine‑tuning on four downstream tasks—retrieval, cross‑language retrieval, classification, and semantic textual similarity (STS). Retrieval‑type tasks use the classic InfoNCE loss; STS uses the CoSENT loss.
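The two loss functions named above can be written compactly. The sketch below uses NumPy on precomputed cosine similarities; the temperature and scale values are assumed, not taken from the paper.

```python
import numpy as np

def infonce(sim_pos: float, sim_negs: np.ndarray, tau: float = 0.05) -> float:
    """InfoNCE: negative log-softmax of the positive against negatives.
    tau is an assumed temperature."""
    logits = np.concatenate(([sim_pos], sim_negs)) / tau
    logits -= logits.max()  # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

def cosent(cos_sims: np.ndarray, labels: np.ndarray,
           scale: float = 20.0) -> float:
    """CoSENT: log(1 + sum over pairs (i, j) with label_i > label_j of
    exp(scale * (cos_j - cos_i))). scale is an assumed value."""
    diffs = [scale * (cos_sims[j] - cos_sims[i])
             for i in range(len(labels)) for j in range(len(labels))
             if labels[i] > labels[j]]
    return float(np.log1p(np.exp(diffs).sum())) if diffs else 0.0
```

Both losses shrink when the model ranks positives above negatives: InfoNCE pushes the positive's similarity above each negative's, while CoSENT only penalizes pairs whose cosine ordering disagrees with the gold similarity ordering.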
Dynamic hard negative mining and other strategies (see below) are applied throughout.
SoftMask Strategy : To bridge the gap between causal masks used in LLM training and bidirectional masks needed for embedding training, a soft‑mask mechanism is introduced. A scheduling function gradually transitions mask values from 0 to 1, allowing the model to adaptively learn attention weights. Additionally, a dynamic rank‑reduction technique sets the first k columns of the mask matrix to 1, controlling the rank and providing a regularization effect.
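The mechanism above can be sketched as a scheduled attention mask. The linear schedule below is an assumption (the exact scheduling function is not given here); the first‑k‑columns trick follows the rank‑reduction description.

```python
import numpy as np

def soft_mask(seq_len: int, step: int, total_steps: int,
              k: int = 0) -> np.ndarray:
    """SoftMask sketch: start from a causal mask (upper triangle = 0)
    and raise the upper-triangular entries toward 1 over training, so
    attention transitions from causal to bidirectional. Convention:
    1 = attend, 0 = blocked, soft values in between. The linear
    schedule here is an illustrative assumption."""
    alpha = min(step / max(total_steps, 1), 1.0)   # 0 -> 1 over training
    mask = np.tril(np.ones((seq_len, seq_len)))    # causal base
    mask += alpha * np.triu(np.ones((seq_len, seq_len)), k=1)
    mask[:, :k] = 1.0  # first k columns fully visible (rank control)
    return mask
```

At step 0 the mask is purely causal; by the end of training every position attends to every other, matching the bidirectional mask conventional embedding models use.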
Cross‑lingual Retrieval Dataset (CLR) : To enable multilingual representation learning, a new dataset is built by translating queries from existing retrieval corpora (e.g., MS‑MARCO) into 26 languages using the Qwen2.5‑7B translation model. This yields roughly 10 million query‑document pairs that align representations across languages.
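The pair‑construction step can be sketched as follows. The `translate` callable is a stand‑in for the Qwen2.5‑7B translation step; each translated query is paired with the original, untranslated document so that cross‑lingual representations align.

```python
# Sketch of cross-lingual retrieval (CLR) pair construction: translate
# each query into multiple target languages and pair every translation
# with the original document. `translate` is a stand-in for the
# Qwen2.5-7B translation model used in the paper.
def build_clr_pairs(pairs, languages, translate):
    out = []
    for query, doc in pairs:
        for lang in languages:
            out.append((translate(query, lang), doc))
    return out

# Toy translator for demonstration: tags the query with the language.
toy_translate = lambda q, lang: f"[{lang}] {q}"
clr = build_clr_pairs([("what is RAG", "RAG is ...")],
                      ["fr", "de"], toy_translate)
```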
Dynamic Hard Negative Mining (DHNM) : Detailed methodology is described in the Conan‑Embedding‑V1 technical report (arXiv:2408.15710).
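The core idea can be sketched as periodically refreshing negatives that have become too easy. The margin and pool logic below are illustrative assumptions; the exact replacement rules are in the V1 report (arXiv:2408.15710).

```python
# Sketch of dynamic hard-negative mining: during training, drop
# negatives whose similarity to the query has fallen well below the
# positive's (they are no longer "hard") and re-mine the hardest
# remaining candidates. Thresholds are illustrative assumptions.
def refresh_negatives(sim_pos, negatives, candidates, sim,
                      margin=0.3, k=2):
    """negatives/candidates: lists of docs; sim(doc) -> query similarity."""
    still_hard = [n for n in negatives if sim(n) > sim_pos - margin]
    needed = k - len(still_hard)
    if needed > 0:
        pool = sorted((c for c in candidates if c not in still_hard),
                      key=sim, reverse=True)
        still_hard += pool[:needed]
    return still_hard[:k]
```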
Data : Weak‑supervision data are harvested from news titles and article bodies, with extensive cleaning to remove low‑quality, duplicate, or harmful content. Supervised data cover five tasks: retrieval, re‑ranking, classification, clustering, and STS. A summary table of data usage is provided in the original article.
Experimental Results :
Main results : Conan‑Embedding‑V2 achieves SOTA on both the English and Chinese MTEB benchmarks, excelling in classification (English 91.11, Chinese 76.8) and reranking (English 51.49, Chinese 73.69) tasks.
Ablation studies : Removing any component (SoftMask, CLR, DHNM) degrades performance, confirming the synergistic effect of the full framework.
Comparison with mainstream models : With 1.4 B parameters and a 3584‑dimensional embedding, Conan‑Embedding‑V2 offers a favorable trade‑off between model size, inference speed, and accuracy.
Conclusion and Outlook : The paper presents the complete pipeline from tokenizer training, LLM pre‑training, to embedding fine‑tuning. By integrating SoftMask, CLR, and DHNM, Conan‑Embedding‑V2 delivers SOTA performance while keeping the model compact and fast. The authors invite the community to explore further applications of Conan‑Embedding in search, recommendation, and RAG scenarios.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.