ML-Embed’s 3D‑ML Framework Breaks the Three Barriers of Multilingual Embeddings
The paper presents ML‑Embed, a family of multilingual text‑embedding models trained with the 3D‑ML framework. It targets three problems (high computational cost, language‑coverage imbalance, and research opacity) by introducing the MEL, MLL, and MRL techniques and releasing a 50 M‑sample dataset covering 282 languages. The models reach SOTA on nine MTEB benchmarks, and the release is fully open‑source.
Problem Statement
Multilingual text‑embedding models nominally serve over 7,000 languages, yet speakers of Polish, Vietnamese, Persian, Hindi and many other languages receive less than one‑tenth of the performance available to English speakers. Three structural barriers cause this gap:
Computational cost: Training embedding models on top of large LLMs such as Qwen‑3 can involve hundreds of billions of parameters, which is beyond the reach of most research groups.
Language coverage imbalance: As of February 2026 only one model has a complete MTEB evaluation for Polish, while the English and multilingual leaderboards contain nearly 150 models each.
Research opacity: Top‑tier models are either closed‑source APIs or released as weights without training details, which hinders reproducibility.
3D‑ML (3‑Dimensional Matryoshka Learning)
ML‑Embed introduces a unified training framework that extends the Matryoshka (nesting‑doll) principle to three orthogonal dimensions, enabling end‑to‑end compression across training, inference and storage.
Dimension 1 – MEL (Matryoshka Embedding Learning)
The embedding layer of a Qwen‑3‑0.6B‑based model consumes one‑quarter of total parameters. MEL performs an SVD on the original embedding matrix, truncates it into two low‑rank factors, and updates only these small matrices during training. During each forward pass MEL dynamically samples a sub‑rank, forcing the model to pack the most critical information into the innermost dimensions.
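The factorization and sub‑rank sampling described above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the matrix sizes, the candidate sub‑ranks, and the `embed` helper are all hypothetical, and the random matrix only stands in for a pretrained embedding table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions); real vocabularies and hidden sizes are far larger.
vocab_size, hidden_size, max_rank = 1000, 64, 32

# Stand-in for a pretrained embedding matrix.
E = rng.standard_normal((vocab_size, hidden_size))

# SVD, then truncate to max_rank and split the singular values across two factors.
U, S, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :max_rank] * np.sqrt(S[:max_rank])          # (vocab_size, max_rank)
B = np.sqrt(S[:max_rank])[:, None] * Vt[:max_rank]   # (max_rank, hidden_size)

def embed(token_ids, rank):
    """Look up token embeddings using only the first `rank` components."""
    return A[token_ids, :rank] @ B[:rank, :]

# Candidate sub-ranks to sample from at each step (hypothetical choice).
sub_ranks = [4, 8, 16, 32]
token_ids = np.array([3, 17, 256])

rank = int(rng.choice(sub_ranks))   # sampled fresh for each forward pass
out = embed(token_ids, rank)        # shape: (3, hidden_size)

# Reconstruction error at each sub-rank; it shrinks as the rank grows.
errors = [np.linalg.norm(E - A[:, :r] @ B[:r, :]) for r in sub_ranks]
```

In a real training loop only `A` and `B` would receive gradients. Because every sampled sub‑rank uses a prefix of the same factors, the leading components are exercised on every step, which is what pushes the most important information into them.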
Deployment options:
Compatibility mode: Multiply the two factors to reconstruct the full matrix; no code changes are required.
Efficiency mode: Deploy the low‑rank factors directly, drastically reducing memory footprint for edge or resource‑constrained scenarios.
Compared with LoRA, MEL reduces both trainable and inference‑time parameters, achieving true end‑to‑end compression.
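A quick parameter count makes the difference between the two deployment modes and LoRA concrete. The vocabulary size, hidden size, and rank below are assumptions picked to resemble a Qwen‑3‑0.6B‑style embedding layer; the source does not state which rank the authors use.

```python
# Assumed sizes, chosen to resemble a Qwen-3-0.6B-style embedding layer.
vocab_size = 151_936
hidden_size = 1_024
rank = 256   # hypothetical MEL rank

full_params = vocab_size * hidden_size             # dense embedding matrix
factor_params = rank * (vocab_size + hidden_size)  # two low-rank factors

modes = {
    # LoRA trains small adapters, but merging them yields a full-size matrix.
    "LoRA": {"trainable": factor_params, "inference": full_params},
    # MEL compatibility mode: multiply the factors back into a full matrix.
    "MEL compatibility": {"trainable": factor_params, "inference": full_params},
    # MEL efficiency mode: ship the two factors as they are.
    "MEL efficiency": {"trainable": factor_params, "inference": factor_params},
}

for name, counts in modes.items():
    print(f"{name:18s} trainable={counts['trainable']:>12,} "
          f"inference={counts['inference']:>12,}")
```

Under these assumed sizes, only efficiency mode shrinks the embedding layer at inference time; compatibility mode trades that saving for drop‑in use with existing code.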
Dimension 2 – MLL (Matryoshka Layer Learning)
MLL applies loss functions to multiple intermediate layers simultaneously, allowing shallow sub‑models to perform the embedding task independently. At inference, adjusting the num_hidden_layers configuration yields models of varying depth without retraining or pruning: one model, N depths.
A logarithmic set of layer counts (e.g., {1, 2, 4, 8, 16, 32}) ensures coverage from shallow to deep, with each layer’s output normalized to maintain representation consistency.
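The "one model, N depths" idea can be illustrated with a toy layer stack. This is only a sketch: the residual tanh blocks stand in for transformer layers, mean pooling is an assumed pooling choice, and `log_layer_counts` and `encode` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, total_layers = 16, 32

# Toy "transformer layers": one residual tanh block per layer (an assumption).
weights = [rng.standard_normal((hidden_size, hidden_size)) * 0.1
           for _ in range(total_layers)]

def log_layer_counts(total):
    """Powers of two up to `total`, e.g. 32 -> [1, 2, 4, 8, 16, 32]."""
    counts, d = [], 1
    while d <= total:
        counts.append(d)
        d *= 2
    return counts

def encode(x, num_hidden_layers):
    """Run only the first `num_hidden_layers` layers, mean-pool, L2-normalize."""
    h = x
    for W in weights[:num_hidden_layers]:
        h = h + np.tanh(h @ W)
    pooled = h.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

tokens = rng.standard_normal((5, hidden_size))   # 5 token vectors of one input
depths = log_layer_counts(total_layers)
embeddings = {d: encode(tokens, d) for d in depths}
```

During training, a loss would be attached to each entry of `embeddings`; at inference, choosing a depth amounts to truncating the layer list, with no change to the weights.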
Dimension 3 – MRL (Matryoshka Representation Learning)
Following the Matryoshka Representation Learning paper (NeurIPS 2022), MRL jointly optimizes prefixes of different vector lengths during training so that truncated short vectors remain effective. Within 3D‑ML, MRL is tightly integrated with MLL: each MLL layer's output receives multiple MRL contrastive losses, one per prefix length.
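A minimal sketch of a prefix‑wise contrastive objective follows. It assumes an in‑batch InfoNCE loss with a temperature of 0.05 and equal weighting across prefixes; the source does not specify the exact loss, so `info_nce`, `mrl_loss`, and the prefix lengths are illustrative only.

```python
import numpy as np

def info_nce(q, d, temperature=0.05):
    """In-batch contrastive loss: q[i] should match d[i] against all other d[j]."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def mrl_loss(q, d, prefix_dims):
    """Average the contrastive loss over truncated prefixes of the vectors."""
    return sum(info_nce(q[:, :k], d[:, :k]) for k in prefix_dims) / len(prefix_dims)

rng = np.random.default_rng(0)
queries = rng.standard_normal((8, 64))
docs = queries + 0.1 * rng.standard_normal((8, 64))   # noisy positives
loss = mrl_loss(queries, docs, prefix_dims=[8, 16, 32, 64])
```

Each prefix is re‑normalized before scoring, so a truncated vector is evaluated exactly as it would be used after truncation at inference time.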
Unified Loss Function
The three components are jointly optimized, sharing a common representation function for each layer under each dimension.
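The summary does not reproduce the formula itself. One plausible way to write a joint objective of this kind, offered purely as an assumption, is to sum a contrastive loss over depths and prefix lengths while sampling the embedding sub‑rank. All symbols here are introduced for illustration: $\mathcal{D}$ is the set of layer counts, $\mathcal{K}$ the set of prefix lengths, $\mathcal{R}$ the distribution over sub‑ranks, $\lambda_{l,k}$ a per‑term weight, and $f_{l,r}(x)$ the representation of input $x$ from the first $l$ layers with embedding rank $r$.

```latex
\mathcal{L}_{\text{3D-ML}}
  = \mathbb{E}_{r \sim \mathcal{R}}
    \left[ \sum_{l \in \mathcal{D}} \sum_{k \in \mathcal{K}}
      \lambda_{l,k} \,
      \mathcal{L}_{\mathrm{con}}\!\left( f_{l,r}(x)_{1:k} \right) \right]
```

Here $f_{l,r}(x)_{1:k}$ denotes the first $k$ coordinates of the representation, and $\mathcal{L}_{\mathrm{con}}$ is a contrastive loss over a batch of such truncated vectors.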
Data Collection
ML‑Embed assembles a multilingual dataset from 121 public sources, totaling 50 M training samples that cover 282 natural languages (ISO‑639‑3) and over 40 programming languages. By contrast, the KaLM‑Embedding dataset is dominated by English (49.4 %) and Chinese (44.4 %).
Training Strategy
Stage 1: Warm up on ~27 M large‑scale retrieval samples to build basic semantic understanding.
Stage 2: Fine‑tune on a mixed set of ~8.3 M examples, injecting task instructions to boost multi‑task adaptability.
The total training data for ML‑Embed is only about one‑fifth of that used by comparable SOTA models.
Experimental Results
Across 17 MTEB benchmarks (430 tasks), the 8 B‑parameter ML‑Embed model sets new SOTA on nine benchmarks and attains Top‑5 performance on the English and multilingual leaderboards, showing a clear scaling trend from 140 M to 8 B parameters.
Gains by language or language group (points over baselines):
Polish +22.89
Vietnamese +6.88
Hindi family +6.61
German +6.47
Japanese +4.63
Dutch +4.26
Nordic languages +3.93
European languages +4.40
French +1.54
Ablation Studies
MLL + MEL Synergy
Using MLL alone yields depth‑adjustable models at a modest cost to shallow performance. Adding MEL compresses the embedding layer, allowing deeper models under the same parameter budget. A 4‑layer MLL + MEL model (~170 M parameters) outperforms a 1‑layer baseline by 15 points, and at equal performance the MLL + MEL model is three times smaller.
Robustness of MEL (SVD Compression)
Applying SVD truncation directly to a baseline model drops the score from 69.68 to 53.25, a catastrophic loss.
Training with only the decomposed form (no nesting objective) improves robustness but still degrades as rank decreases.
MEL‑trained models retain a high score of 64.30 even when reduced to rank 64, with a very gentle degradation curve.
The nesting objective forces critical information into the leading low‑rank dimensions, underpinning this robustness.
Data Comparison
A 0.6 B model trained on ML‑Embed data outperforms KaLM‑Embedding on nine of the 17 benchmarks, especially on code tasks. KaLM‑Embedding holds an advantage on Chinese, which reflects its data bias. On the remaining seven benchmarks (Korean, Polish, Dutch, Hindi, etc.) performance is comparable, indicating that broader language coverage does not sacrifice quality on mainstream languages.
Generalization to Other Architectures
Experiments on EuroBERT‑210M compare three settings:
EuroBERT baseline (210 M): average score 60.38
Structural pruning to 120 M + fine‑tune: average score 44.10
3D‑ML trained then pruned to 120 M: average score 56.77
The 3D‑ML‑pruned model exceeds direct pruning by 12.67 points and incurs only a 3.61‑point drop from the baseline, confirming the framework’s broad applicability.
Open‑Source Release
The ML‑Embed training code, model weights, dataset and paper are all publicly released.
Training code: https://github.com/codefuse-ai/CodeFuse-Embeddings
Model weights & dataset: https://huggingface.co/collections/codefuse-ai/codefuse-embeddings
Paper: https://arxiv.org/abs/2605.15081
