ML-Embed’s 3D‑ML Framework Breaks the Three Barriers of Multilingual Embeddings

The paper presents ML-Embed, a 3D‑ML framework that tackles the high computational cost, language‑coverage imbalance, and research opacity of multilingual text‑embedding models by introducing MEL, MLL, and MRL techniques, releasing a 50 M‑sample dataset covering 282 languages, and achieving SOTA on nine MTEB benchmarks while remaining fully open‑source.

PaperAgent

Jun 15, 2026

Problem Statement

Multilingual text embedding is meant to serve more than 7,000 languages, yet speakers of Polish, Vietnamese, Persian, Hindi and many other languages reportedly receive less than one‑tenth of the performance available to English speakers. Three structural barriers cause this gap:

Computational cost: Training embedding models on top of large LLMs such as Qwen‑3 means handling models with up to hundreds of billions of parameters, which is beyond the reach of most research groups.

Language coverage imbalance: As of February 2026 only one model has a complete MTEB evaluation for Polish, while the English and multilingual leaderboards each contain nearly 150 models.

Research opacity: Top‑tier models are either closed‑source APIs or released as weights without training details, which hinders reproducibility.

3D‑ML (3‑Dimensional Matryoshka Learning)

ML‑Embed introduces a unified training framework that extends the Matryoshka (nesting‑doll) principle to three orthogonal dimensions, enabling end‑to‑end compression across training, inference and storage.

Dimension 1 – MEL (Matryoshka Embedding Learning)

The embedding layer of a Qwen‑3‑0.6B‑based model consumes one‑quarter of total parameters. MEL performs an SVD on the original embedding matrix, truncates it into two low‑rank factors, and updates only these small matrices during training. During each forward pass MEL dynamically samples a sub‑rank, forcing the model to pack the most critical information into the innermost dimensions.

Deployment options:

Compatibility mode: Multiply the two factors to reconstruct the full matrix; no code changes are required.

Efficiency mode: Deploy the low‑rank factors directly, drastically reducing memory footprint for edge or resource‑constrained scenarios.

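Compared with LoRA, whose adapters are typically merged back into full‑size weights and therefore shrink only the trainable parameters, MEL reduces both trainable and inference‑time parameters, achieving true end‑to‑end compression.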
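The two deployment modes can be illustrated with a toy example. The sketch below is not the authors' code: it assumes the factors `A` (vocabulary × rank) and `B` (rank × hidden) already exist and fills them with random numbers instead of running an SVD. The helper names, the uniform sub‑rank sampling, and the illustrative sizes (150,000 tokens, hidden size 1,024) are assumptions that do not come from the paper.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def embed_from_factors(token_id, A, B, rank=None):
    """Efficiency mode: build one token embedding directly from the factors.

    A smaller `rank` keeps only the leading components, which mimics the
    sub-rank that MEL is described as sampling during training.
    """
    k = len(B) if rank is None else rank
    coeffs = A[token_id][:k]
    return [sum(coeffs[i] * B[i][j] for i in range(k)) for j in range(len(B[0]))]

def param_counts(vocab, hidden, rank):
    """Return (full-matrix parameters, low-rank parameters)."""
    return vocab * hidden, vocab * rank + rank * hidden

random.seed(0)
V, R, D = 6, 3, 4  # toy vocabulary size, rank, hidden size
A = [[random.uniform(-1, 1) for _ in range(R)] for _ in range(V)]
B = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(R)]

# Compatibility mode: rebuild the full V x D matrix, then index it as usual.
full_matrix = matmul(A, B)
compat_vec = full_matrix[2]

# Efficiency mode: keep only A and B and multiply at lookup time.
efficient_vec = embed_from_factors(2, A, B)

# Sub-rank sampling for one training step (uniform sampling is an assumption).
sub_rank = random.choice(range(1, R + 1))
sub_vec = embed_from_factors(2, A, B, rank=sub_rank)

# Illustrative sizes, not taken from the paper.
full_params, low_rank_params = param_counts(vocab=150_000, hidden=1_024, rank=64)
print(full_params, low_rank_params)
```

Both modes return the same vector for a given token; with the illustrative sizes above, the rank‑64 factors hold roughly 16 times fewer parameters than the full matrix.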

Dimension 2 – MLL (Matryoshka Layer Learning)

MLL applies the training loss to several intermediate layers at once, so that shallow sub‑models can perform the embedding task on their own. At inference, changing the num_hidden_layers setting yields models of different depths without retraining or pruning: one model, N depths.

A logarithmic set of layer counts (e.g., {1, 2, 4, 8, 16, 32}) ensures coverage from shallow to deep, with each layer’s output normalized to maintain representation consistency.
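A small sketch of the layer schedule and per‑depth normalization described above. It is an illustration under assumptions, not the paper's implementation: the toy layers stand in for transformer blocks, the function names are hypothetical, and always including the final layer in the supervised set is an assumption.

```python
import math

def log_spaced_depths(num_layers):
    """Powers of two up to num_layers, plus the full depth if it is not one."""
    depths, d = [], 1
    while d <= num_layers:
        depths.append(d)
        d *= 2
    if depths[-1] != num_layers:
        depths.append(num_layers)  # assumption: always supervise the last layer
    return depths

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def embeddings_at_depths(x, layers, depths):
    """Run the stack once and collect a normalized embedding at each chosen depth."""
    outputs, h = {}, list(x)
    for index, layer in enumerate(layers, start=1):
        h = layer(h)
        if index in depths:
            outputs[index] = l2_normalize(h)
    return outputs

def make_toy_layer(shift):
    """Stand-in for a transformer block (purely illustrative)."""
    return lambda h: [0.5 * v + shift for v in h]

layers = [make_toy_layer(i + 1) for i in range(8)]
depths = log_spaced_depths(len(layers))  # [1, 2, 4, 8]
per_depth = embeddings_at_depths([1.0, 2.0, 3.0], layers, depths)

# Inference with a shallower model: keep only the first 4 blocks.
shallow = embeddings_at_depths([1.0, 2.0, 3.0], layers[:4], {4})[4]
```

During training each entry of `per_depth` would receive its own loss; at inference, truncating the stack to four blocks reproduces the depth‑4 embedding exactly, which is the effect that lowering num_hidden_layers is described as having.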

Dimension 3 – MRL (Matryoshka Representation Learning)

Inspired by Matryoshka Representation Learning (NeurIPS 2022), MRL jointly optimizes prefixes of different vector lengths during training so that truncated, shorter vectors remain effective. Within 3D‑ML, MRL is tightly integrated with MLL: the output of each MLL‑supervised layer receives multiple MRL contrastive losses.
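The idea of scoring several vector prefixes with a contrastive loss can be sketched as follows. This is a minimal illustration, assuming an in‑batch InfoNCE loss, equal weighting of the prefix lengths, and a temperature of 0.05; none of these choices is confirmed by the article, and the function names are hypothetical.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix)) or 1.0
    return [x / norm for x in prefix]

def info_nce(queries, docs, temperature=0.05):
    """In-batch contrastive loss: queries[i] should match docs[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in docs]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)

def mrl_loss(queries, docs, dims):
    """Average the contrastive loss over several prefix lengths (equal weights assumed)."""
    losses = []
    for dim in dims:
        q = [truncate_and_normalize(v, dim) for v in queries]
        d = [truncate_and_normalize(v, dim) for v in docs]
        losses.append(info_nce(q, d))
    return sum(losses) / len(losses)

queries = [[1.0, 0.1, 0.3, 0.0], [0.1, 1.0, 0.0, 0.3]]
docs = [[0.9, 0.2, 0.2, 0.1], [0.2, 0.9, 0.1, 0.2]]

aligned = mrl_loss(queries, docs, dims=[2, 4])
shuffled = mrl_loss(queries, docs[::-1], dims=[2, 4])  # wrong pairing
print(aligned < shuffled)
```

In the 3D‑ML setting described above, a loss of this kind would be computed on the normalized output of every supervised layer, not only the last one.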

Unified Loss Function

The three components are optimized jointly in a single objective, with one shared representation function producing the embedding at every supervised layer and every vector length.
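The article does not reproduce the formula itself. One plausible way to write such a joint objective is given below, as a sketch only: the symbols, the uniform weighting, and the expectation over the sampled sub‑rank are assumptions rather than the paper's notation.

```latex
\mathcal{L}_{\text{3D-ML}}
  = \mathbb{E}_{r \sim \mathcal{R}}
    \left[
      \frac{1}{|\mathcal{S}|\,|\mathcal{D}|}
      \sum_{l \in \mathcal{S}} \sum_{d \in \mathcal{D}}
      \mathcal{L}_{\text{con}}\bigl(f_{l,d}(\,\cdot\,; r)\bigr)
    \right]
```

Here $r$ is the embedding sub‑rank sampled by MEL, $\mathcal{S}$ is the set of layers supervised by MLL, $\mathcal{D}$ is the set of prefix lengths used by MRL, $f_{l,d}$ is the normalized first $d$ dimensions of the output of layer $l$, and $\mathcal{L}_{\text{con}}$ is a contrastive loss.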

Data Collection

ML‑Embed assembles a multilingual dataset from 121 public sources, totaling 50 M training samples that cover 282 natural languages (ISO‑639‑3) and over 40 programming languages. By contrast, the KaLM‑Embedding dataset is dominated by English (49.4 %) and Chinese (44.4 %).

Training Strategy

Stage 1: Warm up on ~27 M large‑scale retrieval samples to build basic semantic understanding.

Stage 2: Fine‑tune on a mixture of ~8.3 M examples, injecting task instructions to improve multi‑task adaptability.

The total training data for ML‑Embed is only about one‑fifth of that used by comparable SOTA models.
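The two‑stage schedule and the instruction injection of Stage 2 might be expressed roughly as follows. The configuration keys, the prompt template, and the choice to leave Stage 1 inputs without instructions are all hypothetical; the article only gives the approximate sample counts.

```python
# Approximate sample counts are from the article; everything else is assumed.
STAGES = [
    {"name": "stage1_warmup", "approx_samples": 27_000_000, "use_instructions": False},
    {"name": "stage2_finetune", "approx_samples": 8_300_000, "use_instructions": True},
]

def format_query(query, instruction=None):
    """Prepend a task instruction when one is given (hypothetical template)."""
    if not instruction:
        return query
    return f"Instruct: {instruction}\nQuery: {query}"

def build_input(stage, query, instruction):
    """Apply the instruction only in stages that are configured to use it."""
    return format_query(query, instruction if stage["use_instructions"] else None)

total_samples = sum(stage["approx_samples"] for stage in STAGES)

task = "Retrieve passages that answer the question"
print(build_input(STAGES[0], "What is Matryoshka learning?", task))
print(build_input(STAGES[1], "What is Matryoshka learning?", task))
print(total_samples)
```

Summing the two stages gives roughly 35.3 M training samples.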

Experimental Results

Across 17 MTEB benchmarks (430 tasks), the 8 B‑parameter ML‑Embed model sets a new SOTA on nine benchmarks and reaches the Top 5 on the English and multilingual leaderboards. Performance scales clearly from 140 M to 8 B parameters.

Gains by language or language group (improvement over baselines):

Polish +22.89

Vietnamese +6.88

Hindi family +6.61

German +6.47

Japanese +4.63

Dutch +4.26

Nordic languages +3.93

European languages +4.40

French +1.54

Ablation Studies

MLL + MEL Synergy

Used alone, MLL yields depth‑adjustable models at a modest cost to shallow performance. Adding MEL compresses the embedding layer, which allows deeper models under the same parameter budget. A 4‑layer MLL + MEL model (~170 M parameters) outperforms a 1‑layer baseline by 15 points, and at equal performance the MLL + MEL model is about three times smaller.

Robustness of MEL (SVD Compression)

Applying SVD truncation directly to a baseline model drops performance from 69.68 to 53.25, a catastrophic loss.

Training with only the decomposed form (no nesting objective) improves robustness but still degrades as rank decreases.

MEL‑trained models retain a high score of 64.30 even when reduced to rank 64, with a very gentle degradation curve.

The nesting objective forces critical information into the leading low‑rank dimensions, underpinning this robustness.

Data Comparison

A 0.6 B model trained on ML‑Embed data outperforms KaLM‑Embedding on nine of the 17 benchmarks, most clearly on code tasks. KaLM‑Embedding has the advantage on Chinese, which reflects the bias of its data. On the remaining seven benchmarks (Korean, Polish, Dutch, Hindi, etc.) performance is comparable, which suggests that broader language coverage does not come at the expense of quality on mainstream languages.

Generalization to Other Architectures

Experiments on EuroBERT‑210M compare three settings:

EuroBERT baseline (210 M): average score 60.38

Structural pruning to 120 M + fine‑tune: average score 44.10

3D‑ML trained then pruned to 120 M: average score 56.77

The 3D‑ML‑pruned model exceeds direct pruning by 12.67 points and incurs only a 3.61‑point drop from the baseline, confirming the framework’s broad applicability.
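The deltas quoted above follow directly from the three reported averages. The short check below recomputes them and adds the fraction of the baseline score retained; the dictionary keys are illustrative labels, not names from the paper.

```python
# Average scores as reported in the article for the EuroBERT experiments.
scores = {
    "baseline_210m": 60.38,
    "pruned_120m_finetuned": 44.10,
    "3dml_pruned_120m": 56.77,
}

gain_over_pruning = round(scores["3dml_pruned_120m"] - scores["pruned_120m_finetuned"], 2)
drop_from_baseline = round(scores["baseline_210m"] - scores["3dml_pruned_120m"], 2)
score_retained = scores["3dml_pruned_120m"] / scores["baseline_210m"]
params_retained = 120 / 210

print(gain_over_pruning, drop_from_baseline)
print(f"{score_retained:.1%} of the score with {params_retained:.1%} of the parameters")
```

Put differently, the 3D‑ML‑pruned model keeps about 94% of the baseline score with about 57% of the parameters.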

Open‑Source Release

ML‑Embed provides fully open‑source training code, model weights, dataset and the paper.

Training code: https://github.com/codefuse-ai/CodeFuse-Embeddings
Model weights & dataset: https://huggingface.co/collections/codefuse-ai/codefuse-embeddings
Paper: https://arxiv.org/abs/2605.15081
