How Alibaba’s GTE‑Multilingual Models Boost RAG with Long‑Doc and Multi‑Language Support
Alibaba's Tongyi Lab introduces the GTE‑Multilingual series, high‑performance encoder‑only models that support 8k‑token texts, 75 languages, elastic and sparse embeddings, and demonstrate superior retrieval‑augmented generation performance across multilingual and long‑document benchmarks.
Background
Retrieval‑Augmented Generation (RAG) combines retrieval and generation to let large models answer queries using external knowledge bases, improving accuracy, reducing hallucinations, and enhancing real‑time response while addressing privacy concerns.
RAG relies on two key modules: a text representation (embedding) model that encodes documents into vectors for efficient similarity search, and a reranker model that scores document‑query pairs for finer ranking, typically applied to a small candidate set.
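To make the division of labor concrete, here is a minimal sketch of the retrieve-then-rerank flow; embed_fn and rerank_fn are hypothetical stand-ins for an embedding model and a reranker, not APIs from the article:

import numpy as np

def retrieve_then_rerank(query, documents, embed_fn, rerank_fn, top_k=10, final_k=3):
    # Stage 1: dense retrieval -- embed everything and score by cosine similarity.
    doc_vecs = np.stack([embed_fn(d) for d in documents])
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q_vec = embed_fn(query)
    q_vec = q_vec / np.linalg.norm(q_vec)
    sims = doc_vecs @ q_vec
    candidates = np.argsort(-sims)[:top_k]
    # Stage 2: reranking -- the slower but more precise model scores only the candidates.
    scored = [(i, rerank_fn(query, documents[i])) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [documents[i] for i, _ in scored[:final_k]]

The embedding model scores every document cheaply via vector search, while the reranker only ever sees the short candidate list.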
As RAG evolves, demands for multilingual retrieval, cross‑language search, and long‑document handling have grown, prompting the development of more capable models.
Model Construction
The GTE‑Multilingual (mGTE) series builds on a new encoder‑only base model supporting long contexts and 75 languages. Key features include:
High performance: outperforms open-source models of the same size on multiple benchmarks.
Long-document support: handles up to 8k tokens, extensible further via NTK-aware RoPE scaling.
Multilingual support: covers 75 languages.
Elastic embedding: outputs vectors of 128 to 768 dimensions, balancing storage and performance.
Sparse embedding: provides word-level weights for precise matching.
Base Model Pre‑training
The encoder-only base model was pre-trained on 1,028B tokens from public corpora (C4, mC4, Wikipedia, books, etc.) with a multilingual tokenizer. Training employed masked language modeling, dynamic batch sizing, unpadding, and BF16 precision.
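As a reminder of what the masked-language-modeling objective looks like at the data level, here is a minimal sketch of standard BERT-style input corruption; the 15% masking rate and the 80/10/10 replacement split are common defaults assumed for illustration, not figures from the article:

import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    # BERT-style corruption: select mask_prob of positions as prediction targets, then
    # replace 80% of them with [MASK], 10% with a random token, and keep 10% unchanged.
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 is the usual "ignore this position" label
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)
            # else: leave the original token in place
    return inputs, labels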
Training Enhancements
Position encoding switched to RoPE for longer contexts.
The FFN was replaced with a gated linear unit (GLU) activation.
The vocabulary is taken from XLM-RoBERTa.
Data sampling ensured each batch contained a single language, with sampling probabilities proportional to per-language token counts: p_i = n_i / \sum_j n_j (see the sketch after this list).
Multi-stage pre-training first used 2k-token sequences, then extended to 8k tokens while increasing the RoPE base to 160,000.
Unpadding avoided computation on padded tokens, and dynamic batch sizes grouped documents by length, using gradient checkpointing to boost efficiency.
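A minimal sketch of the per-language sampling rule p_i = n_i / \sum_j n_j and of drawing a single-language batch; the token counts and corpora below are made-up placeholders:

import random

def language_sampling_probs(token_counts):
    # p_i = n_i / sum_j n_j: sample each language in proportion to its token count.
    total = sum(token_counts.values())
    return {lang: count / total for lang, count in token_counts.items()}

def sample_single_language_batch(corpora, token_counts, batch_size):
    # Every batch is drawn from exactly one language.
    probs = language_sampling_probs(token_counts)
    langs = list(probs)
    lang = random.choices(langs, weights=[probs[l] for l in langs], k=1)[0]
    docs = random.sample(corpora[lang], k=min(batch_size, len(corpora[lang])))
    return lang, docs

# Made-up token counts and corpora, purely for illustration:
counts = {"en": 500_000, "zh": 200_000, "fr": 100_000}
corpora = {lang: [f"{lang}-doc-{i}" for i in range(100)] for lang in counts}
print(language_sampling_probs(counts))  # {'en': 0.625, 'zh': 0.25, 'fr': 0.125}
print(sample_single_language_batch(corpora, counts, batch_size=4))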
Embedding Model
Training follows a two-stage paradigm: weak supervision on roughly 2.8B multilingual text pairs (title–body, question–answer), followed by supervised fine-tuning on high-quality annotated datasets (about 2M Chinese pairs and 1.4M English pairs, plus multilingual sets).
Two additional representation traits are introduced:
Elastic dimension: the model outputs vectors of variable dimensions (e.g., 128 to 768) to trade off storage and accuracy.
Sparse vectors: a linear layer on the final-layer token representations produces word-level weights; similarity is computed as a weighted token overlap, which helps exact-match scenarios (see the sketch below).
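A rough illustration of the weighted token-overlap scoring described above; the per-token weights are made up for the example:

def sparse_similarity(query_weights, doc_weights):
    # Weighted token overlap: sum of query_weight * doc_weight over shared tokens.
    shared = set(query_weights) & set(doc_weights)
    return sum(query_weights[t] * doc_weights[t] for t in shared)

# Hypothetical per-token weights, as a sparse head might emit them:
q = {"quick": 1.2, "sort": 1.5, "python": 0.9}
d = {"quick": 1.0, "sort": 1.3, "algorithm": 0.7}
print(sparse_similarity(q, d))  # 1.2*1.0 + 1.5*1.3 = 3.15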
The overall loss combines dense (MRL) and sparse components.
Loss = \alpha \cdot L_{MRL} + (1 - \alpha) \cdot L_{Sparse}
Hard negatives are mined with the weakly supervised model and combined with in-batch negatives, while dynamic batch sizes further speed up training.
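A minimal sketch of how the combined objective could be wired up, assuming an in-batch InfoNCE-style contrastive loss for both terms; the truncation dimensions, temperature, and alpha below are illustrative choices, not values given in the article:

import torch
import torch.nn.functional as F

def info_nce(scores, temperature=0.05):
    # In-batch contrastive loss: the i-th query should match the i-th document.
    targets = torch.arange(scores.size(0))
    return F.cross_entropy(scores / temperature, targets)

def mrl_loss(q, d, dims=(128, 256, 512, 768)):
    # Matryoshka-style (elastic) loss: apply the contrastive loss at several truncated
    # dimensions so that prefixes of the embedding remain useful on their own.
    losses = []
    for k in dims:
        qk = F.normalize(q[:, :k], dim=-1)
        dk = F.normalize(d[:, :k], dim=-1)
        losses.append(info_nce(qk @ dk.T))
    return torch.stack(losses).mean()

def total_loss(q_dense, d_dense, sparse_scores, alpha=0.5):
    # Loss = alpha * L_MRL + (1 - alpha) * L_Sparse
    # sparse_scores[i, j] is the weighted token-overlap score between query i and doc j.
    return alpha * mrl_loss(q_dense, d_dense) + (1 - alpha) * info_nce(sparse_scores)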
Ranking Model
The reranker is trained with a contrastive loss using only supervised data, since weak supervision offers limited gains for ranking. It takes query–document text pairs as input and shares hyper-parameters with the embedding model.
Evaluation
Both embedding and reranker models were evaluated on multilingual and long‑document benchmarks (XTREME‑R, GLUE, MLDR, MIRACL, MKQA, BEIR, LoCo, MTEB). mGTE consistently outperformed same‑size open‑source models and approached larger LLM‑based systems, especially in long‑document and multilingual scenarios.
Elastic embeddings showed minimal performance loss when dimensions were reduced, as long as at least 512 dimensions were kept, similar to OpenAI's embedding models with reducible output dimensions.
Model Usage
Example code for the embedding model (requires transformers ≥ 4.36.0):
# Requires transformers>=4.36.0
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",
    "快排算法介绍"
]
model_path = 'Alibaba-NLP/gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
# Tokenize the input texts; the model accepts sequences up to 8192 tokens
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
dimension = 768 # elastic output dimension, any value in [128, 768]
# Take the [CLS] representation and keep only the first `dimension` components
embeddings = outputs.last_hidden_state[:, 0][:, :dimension]
embeddings = F.normalize(embeddings, p=2, dim=1)
# Similarity of the first text (the query) against the remaining texts
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
Example code for the reranker model:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-multilingual-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('Alibaba-NLP/gte-multilingual-reranker-base', trust_remote_code=True)
model.eval()
pairs = [["中国的首都在哪儿", "北京"], ["what is the capital of China?", "北京"], ["how to implement quick sort in python?", "Introduction of quick sort"]]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
print(scores)
Conclusion
The GTE‑Multilingual series provides open‑source, encoder‑only models that excel in multilingual, long‑document retrieval and ranking tasks while remaining inference‑efficient. Available on ModelScope and HuggingFace, these models support RAG pipelines and offer elastic and sparse embedding options for diverse applications.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.