How Alibaba Tackles Low-Resource Language Data for Multilingual LLMs

A senior data science expert at Alibaba International explains a systematic five‑strategy solution (data acquisition, augmentation, quality optimization, engineering pipeline, and evaluation loop) that addresses data scarcity, high annotation cost, and processing challenges for low‑resource languages in multilingual large language models.

DataFunTalk

Industry Background and Technical Challenges

General‑purpose large models perform well on high‑resource languages but struggle with low‑resource languages such as Southeast Asian minority languages due to data scarcity, immature annotation techniques, and high processing costs.

Five‑Strategy Solution

1. Data Acquisition

Integrate open‑source multilingual web data, collaborate with specialized data providers, filter parallel corpora (e.g., OPUS, CCAligned), and leverage business‑generated data.

2. Data Augmentation

Generate synthetic data through in‑context learning, multilingual translation, and model distillation.

3. Quality Optimization

Build an automated pipeline that combines rule‑based methods, small models, and large models for language identification, multi‑dimensional deduplication, and quality scoring.

4. Engineering Architecture

Design a five‑stage distributed pipeline (parse → standardize → label → deduplicate → construct) using MaxCompute and FaaS platforms to improve processing efficiency.

5. Evaluation Loop

Introduce the TransBench three‑level evaluation framework (basic language, domain expertise, cultural adaptation) and the Marco‑MOS model, which outperforms GPT‑4 in multilingual e‑commerce translation assessment.

Technical Solutions and Data Engineering

Data collection relies on massive open‑source multilingual web data and partnerships with data companies. Parallel corpora are filtered on special‑symbol ratios, stop‑word ratios, digit ratios, and LASER similarity scores. Synthetic data and model distillation, produced with in‑context learning and state‑of‑the‑art multilingual models, feed continued pre‑training (CT), supervised fine‑tuning (SFT), and direct preference optimization (DPO).
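The parallel‑corpus filters described above can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the thresholds, function names, and stop‑word sets are all hypothetical, and the sentence embeddings are assumed to have been computed beforehand by an encoder such as LASER and passed in as plain lists of floats.

```python
import math
import string

# Hypothetical thresholds; the article does not give concrete values.
MAX_SYMBOL_RATIO = 0.3
MAX_DIGIT_RATIO = 0.3
MIN_STOPWORD_RATIO = 0.05
MIN_SIMILARITY = 0.7


def symbol_ratio(text: str) -> float:
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 1.0
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return special / len(text)


def digit_ratio(text: str) -> float:
    """Share of characters that are digits."""
    if not text:
        return 1.0
    return sum(ch.isdigit() for ch in text) / len(text)


def stopword_ratio(text: str, stopwords: set[str]) -> float:
    """Share of whitespace-separated tokens found in the stop-word set."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(tok.strip(string.punctuation) in stopwords for tok in tokens)
    return hits / len(tokens)


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two precomputed sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keep_pair(src, tgt, src_vec, tgt_vec, src_stop, tgt_stop) -> bool:
    """Keep a sentence pair only if both sides look like natural text
    and their embeddings are close enough."""
    for text, stop in ((src, src_stop), (tgt, tgt_stop)):
        if symbol_ratio(text) > MAX_SYMBOL_RATIO:
            return False
        if digit_ratio(text) > MAX_DIGIT_RATIO:
            return False
        if stopword_ratio(text, stop) < MIN_STOPWORD_RATIO:
            return False
    return cosine(src_vec, tgt_vec) >= MIN_SIMILARITY
```

A very low stop‑word ratio usually signals a keyword list or boilerplate rather than a sentence, which is why it is treated as a lower bound here while the symbol and digit ratios are upper bounds.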

Automated cleaning combines rule engines with AI quality checks, drawing on practices from Common Crawl‑based datasets (C4, RefinedWeb, SlimPajama, FineWeb). Tagging operates at multiple granularities, from individual characters up to whole collections, and integrates rules, small models, and large models. Multi‑dimensional deduplication removes duplicates at the document level, at the dataset level, and across web data.
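As a rough illustration of document‑level deduplication, the sketch below combines exact matching on a normalized hash with near‑duplicate detection via character‑shingle Jaccard similarity. The article does not say which algorithms are used, so this is a stand‑in under stated assumptions; the function names and the 0.8 threshold are hypothetical.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-grams; usable without a language-specific tokenizer."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Drop exact duplicates (after normalization) and near-duplicates."""
    seen_hashes: set[str] = set()
    kept: list[tuple[str, set[str]]] = []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        sh = shingles(norm)
        if any(jaccard(sh, other) >= threshold for _, other in kept):
            continue  # near duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append((doc, sh))
    return [doc for doc, _ in kept]
```

The pairwise comparison is quadratic in the number of kept documents, so it only suits small samples. At web scale one would typically replace it with an approximate scheme such as MinHash with locality‑sensitive hashing.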

Language identification uses optimized models to raise accuracy, which in turn improves downstream filtering. Automated scoring then checks each text for grammar, semantics, punctuation, and completeness.
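The cheapest tier of such a scoring pipeline is rule‑based. The sketch below shows what that tier might look like; it is not the model‑based language identifier or scorer the article refers to. The check list, the 0.8 script threshold, and all function names are assumptions made for illustration.

```python
import unicodedata


def script_ratio(text: str, script: str) -> float:
    """Share of alphabetic characters whose Unicode name starts with `script`
    (for example "THAI" or "LATIN"). A crude proxy for language identity."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(unicodedata.name(ch, "").startswith(script) for ch in letters)
    return hits / len(letters)


def brackets_balanced(text: str) -> bool:
    """Check that (), [] and {} are properly nested and closed."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in pairs.values():
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def quality_score(text: str, script: str, min_chars: int = 20) -> float:
    """Fraction of rule checks passed, between 0.0 and 1.0."""
    checks = [
        script_ratio(text, script) >= 0.8,          # script consistency
        len(text.strip()) >= min_chars,             # long enough to be useful
        brackets_balanced(text),                    # punctuation sanity
        not text.rstrip().endswith(("...", "…")),   # likely truncated
    ]
    return sum(checks) / len(checks)
```

Script ratios cannot separate languages that share a writing system, which is one reason a trained language‑identification model is still needed after rules like these.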

Distributed Computing Architecture

A five‑stage pipeline processes 10 trillion tokens. Simple rule‑based operators run as UDFs on MaxCompute, while GPU‑intensive deep‑learning operators run on an EGS FaaS cluster; MaxCompute scheduling unifies the two.
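The stage order (parse → standardize → label → deduplicate → construct) can be illustrated with a small in‑process sketch. It runs locally on generators and does not use MaxCompute or any FaaS API; the `Record` type and the body of each stage are hypothetical placeholders for the real operators.

```python
import html
import re
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class Record:
    text: str
    labels: dict = field(default_factory=dict)


def parse(raw_docs: Iterable[str]) -> Iterator[Record]:
    """Strip HTML tags and decode entities."""
    for raw in raw_docs:
        text = re.sub(r"<[^>]+>", " ", raw)
        yield Record(html.unescape(text))


def standardize(records: Iterable[Record]) -> Iterator[Record]:
    """Collapse runs of whitespace."""
    for rec in records:
        rec.text = " ".join(rec.text.split())
        yield rec


def label(records: Iterable[Record]) -> Iterator[Record]:
    """Attach labels; a real system would add language and quality tags."""
    for rec in records:
        rec.labels["n_chars"] = len(rec.text)
        yield rec


def deduplicate(records: Iterable[Record]) -> Iterator[Record]:
    """Drop records whose text has already been seen."""
    seen: set[str] = set()
    for rec in records:
        if rec.text not in seen:
            seen.add(rec.text)
            yield rec


def construct(records: Iterable[Record]) -> list[dict]:
    """Build the final dataset rows, skipping empty texts."""
    return [{"text": r.text, **r.labels} for r in records if r.labels["n_chars"] > 0]


def run_pipeline(raw_docs: Iterable[str]) -> list[dict]:
    stream = raw_docs
    for stage in (parse, standardize, label, deduplicate):
        stream = stage(stream)
    return construct(stream)
```

Keeping each stage as an independent function over a stream of records mirrors the operator model in the text, where a stage can be moved to a different backend without changing its neighbors.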

Data‑Model Iteration Loop

Model releases undergo benchmark evaluation (general and domain‑specific) to guide targeted data collection for subsequent training, forming a closed “data development → model training → model evaluation” loop.
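One simple way to turn benchmark results into a collection plan is to rank the slices that fall furthest below a target score. The sketch below assumes scores keyed by (language, domain); the function name, the target, and the sample numbers are invented for illustration and do not come from the article.

```python
def collection_priorities(
    scores: dict[tuple[str, str], float],
    target: float,
    top_k: int = 3,
) -> list[tuple[str, str]]:
    """Return the (language, domain) slices with the largest gap below target,
    worst first. Slices at or above the target are ignored."""
    gaps = {key: target - score for key, score in scores.items() if score < target}
    return sorted(gaps, key=lambda key: gaps[key], reverse=True)[:top_k]


# Made-up benchmark scores for demonstration only.
example_scores = {
    ("th", "general"): 0.82,
    ("km", "ecommerce"): 0.41,
    ("lo", "general"): 0.55,
    ("en", "general"): 0.93,
}
priorities = collection_priorities(example_scores, target=0.8)
```

Here `priorities` lists the Khmer e‑commerce slice first and the Lao general slice second, so the next data‑collection round would focus there before retraining and re‑evaluating.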

Evaluation Framework

TransBench evaluates multilingual translation models across basic ability, domain expertise (using the Marco‑MOS model trained on curated e‑commerce data), and cultural adaptation (handling taboo words and honorifics). The framework is publicly available on the OpenCompass platform.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
