How Alibaba Tackles Low-Resource Language Data for Multilingual LLMs

In this interview, Alibaba International's senior data-science expert Li Haijun explains the challenges that low-resource languages pose for multilingual large models and details a five-part framework, covering data collection, augmentation, quality optimization, engineering, and evaluation, that powers the company's cross-border e-commerce AI applications.

DataFunSummit

Li Haijun, senior data‑science expert at Alibaba International, outlines a systematic solution to improve multilingual large‑model performance on low‑resource languages.

Data collection: integrate open‑source multilingual web data, cooperate with specialized data providers, filter parallel corpora (e.g., OPUS, CCAligned), and leverage business‑accumulated data.

Data augmentation: generate synthetic data through in-context learning, multilingual translation, and model distillation.

Quality optimization: build an automated processing pipeline that combines rules, small models, and large models for language identification, multi‑dimensional deduplication, and quality scoring.

Engineering architecture: design a five‑stage distributed pipeline (parsing → standardization → labeling → deduplication → construction) using MaxCompute and FaaS platforms to boost processing efficiency.

Evaluation linkage: introduce the TransBench three‑level evaluation framework (basic language, domain expertise, cultural adaptation) and close the “data‑R&D → training → evaluation” loop, with the Marco‑MOS model outperforming GPT‑4.

The approach has been applied to cross‑border e‑commerce translation and product understanding, and is shared via the OpenCompass platform to foster a low‑resource language ecosystem.

Interview Highlights

Data challenges for low‑resource languages

Li explains that while large models excel on high‑resource languages, they struggle with Southeast Asian minority languages due to data scarcity, immature annotation techniques, and high processing costs.

Effective data‑collection strategies

Alibaba uses massive open‑source multilingual web data, partners with specialized data companies, and filters parallel corpora with criteria such as special symbols, stop‑words, digit ratios, and LASER similarity scores to obtain high‑quality parallel sentences.
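The filtering criteria above can be sketched as a single predicate over a candidate sentence pair. This is a minimal illustration, not Alibaba's implementation: the thresholds, the function names, and the choice to require at least one stop-word per side are assumptions, and the sentence embeddings are passed in as plain vectors that would, in practice, come from a multilingual encoder such as LASER.

```python
import math
import re

# Hypothetical thresholds; the interview does not state the actual values.
MAX_SYMBOL_RATIO = 0.3
MAX_DIGIT_RATIO = 0.3
MIN_STOPWORD_HITS = 1
MIN_SIMILARITY = 0.8


def char_ratio(text, predicate):
    """Fraction of non-space characters for which predicate is true."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # treat empty text as failing every ratio check
    return sum(1 for c in chars if predicate(c)) / len(chars)


def is_symbol(c):
    """True for punctuation and other non-alphanumeric characters."""
    return not c.isalnum()


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keep_pair(src, tgt, src_vec, tgt_vec, src_stopwords, tgt_stopwords):
    """Return True if the sentence pair passes all heuristic checks."""
    for text, stopwords in ((src, src_stopwords), (tgt, tgt_stopwords)):
        if char_ratio(text, is_symbol) > MAX_SYMBOL_RATIO:
            return False
        if char_ratio(text, str.isdigit) > MAX_DIGIT_RATIO:
            return False
        # Assumption: natural sentences contain at least one stop-word.
        tokens = re.findall(r"\w+", text.lower())
        if sum(t in stopwords for t in tokens) < MIN_STOPWORD_HITS:
            return False
    # Assumption: the embeddings live in a shared multilingual space.
    return cosine(src_vec, tgt_vec) >= MIN_SIMILARITY
```

Whitespace tokenization is a simplification here; languages written without spaces between words, such as Thai, would need a proper segmenter before the stop-word check.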

Data synthesis and model distillation

For specific languages, the team uses synthetic data and model distillation; supervised fine-tuning (SFT) data is enriched through in-context learning and multilingual translation with state-of-the-art models.
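One way to picture in-context synthesis is a few-shot prompt assembled from seed examples and sent to a strong model. The sketch below is illustrative only: the prompt layout, the record fields, and the `generate` callable (a stand-in for whatever model client is used) are all hypothetical and not taken from the interview.

```python
def build_fewshot_prompt(task, target_language, examples, new_input):
    """Assemble a few-shot prompt from seed input/output pairs."""
    lines = [f"Task: {task}", f"Write the output in {target_language}.", ""]
    for ex in examples:
        lines.append(f"Input: {ex['input']}")
        lines.append(f"Output: {ex['output']}")
        lines.append("")
    lines.append(f"Input: {new_input}")
    lines.append("Output:")
    return "\n".join(lines)


def synthesize(seed_examples, new_inputs, task, target_language, generate):
    """Create synthetic SFT records.

    `generate` is a hypothetical callable that maps a prompt string to the
    model's completion string; plug in any model client here.
    """
    records = []
    for text in new_inputs:
        prompt = build_fewshot_prompt(task, target_language, seed_examples, text)
        output = generate(prompt).strip()
        if output:  # drop empty completions
            records.append({
                "input": text,
                "output": output,
                "language": target_language,
                "synthetic": True,  # mark provenance for later filtering
            })
    return records
```

Tagging each record as synthetic keeps provenance visible, so later quality filters can treat generated and collected data differently.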

Automated quality improvement

They adopt industry-wide practices from datasets like C4, RefinedWeb, SlimPajama, and FineWeb, extending labeling across granularities from individual characters up to whole collections, combining rules, small models, and large models for filtering, and applying multi-dimensional deduplication and language-identification scoring.
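As a rough illustration of deduplication along more than one dimension, the sketch below combines exact matching on normalized text with near-duplicate detection via Jaccard similarity over word trigrams. The threshold and function names are assumptions, and the pairwise comparison is quadratic; pipelines at web scale typically replace it with MinHash and locality-sensitive hashing.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, n=3):
    """Set of word n-grams; short texts fall back to one shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, near_threshold=0.8):
    """Keep the first occurrence of each exact or near-duplicate document."""
    seen_hashes = set()
    kept = []  # (document, shingle set) for every document kept so far
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, other) >= near_threshold for _, other in kept):
            continue  # near duplicate of an earlier document
        seen_hashes.add(digest)
        kept.append((doc, doc_shingles))
    return [doc for doc, _ in kept]
```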

Multimodal data alignment

A unified multimodal processing framework extracts and aligns features from text, images, video, and audio, using rule‑based, small‑model, and large‑model annotation, and cross‑modal embeddings to create a shared semantic space for tasks such as multilingual image translation and video content understanding.
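Once text and images are embedded in a shared semantic space, alignment reduces to nearest-neighbor search by cosine similarity. The toy sketch below assumes the embeddings already exist (in practice they would come from a cross-modal encoder); the identifiers, vectors, and `min_score` threshold are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(text_embeddings, image_embeddings, min_score=0.5):
    """Map each text id to its most similar image id, if similar enough."""
    matches = {}
    for text_id, text_vec in text_embeddings.items():
        best_id, best_score = None, -1.0
        for image_id, image_vec in image_embeddings.items():
            score = cosine(text_vec, image_vec)
            if score > best_score:
                best_id, best_score = image_id, score
        if best_id is not None and best_score >= min_score:
            matches[text_id] = best_id
    return matches


# Toy vectors standing in for encoder outputs (hypothetical values).
texts = {"red_dress_th": [0.9, 0.1, 0.0], "phone_case_vi": [0.0, 0.2, 0.9]}
images = {"img_dress": [1.0, 0.0, 0.1], "img_case": [0.1, 0.1, 1.0]}
print(align(texts, images))
```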

Distributed computing architecture

The five‑stage pipeline runs on MaxCompute for rule‑based operators and on an EGS FaaS cluster for GPU‑intensive deep‑learning operators, with unified scheduling across both platforms.
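The split between rule-based and GPU-intensive operators can be pictured as a routing decision made per operator while the stages run in a fixed order. The sketch below runs everything locally and only records which backend each operator would be sent to; the `Operator` class, the backend labels, and the example operators are hypothetical, and no MaxCompute or FaaS API is called.

```python
from dataclasses import dataclass
from typing import Callable

# Stage order as described in the article.
STAGES = ["parsing", "standardization", "labeling", "deduplication", "construction"]


@dataclass
class Operator:
    name: str
    stage: str
    needs_gpu: bool
    fn: Callable[[list], list]  # transforms a batch of records


def backend_for(op):
    """Route GPU-heavy operators to the FaaS cluster, the rest to MaxCompute."""
    return "faas_gpu" if op.needs_gpu else "maxcompute"


def run_pipeline(records, operators):
    """Apply operators stage by stage and log where each would be scheduled."""
    for op in operators:
        if op.stage not in STAGES:
            raise ValueError(f"unknown stage: {op.stage}")
    log = []
    for stage in STAGES:
        for op in operators:
            if op.stage == stage:
                records = op.fn(records)
                log.append((stage, op.name, backend_for(op)))
    return records, log


ops = [
    Operator("strip_whitespace", "parsing", False, lambda rs: [r.strip() for r in rs]),
    Operator("drop_empty", "labeling", True, lambda rs: [r for r in rs if r]),
]
result, schedule = run_pipeline([" product title ", "   "], ops)
```

Keeping the routing rule in one function means the scheduler can change where an operator runs without touching the operator itself.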

Data‑model‑evaluation feedback loop

Model iterations are guided by general and domain‑specific benchmarks; evaluation results drive targeted data collection for the next training round, ensuring alignment between data, model, and business goals.
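A simple way to express "evaluation results drive targeted data collection" is to compare per-language scores against targets and rank the shortfalls. In the sketch below, the level names mirror the three TransBench levels mentioned in the article, while the score scale, targets, and data layout are assumptions made for illustration.

```python
# Level names follow the article; the 0-1 score scale is an assumption.
LEVELS = ("basic_language", "domain_expertise", "cultural_adaptation")


def find_gaps(scores, targets):
    """Rank (language, level) pairs by how far they fall below target.

    scores:  {(language, level): score}
    targets: {level: minimum acceptable score}
    """
    gaps = []
    for (language, level), score in scores.items():
        shortfall = targets[level] - score
        if shortfall > 0:
            gaps.append((language, level, round(shortfall, 4)))
    # Largest shortfall first: these get data-collection priority.
    gaps.sort(key=lambda g: g[2], reverse=True)
    return gaps


scores = {
    ("th", "basic_language"): 0.9,
    ("th", "cultural_adaptation"): 0.5,
    ("vi", "domain_expertise"): 0.6,
}
targets = {level: 0.7 for level in LEVELS}
for language, level, shortfall in find_gaps(scores, targets):
    print(f"collect more {level} data for {language} (short by {shortfall})")
```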

Open collaboration and future challenges

Alibaba is collaborating with institutions like Shanghai AI Lab to co‑build low‑resource language data platforms and anticipates future challenges such as continuous data collection, online/near‑online learning, and automated business‑centric evaluation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
