Application of Large-Scale Pretrained Models in Alibaba Machine Translation
This article reviews how large‑scale pretrained language models have reshaped NLP, outlines the challenges of applying them to machine translation, introduces the APT framework and the GRET architecture for better encoder‑decoder integration, and reports experimental gains and future research directions.
The rapid rise of large‑scale pretrained models such as ELMo, GPT, BERT, and ERNIE has dramatically advanced many NLP sub‑fields, yet their impact on machine translation (MT) and natural language generation (NLG) remains limited, largely because of exposure bias, that is, the mismatch between how a model is trained and how it is used at inference.
To bridge this gap, the authors propose the APT (Adaptive Pretraining‑Translation) framework, which dynamically fuses pretrained embeddings with MT embeddings on the encoder side and employs knowledge distillation on the decoder side to align the probability distributions of the two models.
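The decoder‑side alignment of probability distributions can be illustrated with a small token‑level distillation loss. This is a minimal sketch under stated assumptions, not the APT implementation: the function names, the mixing weight alpha, and the temperature are hypothetical, and the pretrained model is treated simply as a teacher that emits logits over the same vocabulary as the MT model.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, gold_index,
                      alpha=0.5, temperature=1.0):
    """Mix a teacher-matching term with the usual translation loss.

    alpha and temperature are illustrative hyperparameters (assumptions).
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kd_term = kl_divergence(teacher, student)
    # Standard negative log-likelihood of the reference token.
    nll_term = -math.log(softmax(student_logits)[gold_index])
    return alpha * kd_term + (1.0 - alpha) * nll_term


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [1.0, 1.0, 0.0]
    print(distillation_loss(teacher, student, gold_index=0))
```

The KL term is zero when the student already reproduces the teacher's distribution, so with alpha set to 1 the loss only penalizes disagreement between the two models.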
The framework includes two key mechanisms: (1) a hierarchical attention‑based dynamic fusion that adjusts the proportion of pretrained and MT embeddings across layers, and (2) a knowledge‑distillation component that transfers token‑level and sentence‑level knowledge from the pretrained model to the MT model.
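The first mechanism can be pictured as two steps per token: attend over the pretrained model's layers, then gate the result against the MT encoder state. The sketch below is one possible reading under stated assumptions, not the authors' exact formulation: the scaled dot‑product layer scoring, the scalar sigmoid gate, and the names fuse, gate_weights, and gate_bias are all hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into attention weights."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse(mt_state, pretrained_layers, gate_weights, gate_bias=0.0):
    """Fuse one MT encoder state with per-layer pretrained states of the same token.

    mt_state:          list of floats (MT encoder representation)
    pretrained_layers: list of vectors, one per pretrained layer
    gate_weights:      vector of length 2 * len(mt_state) (assumed learned)
    """
    dim = len(mt_state)
    # Step 1: layer attention. Score each pretrained layer against the MT state.
    scale = math.sqrt(dim)
    layer_weights = softmax([dot(mt_state, h) / scale for h in pretrained_layers])
    pretrained_mix = [
        sum(w * h[d] for w, h in zip(layer_weights, pretrained_layers))
        for d in range(dim)
    ]
    # Step 2: a scalar gate decides the proportion of MT vs. pretrained signal.
    gate_input = dot(gate_weights, mt_state + pretrained_mix) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-gate_input))
    fused = [gate * m + (1.0 - gate) * p for m, p in zip(mt_state, pretrained_mix)]
    return fused, layer_weights, gate


if __name__ == "__main__":
    fused, weights, gate = fuse(
        [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 0.0, 0.0]
    )
    print(fused, weights, gate)
```

Calling fuse once per MT encoder layer, each with its own gate parameters, lets the proportion of pretrained and MT information differ from layer to layer.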
In addition, a novel Global Representation Extraction Transformer (GRET) is introduced, leveraging capsule networks and layer‑wise recurrent structures to capture global sentence‑level information that standard self‑attention mechanisms miss.
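To make the capsule idea concrete, the following is a minimal sketch of generic routing‑by‑agreement, the aggregation procedure commonly associated with capsule networks. It is not the GRET code: it assumes each token has already produced a vote vector for each capsule (normally via learned projections, omitted here), it ignores the layer‑wise recurrent component, and the names squash and route are illustrative.

```python
import math


def squash(vector):
    """Shrink a vector so its length falls in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in vector)
    if norm_sq == 0.0:
        return [0.0] * len(vector)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in vector]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(votes, iterations=3):
    """Aggregate token votes into capsule vectors.

    votes[i][j] is token i's vote vector for capsule j (assumed precomputed).
    iterations must be at least 1.
    """
    n_tokens, n_caps, dim = len(votes), len(votes[0]), len(votes[0][0])
    logits = [[0.0] * n_caps for _ in range(n_tokens)]
    capsules = []
    for _ in range(iterations):
        # Each token distributes its attention over the capsules.
        coupling = [softmax(row) for row in logits]
        capsules = []
        for j in range(n_caps):
            summed = [
                sum(coupling[i][j] * votes[i][j][d] for i in range(n_tokens))
                for d in range(dim)
            ]
            capsules.append(squash(summed))
        # Tokens whose votes agree with a capsule are routed to it more strongly.
        for i in range(n_tokens):
            for j in range(n_caps):
                logits[i][j] += sum(a * b for a, b in zip(votes[i][j], capsules[j]))
    return capsules


if __name__ == "__main__":
    votes = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.9, 0.1], [0.1, 0.8]],
        [[-1.0, 0.0], [0.0, 1.2]],
    ]
    print(route(votes))
```

Each resulting capsule summarizes the whole sentence rather than a single position, which is the kind of global information the text says position‑wise self‑attention tends to miss.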
Extensive experiments on public WMT datasets (e.g., Chinese‑English, English‑German) demonstrate that APT improves three major MT tasks by 1.9–2.3 BLEU points, with BERT‑based encoders yielding the best results. GRET, for its part, reduces model parameters by 75% while performing comparably to Transformer‑Big.
Future work will explore multimodal extensions, multilingual pretraining for low‑resource languages, and applying the methods to document‑level and dialogue translation scenarios.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
