From Technology to Experience: Vivo Machine Translation Deployment Practice
This article presents a comprehensive guide to deploying machine translation at Vivo, covering business analysis, algorithm choices beyond standard NMT, language detection challenges, data collection and cleaning, scientific evaluation methods, and engineering optimizations to deliver a seamless user experience.
The article begins with an overview of the growing demand for mobile translation, noting that users spend 6-7 hours a day on their phones and expect translation to be reachable within two interaction steps for a smooth experience.
1. Understanding Business Needs – Before building translation capabilities, it is essential to identify core user groups and high‑frequency scenarios, such as watching foreign media, chatting with overseas friends, and reading international news, and to tailor translation services accordingly.
2. Algorithm Beyond NMT – While Neural Machine Translation (NMT) models form the core, the author stresses that a production system needs more than the model itself: mature open‑source frameworks, careful preprocessing (tokenization, BPE subword segmentation), and supporting modules for language detection, truecasing, detokenization, and sentence splitting.
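To make the BPE step concrete, here is a minimal sketch of how BPE merge rules are learned from word frequencies. This is a toy illustration of the algorithm, not the framework or vocabulary used at Vivo; the sample word counts are invented for demonstration.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict (toy sketch)."""
    # Represent each word as a tuple of symbols with an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere in the vocabulary.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=3)
```

Frequent character sequences get merged first, so common words end up as single units while rare words fall back to smaller subword pieces, which is what keeps the NMT vocabulary compact.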
3. Language Detection – Discusses the multi‑class classification problem of detecting over 100 languages, handling data sparsity, short texts, mixed‑language sentences, and similar language families.
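A classic baseline for this classification problem is character n-gram profiling: build one n-gram frequency profile per language and assign new text to the nearest profile. The sketch below, with invented two-language training samples, is only an illustration of the idea; it does not represent Vivo's actual 100+-language detector, which would need far more data and care around short and mixed-language inputs.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    # Pad so word-boundary n-grams are captured too.
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class NgramLangID:
    """One character n-gram profile per language; nearest profile wins."""
    def __init__(self):
        self.profiles = {}

    def fit(self, lang, samples):
        profile = Counter()
        for s in samples:
            profile.update(char_ngrams(s))
        self.profiles[lang] = profile

    def predict(self, text):
        query = char_ngrams(text)
        return max(self.profiles, key=lambda l: cosine(query, self.profiles[l]))

lid = NgramLangID()
lid.fit("en", ["the cat is in the house", "where is the train station"])
lid.fit("es", ["el gato esta en la casa", "donde esta la estacion de tren"])
```

Character n-grams degrade gracefully on short texts, but distinguishing close language pairs (e.g. within the same family) typically requires larger profiles or a learned classifier on top.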
4. Text Pre‑processing – Covers Chinese word segmentation, English tokenization, handling punctuation, truecasing/detruecasing, and recasing for low‑resource scenarios, emphasizing the correct order of operations.
5. Sentence and Mixed‑Language Splitting – Describes splitting long sentences and separating mixed‑language segments to avoid corrupting native language fragments during translation.
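One common way to separate mixed-language segments is to split the input into runs of different scripts, so each run can be routed to the translator or left untouched. The sketch below handles only the Chinese/non-Chinese case via Unicode ranges; it is an illustration of the idea, not Vivo's actual splitter.

```python
import re

# Alternate between runs of CJK ideographs and runs of everything else.
RUN = re.compile(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+")

def split_mixed(text):
    """Split text into (lang_tag, run) segments by script."""
    segments = []
    for match in RUN.finditer(text):
        run = match.group()
        tag = "zh" if "\u4e00" <= run[0] <= "\u9fff" else "other"
        segments.append((tag, run))
    return segments
```

With this split, a sentence like 请参考 README 文件 keeps its Chinese fragments intact while only the embedded Latin-script span is considered for translation.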
6. Model Domain Adaptation – Presents two methods: fine‑tuning with domain‑specific data and adding adapter layers to encoders/decoders, enabling the model to generate more domain‑appropriate translations.
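The adapter approach can be sketched as a small residual bottleneck inserted into each encoder/decoder layer: the base model stays frozen and only the two small projection matrices are trained on domain data. This dependency-free sketch shows the forward computation h + W_up·relu(W_down·h) with random untrained weights; dimensions and initialization are illustrative, not Vivo's configuration.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

class Adapter:
    """Bottleneck adapter: output = h + W_up @ relu(W_down @ h).
    Only these small matrices are trained for the target domain."""
    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(d_model)
        self.W_down = [[rng.uniform(-scale, scale) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        self.W_up = [[rng.uniform(-scale, scale) for _ in range(d_bottleneck)]
                     for _ in range(d_model)]

    def __call__(self, h):
        z = relu(matvec(self.W_down, h))
        delta = matvec(self.W_up, z)
        # Residual connection: the adapter only nudges the frozen model.
        return [hi + di for hi, di in zip(h, delta)]
```

The residual form means a zero-initialized up-projection starts as an identity function, so adding adapters cannot hurt the pretrained model before domain training begins.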
7. Robustness Enhancements – Lists techniques such as random ground‑truth word replacement, label smoothing, adversarial sample generation, and random data perturbations to improve model stability.
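The random-perturbation idea can be sketched as a simple source-side noising function applied during training data preparation. Probabilities and the replacement vocabulary here are illustrative assumptions, not the article's actual settings.

```python
import random

def perturb(tokens, vocab, p_replace=0.1, p_drop=0.05, seed=None):
    """Randomly replace or drop source tokens so the model learns to
    tolerate noisy, typo-ridden real-world input (sketch)."""
    rng = random.Random(seed)
    out = []
    for t in tokens:
        r = rng.random()
        if r < p_drop:
            continue                      # simulate a missing word
        if r < p_drop + p_replace:
            out.append(rng.choice(vocab)) # simulate a wrong word
        else:
            out.append(t)
    return out
```

In practice such noising is paired with the other listed techniques (label smoothing, adversarial examples) so the model does not overfit to clean parallel text.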
8. Data Quality and Quantity – Emphasizes that data determines the upper bound of translation quality; includes analysis of dataset length distribution, domain distribution, and cleaning steps for both monolingual and bilingual corpora.
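Typical bilingual cleaning steps of the kind described can be sketched as a chain of filters: drop empty lines, over-long sentences, implausible source/target length ratios, and exact duplicates. The thresholds below are illustrative defaults, not the article's values.

```python
def clean_parallel(pairs, max_len=100, max_ratio=3.0):
    """Filter a bilingual corpus of (source, target) pairs (sketch)."""
    seen, kept = set(), []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue                      # empty side
        ls, lt = len(src.split()), len(tgt.split())
        if ls > max_len or lt > max_len:
            continue                      # over-long sentence
        if max(ls, lt) / max(min(ls, lt), 1) > max_ratio:
            continue                      # misaligned length ratio
        if (src, tgt) in seen:
            continue                      # exact duplicate
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

pairs = [
    ("hello world", "bonjour le monde"),
    ("hello world", "bonjour le monde"),        # duplicate
    ("a", "one two three four five six"),       # ratio outlier
    ("", "vide"),                               # empty source
]
kept = clean_parallel(pairs)
```

Real pipelines add language-ID checks and alignment-score filters on top; the point of the article stands either way: cleaning caps the quality any model can reach.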
9. Evaluation – Stresses the need for both objective metrics (multiple BLEU test sets) and subjective human evaluation, especially when BLEU plateaus, to assess fluency, adequacy, and style.
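For reference, the BLEU metric behind those objective test sets combines clipped n-gram precisions with a brevity penalty. This is a simplified single-reference, no-smoothing sketch of the standard formula, not a replacement for an established scorer.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty (single reference, no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp, ref = ngram_counts(hypothesis, n), ngram_counts(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in hyp.items())  # clipped counts
        precisions.append(overlap / max(sum(hyp.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any zero precision collapses the geometric mean
    log_p = sum(math.log(p) for p in precisions) / max_n
    # Penalize hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(log_p)
```

Because BLEU only counts n-gram overlap, two translations with the same score can differ sharply in fluency and style, which is exactly why the article insists on human evaluation once BLEU plateaus.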
10. Engineering Practices – Covers performance optimization (sentence splitting, caching, model compression, batch inference), rapid online bug fixing via retrieval libraries and translation intervention, and establishing a closed‑loop iteration pipeline that incorporates logs, error analysis, and continuous model updates.
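The caching optimization can be sketched with a memoizing front-end over the inference backend: popular queries (UI strings, trending phrases) repeat heavily, so a cache hit skips model inference entirely. `translate_uncached` is a hypothetical stand-in for real batched model inference, not an API from the article.

```python
from functools import lru_cache

def translate_uncached(src_lang, tgt_lang, text):
    # Stand-in for the expensive batched NMT inference call.
    return f"[{src_lang}->{tgt_lang}] {text}"

@lru_cache(maxsize=100_000)
def translate_cached(src_lang, tgt_lang, text):
    # Keyed on (src_lang, tgt_lang, text); a hit returns instantly
    # without touching the model.
    return translate_uncached(src_lang, tgt_lang, text)
```

A production system would also bound entry size and expire stale entries, and would sit alongside the other listed optimizations (sentence splitting, model compression, batch inference) rather than replace them.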
The conclusion reiterates that technology serves product goals, urging practitioners to keep user experience at the forefront when optimizing machine translation systems.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.