
iQIYI Multi‑Language Subtitle Machine Translation: Practice, Model Exploration, and Deployment

iQIYI’s multi‑language subtitle machine‑translation system combines a one‑to‑many transformer, context‑fusion encoding, four custom attention masks, masked language modeling, global decoding loss, reconstruction and error‑correction modules, plus pronoun, idiom and name‑handling tricks, achieving higher quality than third‑party services and even surpassing human translation for several languages.

iQIYI Technical Product Team

On July 3, iQIYI's technology product team hosted the 16th "i Technology Conference" (i技术会) with the theme "NLP and Search". Experts from ByteDance, Qunar, and Tencent were invited to discuss the synergy between NLP and search, and iQIYI expert Zhang Xuanwei presented the practice of multi‑language subtitle machine translation.

The presentation was divided into three parts: background of iQIYI's multi‑language subtitle translation, model exploration and optimization, and real‑world deployment.

Background: In June 2019, iQIYI launched the iQIYI App for global users to support its overseas market expansion. Subtitles for long‑form video are a critical component of that effort, requiring translation into many languages (Thai, Vietnamese, Indonesian, Malay, Spanish, Arabic, etc.). Subtitle translation poses unique challenges: short sentences with high ambiguity, OCR/ASR errors in the source text, heavy reliance on pronouns, and the need for video‑scene context.

Model Exploration:

One‑to‑many translation model: a single model shares parameters across multiple target languages, reducing training and maintenance costs and leveraging cross‑language transfer learning.
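
The steering mechanism can be sketched in a few lines: a target‑language token is prepended to the source so one shared model can serve every language pair. The token names (`<2th>`, `<2vi>`, ...) and the batching helper are illustrative assumptions, not iQIYI's actual vocabulary.

```python
# Minimal sketch of one-to-many input preparation: a single shared model
# is steered toward a target language by a prepended language token.
# Token names below are illustrative assumptions.
LANG_TOKENS = {"th": "<2th>", "vi": "<2vi>", "id": "<2id>", "ms": "<2ms>"}

def prepare_source(tokens, target_lang):
    """Prepend the target-language token so one shared encoder/decoder
    can translate into any supported language."""
    return [LANG_TOKENS[target_lang]] + list(tokens)

def mixed_batches(corpora):
    """Mix one parallel corpus per language into a single training
    stream, so low-resource pairs benefit from cross-language transfer."""
    for lang, pairs in corpora.items():
        for src, tgt in pairs:
            yield prepare_source(src, lang), tgt
```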

Context fusion: a BERT‑style encoder concatenates the previous and next subtitle lines with the central sentence, using special segment embeddings (EA, EB, EC, ED) to distinguish language token, content token, and surrounding context. The surrounding sentences are masked during decoding to avoid interference.
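
A minimal sketch of how such a context‑fused input could be assembled: the integer segment ids (0–3) stand in for the EA/EB/EC/ED embeddings, and a visibility mask mirrors the idea that surrounding lines inform encoding but are hidden from the decoder. All names here are illustrative assumptions.

```python
# Sketch of context-fusion input construction: previous and next subtitle
# lines are concatenated around the central sentence. Segment ids
# (0=language token, 1=previous line, 2=central sentence, 3=next line)
# stand in for the EA/EB/EC/ED segment embeddings.
def build_context_input(lang_tok, prev_line, center, next_line):
    tokens = [lang_tok] + prev_line + center + next_line
    segments = ([0]
                + [1] * len(prev_line)
                + [2] * len(center)
                + [3] * len(next_line))
    # Decoder-visibility mask: only the language token and the central
    # sentence are exposed during decoding; context lines are masked out
    # so they inform encoding without interfering with decoding.
    decoder_mask = ([1]
                    + [0] * len(prev_line)
                    + [1] * len(center)
                    + [0] * len(next_line))
    return tokens, segments, decoder_mask
```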

Enhanced encoder attention: four attention variants (global, local, forward, backward) are introduced via different mask strategies to force each head to learn distinct features.
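
Assuming standard 0/1 attention masks (1 = may attend), the four variants could be constructed as below; the local window radius is an assumed hyperparameter.

```python
import numpy as np

# Illustrative construction of the four encoder attention masks.
# A value of 1 means the row position may attend to the column position.

def global_mask(n):
    """Every position attends to every position (vanilla self-attention)."""
    return np.ones((n, n), dtype=int)

def local_mask(n, radius=1):
    """Each position attends only within a fixed window around itself."""
    i, j = np.indices((n, n))
    return (np.abs(i - j) <= radius).astype(int)

def forward_mask(n):
    """Each position attends to itself and earlier tokens only."""
    return np.tril(np.ones((n, n), dtype=int))

def backward_mask(n):
    """Each position attends to itself and later tokens only."""
    return np.triu(np.ones((n, n), dtype=int))
```

Assigning a different mask to different attention heads forces each head to specialize on a distinct view of the sentence.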

Masked Language Modeling (MLM): random tokens are masked and reconstructed, with the MLM loss weighted and added to the overall loss to improve textual understanding.
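
A dependency‑free sketch of the masking step and the weighted loss combination; the mask rate and MLM weight below are illustrative values, not the values iQIYI used.

```python
import random

# Sketch of the auxiliary MLM objective: a fraction of source tokens is
# replaced with [MASK] and must be reconstructed; the MLM loss is scaled
# by an assumed weight and added to the translation loss.
MASK, MASK_RATE, MLM_WEIGHT = "[MASK]", 0.15, 0.5  # illustrative values

def mask_tokens(tokens, rng):
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < MASK_RATE:
            masked.append(MASK)
            targets.append(tok)   # the model must reconstruct this token
        else:
            masked.append(tok)
            targets.append(None)  # no MLM loss at this position
    return masked, targets

def total_loss(translation_loss, mlm_loss):
    """Weighted sum: the MLM term improves textual understanding without
    dominating the primary translation objective."""
    return translation_loss + MLM_WEIGHT * mlm_loss
```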

Global decoding loss: the decoder predicts a global embedding (e.g., the average embedding of the whole sentence) for each token, encouraging the model to plan ahead rather than rely solely on previously generated tokens.
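
One way to realize this objective, sketched with plain‑Python vectors: the global target is taken to be the mean of the target‑token embeddings (the parenthetical above), and each decoding step is penalized for drifting from it. Everything else here is an assumption.

```python
# Sketch of the global decoding loss: at every step the decoder also
# predicts a sentence-level "global" embedding, nudging it to plan beyond
# the tokens generated so far. Plain-Python vectors keep the sketch
# dependency-free.

def mean_embedding(embeddings):
    """Sentence-level target: the mean of all target-token embeddings."""
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]

def global_loss(step_predictions, target_embeddings):
    """Mean squared error between each step's global prediction and the
    sentence-level mean embedding."""
    g = mean_embedding(target_embeddings)
    total = 0.0
    for pred in step_predictions:
        total += sum((p - t) ** 2 for p, t in zip(pred, g)) / len(g)
    return total / len(step_predictions)
```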

Reconstruction module: a reverse‑translation decoder reconstructs the source sentence from the decoder output, reducing under‑translation and over‑translation.

Error‑correction (T‑TA) module: a transformer with a diagonal attention mask, so each token is predicted from its surrounding context without seeing itself, enabling correction of OCR/ASR errors.
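
The diagonal mask itself is simple to construct; a sketch with NumPy (1 = may attend):

```python
import numpy as np

# Sketch of the T-TA diagonal mask: the diagonal is zeroed so a token
# never attends to itself and must be reconstructed from its context --
# which is what lets the module spot and fix OCR/ASR errors.
def diagonal_mask(n):
    mask = np.ones((n, n), dtype=int)
    np.fill_diagonal(mask, 0)  # 0 = blocked: a position cannot see itself
    return mask
```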

Pronoun translation: video‑scene information (face and voice recognition) is used to align subtitles with characters, and character attributes (gender, age, relationship) are encoded to guide correct pronoun translation.

Idiom translation: pre‑trained BERT encodes Chinese idioms and their definitions, which are injected into the encoder to preserve idiomatic meaning.

Character name translation: special tags and data augmentation are applied so that names (often kept as pinyin) are copied correctly across languages.
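
A hedged sketch of the tagging idea: names are wrapped in special tags before translation, and the tags are stripped afterwards, with data augmentation teaching the model to copy tagged spans verbatim. The `<name>` tag syntax and both helpers are assumptions for illustration.

```python
import re

# Illustrative name-copying pipeline: character names are replaced with a
# tagged form (for Chinese names, typically pinyin) before translation,
# and the tags are removed after translation. The model is trained on
# augmented data to pass tagged spans through unchanged.

def tag_names(sentence, names):
    """names maps each surface form to its copy-through form (e.g. pinyin)."""
    for name, pinyin in names.items():
        sentence = sentence.replace(name, f"<name>{pinyin}</name>")
    return sentence

def restore_names(translated):
    """Strip the special tags from the model's output."""
    return re.sub(r"</?name>", "", translated)
```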

Deployment and Results: After the optimizations, iQIYI evaluated quality‑control error rates across languages. The in‑house model consistently outperformed third‑party services and, in some languages (Malay, Spanish, Arabic), even surpassed human translation. The system now supports translation from Simplified Chinese to Indonesian, Malay, Thai, Vietnamese, Arabic, Traditional Chinese, and others, and is used in the "International Site Video Export" project.

Figure 1: Transformer model used as the base architecture.

Figure 2: Input format that adds language token and segment embeddings.

Figure 3: Fusion of previous and next subtitle lines with the central sentence.

Figure 4: Global, local, forward, and backward attention masks.

Figure 5: MLM training objective.

Figure 6: Global loss encourages future‑aware decoding.

Figure 7: Diagonal mask for the T‑TA error‑correction encoder.

Figure 10: Chinese‑Thai pronoun correspondence table.

Figure 13: Example of name copying with special tags.

Overall, the multi‑language subtitle translation system demonstrates how a combination of one‑to‑many modeling, context fusion, customized attention, MLM, reconstruction, error‑correction, and domain‑specific enhancements (pronouns, idioms, character names) can achieve high‑quality translation for global video content.

Tags: transformer, machine translation, iQIYI, error correction, multilingual NLP, One-to-Many Model, Subtitle Translation
Written by

iQIYI Technical Product Team

The technical product team of iQIYI
