
Comparative Evaluation of DeepL and ChatGPT Machine Translation for Game Localization

This article compares the translation quality of DeepL and ChatGPT on text from the game Naraka: Bladepoint. Their outputs are measured against professional human translations in three directions (Chinese→English, Chinese→Spanish, and English→Spanish) using BLEU scores and manual assessment, revealing the strengths and limitations of each system.

NetEase LeiHuo Testing Center

Recent advances in artificial intelligence have expanded the scope of AIGC (AI-generated content), with tools like ChatGPT demonstrating capabilities in natural language processing, code generation, and content creation. In the gaming industry, AIGC can assist with art, voice-overs, copywriting, and even programming.

This study evaluates whether ChatGPT's translation function can be applied to real-world game localization. Four representative groups of text were selected from the Chinese game Naraka: Bladepoint: skill descriptions, story background, action descriptions, and literary passages. Human translations serve as the reference standard, and the machine translation outputs of DeepL and ChatGPT (version 3.5) are compared against them using BLEU scores and manual evaluation.

Test preparation

ChatGPT version: 3.5

Machine translation tool: DeepL (chosen for its generally higher accuracy on technical and academic texts)

Test languages: Chinese → English, Chinese → Spanish, English → Spanish

Evaluation metric: BLEU score (one of the most widely used automatic metrics for MT quality)
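For readers unfamiliar with the metric, the standard BLEU definition is reproduced here for reference. The article does not state which BLEU variant or tokenization was used, so the usual settings (N = 4, uniform weights w_n = 1/N) are an assumption, and the factor of 100 is only the common reporting convention that matches the scale of the scores discussed in this article.

```latex
\[
\mathrm{BLEU} = 100 \cdot \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\]
```

Here p_n is the clipped (modified) n-gram precision of the machine output against the reference, c is the length of the machine output, and r is the length of the reference. The brevity penalty BP lowers the score of outputs that are shorter than the reference.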

Test method

The four text groups were translated in three directions (Chinese→English, Chinese→Spanish, English→Spanish) by both DeepL and ChatGPT. BLEU scores were calculated for each output against the human reference, and a manual review examined grammar, terminology, idioms, cultural references, and literary quality.
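The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the study: it assumes a single reference, whitespace tokenization (workable for the English and Spanish targets, not for Chinese), no smoothing, and uniform weights over 1- to 4-grams. The function names `ngrams` and `bleu` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU on a 0-100 scale, uniform weights."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipping: an n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            # Without smoothing, one n-gram order with no match zeroes the score.
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize candidates that are not longer than the reference.
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the study's test set.
    print(round(bleu("the cat sat on the mat", "the cat sat on a mat"), 2))
```

Because the score is zero whenever any n-gram order has no match, very short strings such as individual UI terms tend to score 0 under this unsmoothed form. In practice BLEU is usually computed over a whole set of segments or with smoothing, typically through an established library rather than hand-written code.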

Results and analysis

Overall, both systems achieved only one BLEU score above 40, indicating that current MT quality is still far from professional standards.

DeepL outperformed ChatGPT in 7 of the 12 BLEU evaluations (four text groups in three directions), showing higher similarity to the human translations.

English→Spanish translations scored higher than Chinese→Spanish for both tools, likely because English-Spanish parallel corpora are larger and the two languages are more closely related.

Grammar was generally acceptable for both systems, but where a choice required judgment (for example, picking the correct grammatical subject in a skill description), the human translations were better.

Terminology, idioms, cultural references, and mythological allusions were often mistranslated or rendered too literally. For example, "单双排" was translated as "single and double rows" (DeepL) and "single and double formations" (ChatGPT) instead of the correct "Solo and Duos".

Literary passages lost poetic nuance; both DeepL and ChatGPT produced straightforward renderings lacking the original's aesthetic depth.

The analysis confirms that while machine translation can handle basic grammatical structures, it struggles with domain‑specific terminology, cultural nuances, and literary style. Consequently, human post‑editing remains essential for high‑quality game localization.
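One way to support post-editing of the terminology errors noted above is an automated glossary check that flags machine output where an approved term is missing. The sketch below is an illustration only and is not part of the study: the function name `find_glossary_violations`, the glossary structure, and the sample source string are all hypothetical. Only the term pair "单双排" → "Solo and Duos" and the DeepL rendering come from the article.

```python
def find_glossary_violations(source, translation, glossary):
    """Return (source_term, expected_rendering) pairs missing from the translation.

    glossary maps a source-language term to its approved target rendering.
    Matching is a plain case-insensitive substring test, which is an assumption
    made for brevity; real term bases usually need inflection-aware matching.
    """
    lowered = translation.lower()
    return [
        (term, expected)
        for term, expected in glossary.items()
        if term in source and expected.lower() not in lowered
    ]


if __name__ == "__main__":
    glossary = {"单双排": "Solo and Duos"}  # term pair taken from the article
    source = "单双排模式"  # hypothetical source string containing the term
    print(find_glossary_violations(source, "single and double rows mode", glossary))
    print(find_glossary_violations(source, "Solo and Duos mode", glossary))
```

A check like this only catches known terms. Idioms, cultural references, and literary tone still need a human reviewer.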

Conclusion

At the current stage, DeepL and ChatGPT demonstrate solid grammatical performance but fall short in handling game-specific terms, idioms, cultural background, and literary expression. A practical workflow should still rely primarily on human translators, with machine translation serving as an auxiliary tool. As models continue to evolve, ChatGPT's potential may increase, but reliable, nuanced localization will likely remain a collaborative effort between humans and AI.

References

https://www.letsmt.eu/Bleu.aspx

https://cloud.tencent.com/developer/article/1159767

https://arxiv.org/pdf/2301.08745.pdf

Written by

NetEase LeiHuo Testing Center

LeiHuo Testing Center provides high-quality, efficient QA services, striving to become a leading testing team in China.
