
Huawei Translation’s Achievements and Technical Solutions in IWSLT 2022 Speech Translation Tasks

This article reviews Huawei Translation's top-ranking results in the IWSLT 2022 speech translation competition across the speech-to-speech, offline speech-to-text, and length-controlled translation tasks, and details its cascade and end-to-end technical approaches, including domain-controlled ASR, context-aware MT re-ranking, and VITS-based TTS.


The International Conference on Spoken Language Translation (IWSLT) is a leading competition that drives research in speech translation by providing public datasets and challenging shared tasks. The 2022 edition offered seven tasks: simultaneous speech-to-text translation, offline speech-to-text translation, low-resource speech-to-text translation, speech-to-speech translation, dialect speech translation, length-controlled speech translation, and speech style translation.

Huawei Translation achieved first place in four language directions across three tasks: speech‑to‑speech, offline speech translation, and length‑controlled speech translation, outperforming other systems in both automatic (BLEU, chrF) and human evaluations.

Two main technical routes exist for speech‑to‑speech translation (S2ST): an end‑to‑end model that directly maps source audio to target audio, and a cascade system that combines an ASR model, a machine‑translation (MT) model, and a text‑to‑speech (TTS) model. Industry practice currently favors the cascade approach.

End-to-end approach: The first academic end-to-end S2ST model, Translatotron, uses a Seq2Seq architecture to convert source spectrograms into target spectrograms, optionally preserving speaker characteristics via a speaker encoder and generating waveforms with a vocoder.
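
To make that architecture concrete, here is a minimal, hypothetical PyTorch-style sketch of a Translatotron-like spectrogram-to-spectrogram model. It is not the published implementation: the layer sizes, attention scheme, speaker-embedding dimension, and toy tensors are illustrative assumptions, and a separate vocoder would still be needed to turn the predicted spectrogram into audio.

```python
# Illustrative Translatotron-style direct S2ST sketch (assumed shapes/sizes, not the paper's config).
import torch
import torch.nn as nn

class SpectrogramSeq2Seq(nn.Module):
    def __init__(self, n_mels=80, hidden=256, layers=2):
        super().__init__()
        # Encoder: stacked BiLSTM over source mel-spectrogram frames
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=layers,
                               batch_first=True, bidirectional=True)
        # Optional speaker embedding conditions the decoder to retain the source voice
        self.spk_proj = nn.Linear(256, hidden * 2)
        # Attention plus an autoregressive decoder predicting target mel frames
        self.attn = nn.MultiheadAttention(hidden * 2, num_heads=4, batch_first=True)
        self.decoder_cell = nn.LSTMCell(hidden * 2 + n_mels, hidden * 2)
        self.frame_out = nn.Linear(hidden * 2, n_mels)

    def forward(self, src_mels, tgt_mels, spk_emb=None):
        memory, _ = self.encoder(src_mels)            # (B, T_src, 2*hidden)
        if spk_emb is not None:
            memory = memory + self.spk_proj(spk_emb).unsqueeze(1)
        batch, t_tgt, n_mels = tgt_mels.shape
        h = memory.new_zeros(batch, memory.size(-1))
        c = memory.new_zeros(batch, memory.size(-1))
        prev = tgt_mels.new_zeros(batch, n_mels)      # go-frame
        outputs = []
        for t in range(t_tgt):
            ctx, _ = self.attn(h.unsqueeze(1), memory, memory)
            h, c = self.decoder_cell(torch.cat([ctx.squeeze(1), prev], dim=-1), (h, c))
            outputs.append(self.frame_out(h))
            prev = tgt_mels[:, t]                     # teacher forcing during training
        return torch.stack(outputs, dim=1)            # (B, T_tgt, n_mels)

# Toy usage: 2 utterances, 120 source frames -> 100 target frames
model = SpectrogramSeq2Seq()
pred = model(torch.randn(2, 120, 80), torch.randn(2, 100, 80), spk_emb=torch.randn(2, 256))
print(pred.shape)  # torch.Size([2, 100, 80])
```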

End-to-end models offer faster inference, avoid error propagation, retain the speaker's voice, and handle untranslated words better. Recent research includes Translatotron 2 and UWSpeech, but Huawei's experiments showed that cascade systems still deliver higher translation quality.

Cascade approach (used in the competition): Huawei built independent ASR, MT, and TTS models. Key innovations include:

1. Domain-controlled ASR decoding: A domain tag (e.g., <MC> for MuST-C) is added as a prefix token to steer the ASR model toward the desired domain's style, reducing the word error rate (WER) across test sets; a decoding sketch appears after this list.

2. Context-aware MT re-ranking: Inspired by the noisy-channel model, Huawei re-scores translation hypotheses with a sliding-window language model, improving long-sentence translation quality; see the re-ranking sketch below.

3. Pre-trained VITS TTS: Huawei adopts VITS, a conditional VAE with a normalizing-flow prior and an adversarially trained (GAN-based) waveform decoder, which synthesizes speech end-to-end without a separately trained vocoder or an explicit intermediate mel-spectrogram stage; a synthesis example follows this list.
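
A minimal sketch of how the domain tag from point 1 can be injected at decoding time. The beam-search loop is a toy re-implementation, and `asr_decoder_step`, the token ids, and the beam size are hypothetical placeholders rather than HW-TSC's actual decoder interface.

```python
# Toy domain-controlled beam search: every hypothesis is seeded with the domain tag.
from typing import Callable, List, Tuple

def domain_controlled_beam_search(
    asr_decoder_step: Callable[[List[int]], List[Tuple[int, float]]],  # hypothetical: prefix -> [(token_id, log_prob), ...]
    domain_tag_id: int,          # e.g. the vocabulary id of "<MC>"
    bos_id: int,
    eos_id: int,
    beam_size: int = 4,
    max_len: int = 50,
) -> List[int]:
    # Each hypothesis starts with <bos> followed by the domain tag.
    beams = [([bos_id, domain_tag_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:
                candidates.append((tokens, score))
                continue
            # Expand with the top next-token continuations and their log-probabilities.
            for tok, logp in asr_decoder_step(tokens):
                candidates.append((tokens + [tok], score + logp))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
        if all(t[-1] == eos_id for t, _ in beams):
            break
    best_tokens, _ = beams[0]
    # Strip <bos> and the domain tag before detokenizing and passing text to MT.
    return [t for t in best_tokens[2:] if t != eos_id]
```

Because the tag is part of the decoder's learned vocabulary, seeding it constrains every hypothesis toward the chosen domain's transcription style, and it is removed again before the transcript reaches the MT model.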
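The next sketch illustrates the context-aware re-ranking idea from point 2: each n-best hypothesis is re-scored by interpolating its MT score with a language-model score computed over a sliding window of previously selected translations. The `mt_scores` and `lm_score` interfaces, the window size, and the interpolation weight are assumptions for illustration, not the competition settings.

```python
# Toy context-aware n-best re-ranking with a sliding-window LM score.
from typing import Callable, List

def rerank_with_context(
    nbest: List[str],                  # candidate translations for the current sentence
    mt_scores: List[float],            # log P_MT(candidate | source) from the MT model
    prev_translations: List[str],      # translations already chosen for earlier sentences
    lm_score: Callable[[str], float],  # hypothetical: log P_LM(text)
    window: int = 2,                   # number of previous sentences in the context
    lm_weight: float = 0.3,
) -> str:
    context = " ".join(prev_translations[-window:])
    best, best_score = nbest[0], float("-inf")
    for cand, mt in zip(nbest, mt_scores):
        # Subtracting the context-only score approximates log P_LM(cand | context).
        contextual_lm = lm_score((context + " " + cand).strip()) - (
            lm_score(context) if context else 0.0
        )
        total = mt + lm_weight * contextual_lm
        if total > best_score:
            best, best_score = cand, total
    return best
```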
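For point 3, a pre-trained VITS checkpoint can be loaded through the ESPnet2 TTS inference API described in reference [7]; a hedged example follows. The model tag is a public example from the ESPnet model zoo and is not necessarily the checkpoint Huawei used in the competition.

```python
# Example: text-to-waveform synthesis with a pre-trained VITS model via ESPnet2
# (requires the espnet and espnet_model_zoo packages; model tag is illustrative).
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# VITS maps text to waveform directly, so no separate vocoder is attached.
tts = Text2Speech.from_pretrained(model_tag="kan-bayashi/ljspeech_vits")

output = tts("Speech to speech translation brings people closer together.")
sf.write("out.wav", output["wav"].cpu().numpy(), tts.fs)
```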

The technologies described have been deployed in Huawei products such as HarmonyOS, HMS Core, and Huawei Cloud, providing features like photo translation, full‑screen translation, face‑to‑face simultaneous translation, and subtitle translation for both Huawei and non‑Huawei users.

References:
[1] Findings of the IWSLT 2022 Evaluation Campaign.
[2] The HW-TSC's Speech-to-Speech Translation System for IWSLT 2022.
[3] The HW-TSC's Offline Speech Translation System for IWSLT 2022.
[4] The HW-TSC's Offline Speech Translation Systems for IWSLT 2021.
[5] Direct speech-to-speech translation with a sequence-to-sequence model.
[6] Conditional Variational Autoencoder with Adversarial Learning for End-to-End TTS.
[7] ESPnet2-TTS: Extending the Edge of TTS Research.
[8] YourTTS: Towards Zero-Shot Multi-Speaker TTS and Voice Conversion.
[9] Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.
[10] Hierarchical Generative Modeling for Controllable Speech Synthesis.
[11] Conditional End-to-End Audio Transforms.
[12] MuST-C: a Multilingual Speech Translation Corpus.
[13] CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus.
[14] TED-LIUM 3: Twice as Much Data and Corpus Repartition for Speaker Adaptation.

Tags: TTS, end-to-end, cascade model, ASR, Huawei, speech translation, IWSLT, MT
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
