
Technical Analysis of Huawei’s Offline Speech‑to‑Text and Length‑Constrained Speech Translation Systems in IWSLT 2022

This article reviews the IWSLT 2022 competition tasks, explains Huawei’s cascade offline speech‑to‑text translation pipeline, details four major technical innovations—including ensemble‑based ASR de‑noising, context‑aware re‑ranking, domain‑controlled training, and length‑control strategies—and presents experimental results demonstrating Huawei’s leading performance across multiple language directions.

DataFunTalk

The International Conference on Spoken Language Translation (IWSLT) is a leading benchmark for speech translation, with tracks including offline speech‑to‑text translation, length‑constrained speech translation, and speech‑to‑speech translation. In the 2022 edition, Huawei’s system (HW‑TSC) took first place in four language directions across the offline speech‑to‑text, length‑constrained, and speech‑to‑speech tasks.

Offline Speech‑to‑Text Translation

Offline speech translation can be implemented via two mainstream approaches: an end‑to‑end model that directly maps source audio to target text, or a cascade model that first performs automatic speech recognition (ASR) and then machine translation (MT). Huawei adopts the cascade approach, integrating a separately trained ASR model and an MT model.

The cascade pipeline faces several challenges:

Cascade error amplification: Errors from the ASR output are often magnified by the MT model.

Context consistency: Sentence‑level ASR segmentation can cause tense, name, and pronoun inconsistencies.

Domain mismatch in ASR: Training and evaluation data come from different domains, so the ASR model faces higher perplexity at test time.

Domain mismatch in MT: Pre‑trained MT models are trained on generic data, which may not match the speech‑translation domain.

Huawei proposes four key solutions:

Technical point 1 – Ensemble‑based ASR de‑noising: The U2 model is used to detect and filter noisy audio, and heterogeneous ASR models are ensembled to improve recognition accuracy.
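A minimal sketch of the ensembling half of this idea, assuming a simple posterior-averaging fusion over per-frame output distributions (the article does not specify the fusion scheme, and the U2 noise filtering is not shown):

```python
import numpy as np

def ensemble_decode(logprob_sets, weights=None):
    """Average per-frame probabilities from several ASR models and
    pick the most likely token at each frame (greedy decode)."""
    if weights is None:
        weights = [1.0 / len(logprob_sets)] * len(logprob_sets)
    # Weighted average in probability space; disagreements between
    # models are smoothed out before the argmax.
    probs = sum(w * np.exp(lp) for w, lp in zip(weights, logprob_sets))
    return probs.argmax(axis=-1)

# Two toy "models" emitting 3 frames over a 4-token vocabulary.
m1 = np.log(np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.2, 0.6, 0.1, 0.1],
                      [0.1, 0.1, 0.7, 0.1]]))
m2 = np.log(np.array([[0.6, 0.2, 0.1, 0.1],
                      [0.1, 0.7, 0.1, 0.1],
                      [0.1, 0.2, 0.6, 0.1]]))
print(ensemble_decode([m1, m2]))  # frame-wise token ids: [0 1 2]
```

In a real system the fused scores would feed a beam-search decoder rather than a per-frame argmax.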

Technical point 2 – Context‑aware re‑ranking: A language model (GPT) re‑ranks beam‑search candidates based on preceding ASR outputs, reducing context inconsistencies; a similar re‑ranking is applied to MT outputs.
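The re-ranking step can be sketched as follows; `toy_lm_score` is a stand-in for the GPT log-likelihood the article describes, chosen only to make the pronoun-consistency effect visible:

```python
def rerank(candidates, context, lm_score):
    """Pick the beam candidate that the language model judges most
    probable as a continuation of the preceding ASR output."""
    return max(candidates, key=lambda c: lm_score(context + " " + c))

# Toy stand-in for a GPT score: reward candidates that reuse the
# pronoun already established by the context.
def toy_lm_score(text):
    return text.count("she")

context = "dr. lee said she would present the results."
candidates = ["he then showed the slides.", "she then showed the slides."]
print(rerank(candidates, context, toy_lm_score))
# -> "she then showed the slides."
```

The same mechanism applies to MT n-best lists: score each candidate conditioned on the previously translated sentences and keep the highest-scoring one.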

Technical point 3 – Domain‑controlled training & decoding: Domain tags (e.g., <MC>, <LS>) are added as prefix tokens to guide the model toward domain‑specific styles.

Technical point 4 – Large‑scale pre‑training & domain fine‑tuning: A generic MT model is first trained on massive WMT data, then fine‑tuned on speech‑related corpora (e.g., TED) with regularization to avoid over‑fitting.

Length‑Constrained Speech Translation

To generate translations whose length matches the source, Huawei employs several strategies:

Technical point 1 – Low‑resource model enhancement: Model sharing, multilingual training (en‑de ↔ de‑en), R‑Drop, data diversification, and ensemble inference boost performance under limited data.
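Of the techniques listed, R-Drop is the most self-contained to illustrate: it runs each batch through the model twice with independent dropout masks and penalizes divergence between the two output distributions. A NumPy sketch of the loss on precomputed logits (the weighting `alpha` and the exact reduction are assumptions, not taken from the article):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rdrop_loss(logits_a, logits_b, nll_a, nll_b, alpha=5.0):
    """R-Drop: average the cross-entropy of two dropout passes and add
    a symmetric KL term pulling their output distributions together."""
    p, q = softmax(logits_a), softmax(logits_b)
    kl_pq = (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()
    kl_qp = (q * (np.log(q) - np.log(p))).sum(axis=-1).mean()
    return 0.5 * (nll_a + nll_b) + alpha * 0.5 * (kl_pq + kl_qp)

# Sanity check: identical passes incur no KL penalty.
x = np.array([[1.0, 2.0, 3.0]])
print(rdrop_loss(x, x, nll_a=1.0, nll_b=1.0))  # 1.0
```

In training, `logits_a` and `logits_b` would come from two stochastic forward passes of the same batch, so the KL term acts as a consistency regularizer.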

Technical point 2 – Length‑token strategy: Special tokens (short, normal, long) are prefixed to the source sentence to control output length.
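During training the token is derived from the observed target/source length ratio, so the model learns what each token means; the bucket thresholds below are illustrative, not the article's values:

```python
def length_token(src_len, tgt_len, low=0.9, high=1.1):
    """Assign a length-control token from the target/source length
    ratio (thresholds here are assumed for illustration)."""
    ratio = tgt_len / src_len
    if ratio < low:
        return "short"
    if ratio > high:
        return "long"
    return "normal"

def tag_for_training(src_tokens, tgt_tokens):
    """Prefix the source with the token matching the reference length."""
    return [length_token(len(src_tokens), len(tgt_tokens))] + src_tokens

print(tag_for_training("a b c d e".split(), "x y z".split()))
# ['short', 'a', 'b', 'c', 'd', 'e']  (ratio 3/5 = 0.6)
```

At inference time, prefixing "normal" asks the model for an output of roughly the source length.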

Technical point 3 – Length‑encoding strategy: Positional encodings are modified so that the first token receives the source length, the second token receives length‑1, and so on, enabling direct length control.
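The countdown can be plugged into a standard sinusoidal encoding: instead of positions 0, 1, 2, …, the decoder sees how many tokens remain, and is trained to stop at 0. A sketch (the choice of sinusoidal encoding and `d_model` are assumptions):

```python
import numpy as np

def countdown_positions(desired_len):
    """Positions count down from the desired output length, so the
    decoder always 'knows' how many tokens remain before it must stop."""
    return np.arange(desired_len, 0, -1)

def sinusoidal(positions, d_model=8):
    """Standard sinusoidal encoding evaluated at arbitrary positions."""
    i = np.arange(d_model // 2)
    angles = positions[:, None] / (10000 ** (2 * i / d_model))[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

print(countdown_positions(5))        # [5 4 3 2 1]
print(sinusoidal(countdown_positions(5)).shape)  # (5, 8)
```

Setting the starting position to the source length at inference time directly enforces the desired output length.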

Technical point 4 – Length‑controlled non‑autoregressive decoding (NAT): The source token count is used as the target token count; non‑autoregressive models such as HI‑CMLM and Diformer are trained under this constraint.

Technical point 5 – Length‑aware beam search and re‑ranking: An n‑best list (n=12) is generated, candidates with lengths closest to the source are selected, and ensemble re‑ranking picks the best final translation.
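The selection step can be sketched as a two-stage filter: first keep the candidates whose length is closest to the source, then break ties with the (ensemble) model score. The tie-breaking rule below is an assumed simplification of the article's re-ranking:

```python
def pick_by_length(nbest, src_len, scores):
    """From an n-best list, keep candidates whose token count is
    closest to the source length, then take the highest-scoring one."""
    gaps = [abs(len(c.split()) - src_len) for c in nbest]
    best_gap = min(gaps)
    pool = [i for i, g in enumerate(gaps) if g == best_gap]
    return nbest[max(pool, key=lambda i: scores[i])]

nbest = ["a b c", "a b c d", "a b"]
print(pick_by_length(nbest, src_len=3, scores=[0.5, 0.9, 0.8]))
# -> "a b c" (exact length match beats the higher-scoring candidates)
```

With n=12 candidates per sentence, this filter is what pushes length compliance to 100% while preserving most of the translation quality.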

Experimental results on the tst‑COMMON test set show that the enhanced models significantly improve BLEU and BERTScore, while length‑control methods (Length‑Token, Length‑Encoding, NAT) achieve high length compliance (LC). Combining these methods with re‑ranking yields translations with both high quality and 100% length compliance.

References are provided for all cited datasets and methods, including IWSLT 2022 evaluation reports, HW‑TSC system papers, MuST‑C, CoVoST, TED‑LIUM, LibriSpeech, and various recent works on dropout regularization, data diversification, length control, and non‑autoregressive translation.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
