Boosting Chinese‑English Code‑Switching Speech Recognition with Language ID and LM Enhancements

This report details a series of experiments on Chinese‑English mixed‑language speech recognition, introducing language‑identification loss and language‑model integration to improve acoustic modeling, reduce mixed error rates, and achieve significant gains over a baseline end‑to‑end ASR system.

Zuoyebang Tech Team

Research Background

Speech is a primary medium of human communication, and with the rise of voice assistants and IoT devices, accurate speech recognition has become critical. Recognition errors propagate downstream and cause interaction failures, which makes research on speech recognition both academically and practically valuable.

Industry Status

Most current ASR systems target monolingual speech and handle Chinese‑English code‑switching poorly. Developing a dedicated code‑switching speech recognition (CSSR) system is therefore essential.

Key challenges include non‑native accent influence, acoustic modeling difficulties due to differing phoneme sets, and scarcity of labeled mixed‑language data. Traditional monolingual acoustic models (e.g., DNN‑HMM) struggle with these issues, while end‑to‑end models can better capture language transition cues.

Zuoyebang Practice

3.1 Experiment Details

We used ~1000 h of teacher‑lecture mixed English‑Chinese data, with 30 h for development, 6.7 h for testing, and the remainder for training. Chinese characters were modeled at the character level, while English used sub‑word units to balance acoustic duration.
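The mixed modeling units described above can be sketched as a simple tokenizer: Chinese text is split into individual characters, while English spans are kept as word-level tokens (a real system would further split them into sub‑word units, e.g. with BPE; that step is omitted here for brevity).

```python
import re

def tokenize_mixed(text):
    """Split a mixed Chinese-English string into modeling units:
    individual characters for Chinese, whole words for English.
    (A production system would further split English words into
    sub-word units; that is omitted in this sketch.)"""
    units = []
    for match in re.finditer(r"[A-Za-z']+|[\u4e00-\u9fff]", text):
        token = match.group()
        if re.match(r"[A-Za-z]", token):
            units.append(token.lower())   # English word (sub-word split omitted)
        else:
            units.append(token)           # single Chinese character
    return units
```

For example, `tokenize_mixed("我们学习English")` yields four Chinese character units followed by one English unit.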

Evaluation metrics: character error rate (CER) for Chinese, word error rate (WER) for English, combined as mixed error rate (MER).
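Concretely, MER can be computed as an edit distance over the mixed unit sequence, where Chinese contributes characters and English contributes words. A minimal sketch (the function names are illustrative, not from the report):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences,
    using a single rolling DP row."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def mixed_error_rate(ref_units, hyp_units):
    """MER over a mixed unit sequence: Chinese as characters,
    English as words."""
    return edit_distance(ref_units, hyp_units) / len(ref_units)
```

Scoring both languages in one sequence avoids having to segment hypotheses by language before evaluation.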

3.2 Baseline Model

The baseline employed the Wenet framework, which adopts a Conformer encoder with joint CTC/attention loss. This architecture serves as the foundation for subsequent improvements.
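The joint objective interpolates the two losses, L = w·L_CTC + (1 − w)·L_att. A minimal sketch of the combination (the default weight 0.3 is a common Wenet-style choice, not necessarily the value used in these experiments):

```python
def joint_ctc_attention_loss(ctc_loss, att_loss, ctc_weight=0.3):
    """Interpolate CTC and attention losses for hybrid training:
    L = w * L_ctc + (1 - w) * L_att.
    ctc_weight is a tunable hyperparameter; 0.3 is a common default."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss
```

The CTC branch enforces monotonic alignment while the attention branch models richer label dependencies; the weight trades off the two.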

3.3 Language‑ID Joint Training

We introduced a language‑identification (LID) loss at both the frame and token levels. Two LID variants were explored: shared‑attention (LID‑shared) and independent‑attention (LID‑indep). Experiments showed LID‑indep consistently outperformed the baseline, reducing MER by ~1.76 % relative and achieving 98 % LID accuracy.
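The token‑level LID supervision can be derived directly from the modeling units, since each unit is unambiguously Chinese or English. A minimal sketch of generating those auxiliary labels (label values are illustrative):

```python
def token_lid_labels(units):
    """Assign a language-ID label to each modeling unit:
    0 = Chinese (CJK character), 1 = English (sub-word/word).
    These per-token labels supervise an auxiliary LID loss
    alongside the CTC/attention objectives."""
    return [0 if "\u4e00" <= u[0] <= "\u9fff" else 1 for u in units]
```

Frame‑level labels would additionally require an alignment between frames and tokens, e.g. from the CTC branch.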

3.4 Language Model Enhancement

We integrated a language model into the end‑to‑end system via TLG (token/lexicon/grammar WFST) decoding. Adding TLG reduced MER from 5.74 % to 5.59 % (≈2.7 % relative improvement) and reduced substitution errors.
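A full WFST composition is beyond the scope of a snippet, but the effect of adding an external LM score can be illustrated with simple n‑best rescoring, where each hypothesis's acoustic log‑probability is combined with a weighted LM log‑probability (names and the weight below are illustrative, not the report's configuration):

```python
import math

def rescore_nbest(nbest, lm_scores, lm_weight=0.5):
    """Pick the best hypothesis after adding an external LM score.
    nbest: list of (hypothesis, acoustic_log_prob);
    lm_scores: parallel list of LM log-probabilities."""
    best, best_score = None, -math.inf
    for (hyp, am_lp), lm_lp in zip(nbest, lm_scores):
        score = am_lp + lm_weight * lm_lp
        if score > best_score:
            best, best_score = hyp, score
    return best
```

The LM shifts probability mass toward fluent word sequences, which is why substitution errors in particular drop.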

3.5 Final Experiment Comparison

To increase robustness, we added ~50 h of data and applied LID‑indep + TLG. The best configuration (CTC + Attention + LID‑indep + TLG) achieved MER 5.08 %, CH‑WER 4.23 %, EN‑WER 11.14 %.

After filtering out interjections and normalizing pronouns, the final system reduced MER by 6.96 % relative to the baseline, with CH‑WER down 6.41 % and EN‑WER down 8.24 %.

Conclusion and Outlook

Improvements stem from three aspects: (1) model‑level LID integration (+1.76 % relative MER gain), (2) data‑level augmentation (+3.1 % relative), and (3) language‑model integration (+2.5 % relative). Overall, the system outperforms the baseline by ~7.8 %.

Future work includes parameter tuning, pre‑training, and semi‑supervised generation of large mixed‑language corpora.

