
Didi's Attention-Based End-to-End Mandarin Speech Recognition: A Detailed Review

Didi’s attention‑based end‑to‑end Mandarin speech recognizer, built on the Listen‑Attend‑Spell architecture, directly models roughly 5,000 common Chinese characters. It delivers 15–25% relative accuracy gains over the prior LSTM‑CTC system while cutting model size, latency, and server requirements, and it simplifies training by eliminating the separate acoustic, pronunciation, and language model components.

Didi Tech

AI Frontline presents a thorough analysis of Didi's recent arXiv paper, “A comparable study of modeling units for end-to-end Mandarin speech recognition.” The paper reports that Didi’s attention‑based system directly models roughly 5,000 common Chinese characters as output units, achieving a 15%–25% relative performance gain over its previous LSTM‑CTC system.

Original paper: https://arxiv.org/pdf/1805.03832.pdf

The article outlines three stages in the evolution of speech recognition:

Deep Neural Network – Hidden Markov Model (DNN‑HMM) based systems.

Connectionist Temporal Classification (CTC) based end‑to‑end systems.

Attention‑based end‑to‑end systems.

Around 2010, researchers such as Dong Yu and Li Deng introduced CD‑DNN‑HMM models, which surpassed traditional GMM‑HMM systems by more than a 20% relative improvement. The success of LSTM‑CTC models further shifted focus to fully end‑to‑end approaches, with Google and Baidu adopting CTC for large‑scale speech services.

Attention mechanisms, originally popularized in neural machine translation (e.g., Google’s GNMT), have recently been applied to speech recognition. At ICASSP 2018, Google demonstrated that attention‑based seq2seq models outperform other architectures on English speech tasks.

In the attention‑based seq2seq framework, speech recognition is treated as a variable‑length audio‑to‑text transformation, jointly learning acoustic and linguistic information. To reduce latency for real‑time use, a Neural Transducer approach splits the audio stream into fixed‑length blocks (e.g., 300 ms) and decodes each block incrementally.
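The block‑splitting step above is straightforward to illustrate. A minimal sketch, assuming 10 ms feature frames so that a 300 ms block spans 30 frames (the frame values and function name are illustrative, not from the paper):

```python
def split_into_blocks(frames, frames_per_block=30):
    """Split a frame sequence into fixed-length blocks so a Neural
    Transducer style decoder can emit tokens after each block instead
    of waiting for the whole utterance."""
    return [frames[i:i + frames_per_block]
            for i in range(0, len(frames), frames_per_block)]

# 75 frames (~750 ms of audio) -> two full 300 ms blocks plus a remainder
blocks = split_into_blocks(list(range(75)))
```

The decoder then attends only within (and before) the current block, which bounds latency to roughly one block length.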

The Didi system adopts the Listen‑Attend‑Spell (LAS) architecture, originally proposed by William Chan et al. LAS consists of three components:

Listener (Encoder): maps the input feature sequence X = {x₁,…,x_T} to a high‑level representation h⁽enc⁾ using multi‑layer RNNs.

Attender: learns the alignment between h⁽enc⁾ and the output token sequence Y = {y₁,…,yₙ}.

Speller (Decoder): generates the output token distribution conditioned on previous predictions, the attender output, and its own state.
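The attender is the piece that ties the three components together. A minimal sketch of one dot‑product attention step in NumPy, assuming the listener has already produced encoder frames and the speller holds a current state (the shapes and function name are illustrative; the paper's actual attention may use a learned scoring network and multiple heads):

```python
import numpy as np

def attend(h_enc, s_dec):
    """One attention step: score each encoder frame against the current
    decoder state, normalize with softmax over time, and return the
    context vector plus the alignment weights."""
    scores = h_enc @ s_dec                    # (T,) dot-product scores
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    context = weights @ h_enc                 # (d,) weighted sum of frames
    return context, weights

rng = np.random.default_rng(0)
h_enc = rng.standard_normal((50, 8))  # 50 encoder frames, feature dim 8
s_dec = rng.standard_normal(8)        # current speller state
context, align = attend(h_enc, s_dec)
```

The speller consumes `context` together with the previous output embedding to predict the next character's distribution.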

Key engineering tricks used in training include scheduled sampling, label smoothing, and multi‑head attention. Didi found that using approximately 5,000 common Chinese characters as modeling units yields significantly better performance than phoneme‑based CTC systems.
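Of these tricks, label smoothing is the simplest to show concretely: instead of training against one‑hot character targets, a small probability mass is spread across the rest of the vocabulary. A minimal sketch, assuming a ~5,000‑character vocabulary and a uniform smoothing distribution (the function name and ε value are illustrative):

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    """Replace one-hot targets with smoothed distributions: the correct
    character keeps 1 - eps, and the remaining vocab_size - 1 characters
    share eps uniformly. This discourages over-confident predictions."""
    n = len(target_ids)
    dist = np.full((n, vocab_size), eps / (vocab_size - 1))
    dist[np.arange(n), target_ids] = 1.0 - eps
    return dist

targets = smooth_labels([3, 1], vocab_size=5000, eps=0.1)
```

Each row remains a valid probability distribution, so the cross‑entropy loss can be used unchanged.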

During decoding, a small beam size (e.g., 4–8) suffices for the LAS model, compared to thousands of paths retained in traditional HMM or CTC decoders. Moreover, the optimal weight for an external N‑gram language model is low (0.1–0.3), whereas HMM‑based systems typically require weights of 10–20.
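The combination of a small beam and a lightly weighted external LM is often called shallow fusion. A minimal sketch, assuming each hypothesis carries a LAS log‑probability and an N‑gram LM log‑probability (the hypothesis strings, scores, and function name are hypothetical):

```python
def rescore_beam(hyps, lm_weight=0.2, beam=4):
    """Shallow fusion: add a lightly weighted external LM score to each
    hypothesis, then keep only the top `beam` candidates. The LAS score
    dominates, unlike HMM decoders where the LM weight is 10-20."""
    fused = [(text, las_lp + lm_weight * lm_lp) for text, las_lp, lm_lp in hyps]
    fused.sort(key=lambda h: h[1], reverse=True)
    return fused[:beam]

# Hypothetical (text, LAS log-prob, LM log-prob) candidates: the second
# is slightly preferred acoustically but heavily penalized by the LM.
hyps = [("today weather", -2.0, -3.0), ("today weathr", -1.9, -9.0)]
best = rescore_beam(hyps)
```

With `lm_weight=0.2` the fused scores are −2.6 and −3.7, so the LM tips the ranking without overriding the acoustic model.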

Performance results reported by Didi:

Non‑real‑time LAS model: ~25% relative improvement over the previous system.

Real‑time Neural Transducer model: ~15% relative improvement.

Model size: LAS is about 1/5 the size of the CTC baseline.

Decoding speed: latency reduced to ¼ of the original, QPS increased fourfold, and server count reduced by ~75%.

The article concludes that attention‑based end‑to‑end models not only boost accuracy but also simplify the training pipeline by eliminating the need for separate acoustic, pronunciation, and language model components.

References

[1] G. Dahl et al., “Context‑Dependent Pre‑trained Deep Neural Networks for Large Vocabulary Speech Recognition,” IEEE TASLP, 2012.

[2] H. Sak et al., “Long Short‑term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling,” INTERSPEECH, 2014.

[3] Y. Wu et al., “Google’s Neural Machine Translation System,” arXiv:1609.08144, 2016.

[4] C. Chiu et al., “State‑of‑the‑art Speech Recognition with Sequence‑to‑Sequence Models,” ICASSP, 2018.

[5] T. Sainath et al., “Improving the Performance of Online Neural Transducer Models,” arXiv:1712.01807, 2017.

[6] W. Chan et al., “Listen, Attend and Spell,” ICASSP, 2016.

[7] W. Zou et al., “A Comparable Study of Modeling Units for End‑to‑End Mandarin Speech Recognition,” arXiv:1805.03832, 2018.

Tags: neural networks, Attention, Speech Recognition, end-to-end, LAS, Mandarin
Written by Didi Tech
Official Didi technology account