Didi Voice Interaction: ASR Error Correction, Intent Classification, and NER Techniques
This article presents Didi's voice interaction platform, detailing the natural language understanding pipeline, ASR error correction methods, intent classification strategies, and named entity recognition models, while discussing practical deployments, performance gains, and future research directions.
Didi's voice interaction system aims to make travel safer and more convenient by leveraging speech as a natural, hands‑free interface. The overall architecture integrates automatic speech recognition (ASR), error correction, intent classification, and downstream dialogue processing.
ASR error correction addresses substitution, deletion, and insertion errors, with a focus on phonetic confusions. The traditional pipeline has three stages: error detection, candidate recall, and ranking. Error detection runs an Aho–Corasick automaton built from confusion sets and keyword syllable patterns. Candidate recall queries high‑order pinyin n‑gram keys first and falls back to lower‑order keys until enough candidates are recalled. Ranking converts a recurrent neural network language model (RNNLM) into a weighted finite‑state transducer (WFST) and applies Viterbi decoding, yielding a 2 % character‑accuracy gain in production.
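The high‑order‑first recall with lower‑order fallback can be sketched as follows. This is a minimal illustration, not Didi's implementation: the index keys are pinyin‑syllable n‑grams, the vocabulary is assumed to be pre‑converted to syllable sequences, and the `min_hits` threshold is a hypothetical knob for "sufficient suggestions".

```python
from collections import defaultdict

def build_pinyin_index(vocab, max_n=3):
    """Index each vocabulary phrase (given as a tuple of pinyin syllables)
    under every one of its n-gram keys, n = 1..max_n."""
    index = defaultdict(set)
    for phrase, syllables in vocab.items():
        for n in range(1, max_n + 1):
            for i in range(len(syllables) - n + 1):
                index[tuple(syllables[i:i + n])].add(phrase)
    return index

def recall_candidates(index, query_syllables, max_n=3, min_hits=2):
    """Query the highest-order pinyin n-grams first; fall back to
    lower orders until enough candidates have been recalled."""
    hits = set()
    for n in range(min(max_n, len(query_syllables)), 0, -1):
        hits = set()
        for i in range(len(query_syllables) - n + 1):
            hits |= index.get(tuple(query_syllables[i:i + n]), set())
        if len(hits) >= min_hits:
            return hits
    return hits
```

With two phrases sharing the bigram ("da", "kai"), a trigram query that matches only one phrase falls back to bigrams and recalls both, which is the behavior the fallback is meant to guarantee.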
Intent classification is tackled with two paradigms: end‑to‑end classifiers (FastText, TextCNN, BERT classifier) for task‑oriented intents, and retrieval‑based methods (BM25, DSSM, Siamese‑LSTM, Sentence‑BERT) for question‑answering scenarios. The retrieval pipeline first uses representation‑based similarity (pre‑trained embeddings, or BERT adapted with CLS/mean/max pooling) for fast recall, then applies interaction‑based Siamese‑BERT models with regression or classification objectives for precise re‑ranking.
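The recall stage of that pipeline reduces to nearest‑neighbor search under cosine similarity. A minimal sketch, with a toy bag‑of‑words vector standing in for the pre‑trained or pooled BERT sentence embedding (the `embed`, `cosine`, and `retrieve` names are illustrative, not from the article):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; in the described pipeline this would
    be a pre-trained or CLS/mean/max-pooled BERT sentence vector."""
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, faq, top_k=2):
    """Representation-based fast recall: rank the FAQ pool by similarity
    to the query and keep the top-k for downstream re-ranking."""
    q = embed(query)
    return sorted(faq, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]
```

In the full pipeline the recalled top‑k would then be paired with the query and scored by an interaction‑based Siamese‑BERT re‑ranker; only that second stage sees both sentences jointly.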
Named entity recognition (NER) employs Bi‑GRU‑CRF and BERT‑CRF models. The CRF layer refines token‑level predictions by learning label transition scores, improving F1 from roughly 84 % to roughly 92 % with BERT‑CRF, at a higher computational cost.
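How the CRF layer refines per‑token predictions can be seen in its Viterbi decoding: the best path maximizes emission scores plus learned transition scores, so a tag that is locally best can be overridden when the transition into it is weak. A minimal sketch (dict‑based scores stand in for the model's learned parameters; unspecified transitions default to 0):

```python
def viterbi_decode(emissions, transitions, labels):
    """CRF-style Viterbi decoding.
    emissions: one {label: score} dict per token (e.g. from Bi-GRU/BERT);
    transitions: {(prev_label, cur_label): score}, default 0.0.
    Returns the highest-scoring label sequence."""
    # Initialize with the first token's emission scores.
    best = {l: (emissions[0][l], [l]) for l in labels}
    for em in emissions[1:]:
        new = {}
        for cur in labels:
            # Best predecessor = max over prev of path score + transition + emission.
            new[cur] = max(
                (best[prev][0] + transitions.get((prev, cur), 0.0) + em[cur],
                 best[prev][1] + [cur])
                for prev in labels
            )
        best = new
    return max(best.values())[1]
```

In the example below the last token's emissions favor "B", but the strong B→I transition (and the penalized O→I transition) make the CRF output "I" instead, which is exactly the kind of label‑consistency correction the article credits to the CRF layer.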
Practical deployments include Didi's driver voice assistant, which now supports temperature setting and weather queries, with a 1 % absolute character‑accuracy increase and a 7.24 % rise in interaction success rate. Future work proposes leveraging richer ASR outputs (e.g., lattices), incorporating acoustic and user‑specific features into ranking, and exploring end‑to‑end seq2seq or BERT‑based correction models.
The article also references the open‑source Delta platform (https://github.com/didi/delta) that streamlines training, testing, and deployment of speech‑semantic algorithms, and lists several academic papers underpinning the described techniques.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.