
Didi Speech Interaction: ASR Error Correction, Intent Classification, and NER Techniques

Didi’s voice‑interaction platform combines a three‑stage ASR error‑correction pipeline, optimized intent‑classification models (both end‑to‑end and retrieval‑based), and advanced Chinese NER using Bi‑GRU‑CRF and BERT‑CRF, boosting transcription accuracy and overall dialogue success while supporting scalable deployment and future enhancements such as lattice inputs and richer acoustic signals.

Didi Tech

Didi aims to make travel safer and more convenient through voice interaction. Voice interaction relies on natural language understanding (NLU) to correctly interpret user utterances. This article introduces Didi's exploration and practice of NLU technologies in speech interaction.

1. Overall Algorithm Framework

The speech interaction system consists of multiple modules, including an interaction engine and an ASR correction engine. The overall pipeline (Figure 1) processes voice input, performs ASR, applies error correction, and feeds the corrected text into downstream NLU components.

2. ASR Error Correction

2.1 Overview

ASR results often contain substitution, deletion, and insertion errors, with substitution errors accounting for 70–80% of mistakes. Correcting these errors improves semantic understanding and downstream dialogue success rates.

2.1.2 Research Status

Recent end‑to‑end models (e.g., soft‑masked BERT, GCN‑enhanced models) achieve strong performance but require large amounts of high‑quality training data and cannot be updated instantly. In production, Didi therefore adopts a traditional three‑stage correction framework: error detection, candidate recall, and candidate ranking.

2.2 Technical Solution

2.2.1 Basic Scheme

The correction system comprises three modules: error detection, candidate recall, and candidate ranking. The final corrected output is consumed by downstream pipelines.

2.2.2 Error Detection

Detection focuses on phonetic substitution errors. A confusion set is built from aligned ASR–reference pairs, and an Aho–Corasick (AC) automaton is constructed from hot‑word syllable sequences. The AC automaton efficiently matches the input syllable stream to retrieve hot‑words.
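To make the detection step concrete, here is a minimal sketch of an Aho–Corasick automaton operating on pinyin syllable sequences rather than characters. The class name, the toy hot‑words, and the syllable spellings are illustrative assumptions, not Didi's production code.

```python
from collections import deque

class SyllableAC:
    """Minimal Aho-Corasick automaton over pinyin syllable sequences.

    Hot-words are indexed by their syllable spelling, so phonetically
    confusable ASR output can be detected by scanning its syllable stream.
    """

    def __init__(self):
        self.goto = [{}]   # state -> {syllable: next state}
        self.fail = [0]    # state -> failure state
        self.out = [[]]    # state -> hot-words ending at this state

    def add(self, syllables, word):
        state = 0
        for s in syllables:
            if s not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][s] = len(self.goto) - 1
            state = self.goto[state][s]
        self.out[state].append(word)

    def build(self):
        """BFS construction of failure links (depth-1 states fail to root)."""
        q = deque(self.goto[0].values())
        while q:
            state = q.popleft()
            for s, nxt in self.goto[state].items():
                q.append(nxt)
                f = self.fail[state]
                while f and s not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(s, 0)
                self.out[nxt] += self.out[self.fail[nxt]]

    def match(self, syllables):
        """Return (end index, hot-word) pairs found in the syllable stream."""
        state, hits = 0, []
        for i, s in enumerate(syllables):
            while state and s not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(s, 0)
            for word in self.out[state]:
                hits.append((i, word))
        return hits
```

Because the automaton matches the confusion‑expanded syllable stream in a single pass, detection cost stays linear in the utterance length regardless of how many hot‑words are registered.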

2.2.3 Candidate Recall

Candidates are recalled by matching the input syllable sequence against leaf nodes of the AC automaton. Additionally, an n‑gram pinyin fallback mechanism ensures recall coverage for rare deletion errors.
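The fallback can be pictured as fuzzy overlap on pinyin n‑grams: when exact automaton matching misses (e.g., a syllable was dropped by the ASR), hot‑words whose syllable n‑grams still overlap the input are recalled. The function names, the Jaccard scoring, and the sample hot‑word lexicon below are assumptions for illustration.

```python
def syllable_ngrams(syllables, n=2):
    """All contiguous n-grams of a pinyin syllable sequence, as a set."""
    return {tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)}

def ngram_recall(input_syllables, hotword_pinyin, n=2, threshold=0.2):
    """Fallback recall: return hot-words whose pinyin n-grams overlap the
    input's n-grams (Jaccard similarity), catching e.g. deletion errors
    that an exact syllable match would miss."""
    query = syllable_ngrams(input_syllables, n)
    candidates = []
    for word, pinyin in hotword_pinyin.items():
        grams = syllable_ngrams(pinyin, n)
        if not grams or not query:
            continue
        sim = len(query & grams) / len(query | grams)
        if sim >= threshold:
            candidates.append((word, round(sim, 2)))
    return sorted(candidates, key=lambda x: -x[1])
```

For instance, if the ASR drops the middle syllable of "虹桥机场" (hong‑qiao‑ji‑chang), the surviving bigram overlap still recalls it while unrelated entries score zero.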

2.2.4 Candidate Ranking

An RNN language model (converted to a WFST) or an n‑gram language model scores the candidate sentences, and Viterbi decoding selects the best hypothesis. This yielded a 2% absolute character‑accuracy gain in intelligent customer‑service projects.
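As a simplified stand‑in for the WFST‑based decoding described above, the ranking idea can be sketched with a toy add‑one‑smoothed bigram language model that scores each candidate sentence and keeps the most fluent one. The class, the smoothing choice, and the tiny corpus are illustrative assumptions, not the production model.

```python
import math
from collections import Counter

class BigramLM:
    """Toy bigram language model with add-one smoothing, for reranking
    correction candidates by fluency."""

    def __init__(self, corpus):
        # corpus: list of token lists (here, Chinese characters as tokens)
        self.uni = Counter()
        self.bi = Counter()
        for sent in corpus:
            toks = ["<s>"] + sent + ["</s>"]
            self.uni.update(toks)
            self.bi.update(zip(toks, toks[1:]))
        self.vocab = len(self.uni)

    def logprob(self, sent):
        toks = ["<s>"] + sent + ["</s>"]
        lp = 0.0
        for a, b in zip(toks, toks[1:]):
            # add-one smoothing keeps unseen bigrams finite but unlikely
            lp += math.log((self.bi[(a, b)] + 1) / (self.uni[a] + self.vocab))
        return lp

def rank(candidates, lm):
    """Return the candidate sentence the language model scores highest."""
    return max(candidates, key=lm.logprob)
```

A homophone candidate that the corpus never supports ("鸡场") loses to the fluent one ("机场"), which is exactly the substitution‑error behavior the ranking stage relies on.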

2.3 Applications

The correction system has been deployed in multiple Didi services, such as the driver voice assistant, improving character accuracy by 1% and overall interaction success by 7.24%.

2.4 Future Outlook

Future work includes leveraging richer ASR outputs (e.g., lattices), incorporating additional signals (acoustic features, user profiles) into ranking, and exploring end‑to‑end approaches such as BERT‑based seq2seq correction.

3. Intent Classification

3.1 Overview

Two main approaches are discussed: (1) end‑to‑end classifiers (FastText, TextCNN, BERT classifier) suitable for task‑oriented intents, and (2) retrieval‑based methods (BM25, DSSM, Siamese‑LSTM, Sentence‑BERT) for question‑answering intents.

3.2 Technical Solutions

3.2.1 Text Classification

For stable task‑oriented scenarios, Didi uses end‑to‑end models. TextCNN is optimized by treating whole entities as single tokens, adding pinyin embeddings, combining pretrained and randomly initialized character embeddings, and using k‑max pooling.
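Of these tweaks, k‑max pooling is the easiest to show in isolation: instead of keeping only the single strongest activation per convolution filter, it keeps the k strongest while preserving their temporal order. A minimal, framework‑free sketch (the function name and toy activations are assumptions):

```python
def k_max_pooling(feature_map, k):
    """Keep the k largest activations per filter channel, preserving
    their original temporal order, as in TextCNN-style k-max pooling.

    feature_map: list of channels, each a list of activations over time.
    """
    pooled = []
    for channel in feature_map:
        # indices of the k largest values, then re-sorted by position
        top_idx = sorted(sorted(range(len(channel)),
                                key=lambda i: channel[i])[-k:])
        pooled.append([channel[i] for i in top_idx])
    return pooled
```

Keeping several ordered activations lets a filter report that its pattern fired more than once (and where), which plain max pooling throws away.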

3.2.2 Text Matching

Retrieval‑based intent classification consists of a recall stage and a ranking stage. Recall uses lexical similarity (edit distance, TF‑IDF, BM25) and representation‑based methods (sentence embeddings from BERT with CLS/mean/max pooling). Ranking employs interaction‑based Siamese‑BERT models trained with regression or classification objectives.
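For the lexical side of recall, here is a compact BM25 sketch over tokenized questions. The character‑level tokenization and the toy FAQ documents are assumptions for illustration; k1 and b are the standard BM25 hyper‑parameters.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25.

    query: list of tokens; docs: list of token lists.
    Returns one score per document (higher = more relevant).
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter()                      # document frequency per token
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

In the QA‑intent setting, the top‑scoring standard questions from this stage are then handed to the interaction‑based ranker.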

4. Named Entity Recognition (NER)

Didi employs Bi‑GRU‑CRF and BERT‑CRF models for Chinese NER. The Bi‑GRU encoder captures forward and backward context, while the CRF layer models label‑transition constraints. BERT‑CRF replaces the Bi‑GRU encoder with pretrained BERT and achieves higher F1 scores even with limited training data.
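The CRF layer's contribution is easiest to see at decoding time: Viterbi search combines per‑token emission scores (from the Bi‑GRU or BERT encoder) with learned transition scores that can, for example, heavily penalize an `I-LOC` tag that does not follow `B-LOC`. A minimal sketch, with hypothetical scores standing in for real model outputs:

```python
def crf_viterbi(emissions, transitions, labels):
    """Viterbi decoding for a linear-chain CRF.

    emissions: per token, a dict label -> score (missing labels score 0).
    transitions: dict (prev_label, label) -> score (missing pairs score 0).
    Returns the highest-scoring label sequence.
    """
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, ptr = {}, {}
        for y in labels:
            # best previous label to transition from
            p = max(labels,
                    key=lambda q: best[-1][q] + transitions.get((q, y), 0.0))
            scores[y] = (best[-1][p] + transitions.get((p, y), 0.0)
                         + emissions[t].get(y, 0.0))
            ptr[y] = p
        best.append(scores)
        back.append(ptr)
    y = max(labels, key=lambda l: best[-1][l])   # best final label
    path = [y]
    for ptr in reversed(back):                   # backtrack
        y = ptr[y]
        path.append(y)
    return list(reversed(path))
```

Even when a token's emission mildly prefers `O`, a strong `B-LOC → I-LOC` transition can flip the decoded tag, which is exactly why the CRF layer improves entity‑boundary consistency over per‑token argmax.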

5. Summary

NLP algorithms have broad applications in Didi's speech‑interaction scenarios. Besides the modules described above, other techniques such as dialogue management, word segmentation, coreference resolution, data augmentation, and clustering are also employed. Didi's open‑source Delta platform (https://github.com/didi/delta) supports end‑to‑end training, testing, and deployment of these models.

References

[1] Zhang et al., "Spelling Error Correction with Soft-Masked BERT", 2020.
[2] Cheng et al., "SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check", 2020.
[3] Joulin et al., "Bag of Tricks for Efficient Text Classification", 2017.
[4] Kim, "Convolutional Neural Networks for Sentence Classification", EMNLP 2014.
[5] Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019.
[6] Huang et al., "Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data", 2013.
[7] Mueller & Thyagarajan, "Siamese Recurrent Architectures for Learning Sentence Similarity", 2016.
[8] Reimers & Gurevych, "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks", 2019.
[9] Jiao et al., "Chinese Lexical Analysis with Deep Bi-GRU-CRF Network", 2018.
[10] Yang, "BERT Meets Chinese Word Segmentation", 2019.
