Self-Learning Platform for Speech Recognition Model Optimization at DiDi
DiDi’s self‑learning ASR platform lets non‑technical users upload business data and automatically train, test, and deploy models with semi‑supervised learning, hot‑word updates, and LSTM rescoring, creating a closed‑loop pipeline that boosted vehicle voice‑interaction accuracy from around 80% to over 95% within months.
In the connected‑vehicle era, visual and tactile interactions are heavily constrained, making voice interaction the most natural way to control a car. DiDi leverages voice assistants on vehicle head‑units and smartphones to enable hands‑free vehicle control, information query, navigation, and entertainment, thereby reducing driver distraction and improving safety.
To accelerate and stabilize the improvement of speech‑recognition accuracy, DiDi built a self‑learning platform for ASR models. The platform allows non‑technical users to participate in model optimization and enables business data to flow back into the training pipeline, achieving a closed‑loop iteration of models.
Business Background: As data volume, computing power, and deep‑learning techniques advance, ASR accuracy keeps improving and its application scope keeps expanding. Interactive scenarios (e.g., voice assistants on vehicle head‑units or phones) demand high accuracy, while non‑interactive uses (e.g., trip recording for safety, customer‑service quality inspection) also benefit from reliable transcription.
Although commercial ASR services can reach ~95% word‑accuracy on generic tasks, they often fall short on domain‑specific vocabularies (proper nouns, technical terms). Manual model tuning by a few engineers can take weeks, which is too slow for urgent business needs.
Platform Architecture: The platform consists of a web UI (Node.js/Ant Design) and RESTful APIs (Django REST Framework) that decouple the front end from the back end. Users can upload text corpora or pull data from business data warehouses (e.g., Hive) to trigger automated training, testing, and deployment pipelines without restarting the ASR service.
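The upload‑to‑pipeline flow can be sketched as a plain function behind such an API (the field names, stage list, and validation rules here are illustrative assumptions, not DiDi's actual contract):

```python
import json

PIPELINE_STAGES = ("train", "test", "deploy")  # assumed stage order

def handle_corpus_upload(request_body: str) -> dict:
    """Validate an uploaded text corpus and enqueue the automated pipeline.
    Mirrors the REST contract: JSON in, JSON status out, no service restart."""
    payload = json.loads(request_body)
    corpus = payload.get("corpus", [])
    if not corpus:
        return {"status": "rejected", "reason": "empty corpus"}
    # Deduplicate and strip lines before they enter the LM training pipeline.
    cleaned = sorted({line.strip() for line in corpus if line.strip()})
    return {"status": "queued", "stages": list(PIPELINE_STAGES),
            "corpus_size": len(cleaned)}

resp = handle_corpus_upload(json.dumps({"corpus": ["去机场", "去机场", " 导航回家 "]}))
```

In a Django REST Framework deployment this logic would live in a serializer plus a view; the sketch keeps only the validate‑then‑enqueue shape.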
Key backend features include:
Configuration persistence and fine‑grained permission control via ORM‑based storage (MongoDB/MySQL), supporting user‑project‑resource hierarchies and operation logging for auditability.
Task scheduling and asynchronous processing using Celery and RabbitMQ: short‑latency sync tasks are handled immediately, while long‑running training or inference tasks run asynchronously, with periodic tasks for regular model updates.
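The platform implements this split with Celery and RabbitMQ; the same sync/async dispatch pattern can be sketched with the standard library alone (task names and handlers below are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Long-running jobs (training, batch inference) go to a worker pool;
# short jobs run inline and return their result immediately.
executor = ThreadPoolExecutor(max_workers=2)

LONG_RUNNING = {"train_acoustic_model", "batch_decode"}  # assumed task registry

def run_task(name, func, *args):
    """Dispatch: asynchronous for long-running task types, synchronous otherwise."""
    if name in LONG_RUNNING:
        return executor.submit(func, *args)   # returns a Future (task-id analogue)
    return func(*args)                        # immediate result

# Illustrative handlers
def normalize_hotwords(words):
    return sorted(set(w.strip() for w in words))

def train_acoustic_model(corpus_size):
    return f"model trained on {corpus_size} utterances"

sync_result = run_task("normalize_hotwords", normalize_hotwords, ["滴滴", "导航 ", "导航"])
async_handle = run_task("train_acoustic_model", train_acoustic_model, 10_000)
```

With Celery the `Future` becomes a task result backed by RabbitMQ, and the periodic model updates map onto scheduled (beat) tasks.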
Algorithmic Applications:
Acoustic Model Optimization: To reduce dependence on large labeled datasets, the platform adopts semi‑supervised learning (SSL). Unlabeled audio is pseudo‑labeled by existing models and fed back into training, improving robustness with minimal manual annotation.
Data Recall Module: Multiple recall models and a discriminative strategy select high‑quality audio and generate pseudo‑labels, ensuring diverse and reliable training samples.
Model Training Module: Iterative training cycles (weekly or bi‑weekly) incorporate recalled data, with on‑the‑fly decoding to evaluate model quality. The best models are archived and become the base for the next iteration.
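The recall‑and‑retrain loop described above can be sketched as one self‑training round (the confidence threshold, decoder interface, and toy model/trainer are illustrative assumptions, not DiDi's implementation):

```python
# Semi-supervised self-training sketch: pseudo-label unlabeled audio with the
# current model, keep only high-confidence hypotheses, retrain, repeat.
CONF_THRESHOLD = 0.9  # assumed recall threshold

def recall_pseudo_labels(model, unlabeled):
    """Data recall: keep only utterances the model transcribes confidently."""
    recalled = []
    for utt in unlabeled:
        text, conf = model(utt)          # stand-in for ASR decoding
        if conf >= CONF_THRESHOLD:
            recalled.append((utt, text))
    return recalled

def self_training_round(model, labeled, unlabeled, retrain):
    """One weekly/bi-weekly iteration: augment data, retrain, return new model."""
    pseudo = recall_pseudo_labels(model, unlabeled)
    return retrain(labeled + pseudo)

# Toy demonstration with a fake model and trainer
fake_model = lambda utt: (utt.upper(), 0.95 if len(utt) > 3 else 0.5)
fake_retrain = lambda data: lambda utt: (utt.upper(), 0.99)

new_model = self_training_round(fake_model, [("hello", "HELLO")],
                                ["world", "ok"], fake_retrain)
```

In production the recall step would combine multiple recall models and a discriminative filter rather than a single confidence cut.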
Language Model Optimization:
Hot‑word model with hot‑update capability: a small domain‑specific LM is merged with a large generic LM, allowing rapid incorporation of new keywords.
Rescoring using an LSTM‑based language model: N‑best ASR hypotheses are re‑ranked based on LSTM scores, yielding a ~5% accuracy gain over pure beam‑search selection.
Text data back‑flow: Periodic jobs pull high‑confidence transcriptions from production to continuously fine‑tune the LM, maintaining robustness to evolving business scenarios.
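The N‑best rescoring step above can be sketched as follows (the LSTM scorer is replaced by a stand‑in function, and the interpolation weight is an illustrative assumption):

```python
def rescore_nbest(hypotheses, lm_score, lm_weight=0.5):
    """Re-rank N-best ASR hypotheses by interpolating the first-pass decoder
    score with a (neural) LM score; the highest combined score wins."""
    def combined(hyp):
        text, decoder_score = hyp
        return (1 - lm_weight) * decoder_score + lm_weight * lm_score(text)
    return max(hypotheses, key=combined)

# Stand-in for an LSTM LM: favors hypotheses containing a domain keyword.
def toy_lm_score(text):
    return 1.0 if "navigate" in text else 0.0

nbest = [("play music", 0.80), ("navigate home", 0.78)]
best = rescore_nbest(nbest, toy_lm_score)
```

Here the second hypothesis wins despite a lower first‑pass score, which is exactly how LM rescoring recovers domain‑specific phrases the beam search under‑ranks.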
Product Deployment: In the DiDi Kuā project, weekly back‑flow from MySQL improved ASR word‑accuracy from 80% to 90% within 2–3 months. In the D1 custom‑car project, iterative post‑processing and self‑training raised voice‑interaction success rates from 80% to over 95%.
The platform demonstrates how a self‑learning ASR system can rapidly adapt to domain‑specific vocabularies, reduce reliance on scarce labeled data, and deliver measurable improvements in real‑world voice‑interaction products.
Didi Tech
Official Didi technology account