How HuoLaLa Built a Custom ASR System to Boost Accuracy and Cut Costs
This article details HuoLaLa's development of an in‑house Automatic Speech Recognition system, covering its architecture, VAD optimization, language‑model and hot‑word enhancements, punctuation restoration, task and resource scheduling, and the resulting improvements in accuracy and cost efficiency.
Background
HuoLaLa generates over 300,000 hours of audio daily, including call recordings, trip recordings, and real‑time audio. To make this data usable for downstream services such as governance, security, risk control, and customer support, the speech must first be transcribed by Automatic Speech Recognition (ASR). Third‑party ASR services fall short on three counts: poor accuracy on domain vocabulary, high procurement costs, and no room for custom development.
Solution
We therefore built an in‑house ASR system that delivers higher recognition accuracy for freight‑specific terminology, eliminates fixed procurement expenses, and builds core technology capabilities for future scenarios.
The service architecture consists of four layers:
Data layer: sources of audio recordings.
Application layer: ASR use cases such as risk control and intelligent customer service.
Platform layer: core ASR services (real‑time and offline) and the internal ML platform for model inference.
Infrastructure layer: databases, caches, message queues, object storage, etc.
ASR Algorithm Implementation
The ASR pipeline consists of VAD, an acoustic model, a language model, hot‑word enhancement, an attention decoder, beam search, and punctuation restoration. The overall processing flow is illustrated below.
Effective Audio Detection (VAD)
We use Voice Activity Detection (VAD) to filter out non‑speech segments, improving both recognition quality and speed. After evaluating WebRTC‑VAD, Silero‑VAD, and LSTM‑VAD, we selected WebRTC‑VAD for its superior performance in noisy environments.
The VAD model splits the signal into six sub‑bands, models each with a Gaussian mixture (speech and noise), computes log‑likelihood ratios, and decides speech presence based on thresholds.
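The per‑band likelihood‑ratio decision described above can be sketched as follows. This is a simplified illustration, not WebRTC's actual implementation: the mixture parameters, the thresholds, and the names `log_mixture_pdf` and `is_speech` are all hypothetical, and the input is assumed to be one log‑energy value per sub‑band.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def log_mixture_pdf(x, components):
    """Log density of a Gaussian mixture given (weight, mean, std) tuples."""
    logs = [math.log(w) + gaussian_log_pdf(x, m, s) for w, m, s in components]
    peak = max(logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(v - peak) for v in logs))

def is_speech(band_features, speech_gmms, noise_gmms, local_thr, global_thr):
    """Flag a frame as speech if any band's log-likelihood ratio exceeds the
    local threshold, or if the summed ratio exceeds the global threshold."""
    llrs = [
        log_mixture_pdf(x, sp) - log_mixture_pdf(x, no)
        for x, sp, no in zip(band_features, speech_gmms, noise_gmms)
    ]
    return any(r > local_thr for r in llrs) or sum(llrs) > global_thr

# Hypothetical parameters: six sub-bands sharing the same two mixtures.
SPEECH = [[(0.5, 8.0, 1.0), (0.5, 10.0, 1.5)]] * 6
NOISE = [[(0.5, 3.0, 1.0), (0.5, 4.0, 1.5)]] * 6

loud_frame = [9.0] * 6   # features near the speech means
quiet_frame = [3.0] * 6  # features near the noise means
speech_detected = is_speech(loud_frame, SPEECH, NOISE, local_thr=2.0, global_thr=6.0)
noise_detected = is_speech(quiet_frame, SPEECH, NOISE, local_thr=2.0, global_thr=6.0)
```

Combining a per‑band test with a whole‑frame test lets a single strongly voiced band trigger detection while still catching frames where the evidence is spread thinly across bands.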
Encoding and Decoding
We adopt the open‑source Wenet model for the encoder‑decoder pipeline. Audio is encoded into feature vectors, decoded by CTC to produce a probability matrix, then refined using language model, hot‑word enhancement, beam search, and an attention decoder to obtain the final transcription.
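To make the CTC step concrete, here is a minimal sketch of greedy CTC decoding over a probability matrix. It is deliberately simpler than the pipeline described above, which refines CTC output with beam search, a language model, and an attention decoder; the function name, the toy vocabulary, and the choice of index 0 for the blank symbol are assumptions for illustration.

```python
def ctc_greedy_decode(prob_matrix, vocab, blank_id=0):
    """Pick the most likely symbol per frame, merge consecutive repeats,
    then drop blanks."""
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in prob_matrix]
    output = []
    previous = None
    for idx in best_ids:
        if idx != previous and idx != blank_id:
            output.append(vocab[idx])
        previous = idx
    return "".join(output)

# Toy example: rows are frames, columns are symbol probabilities.
vocab = ["<blank>", "尾", "板"]
frames = [
    [0.1, 0.8, 0.1],  # 尾
    [0.1, 0.8, 0.1],  # 尾 again, merged with the previous frame
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],  # 板
]
decoded = ctc_greedy_decode(frames, vocab)
```

A blank between two identical symbols keeps them distinct, which is how CTC represents genuinely repeated characters.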
Training a domain‑specific language model with business data reduced the character error rate (CER) from 0.406 to 0.347.
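For readers unfamiliar with the metric, CER is the character‑level edit distance between reference and hypothesis divided by the reference length. The sketch below uses the standard Levenshtein recurrence; the function names and the sample strings are illustrative and not taken from the original evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate relative to the reference transcript."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)

# One wrong character out of four gives a CER of 0.25.
sample_cer = cer("高栏货车", "高兰货车")
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.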
Language Model
We employ an n‑gram language model (bigram in practice) to estimate sentence probabilities, further lowering CER to 0.314 after integration with other modules.
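A bigram model scores a sentence as the product of the conditional probabilities of each token given the one before it. The sketch below is a minimal version under stated assumptions: character‑level tokens, add‑one smoothing, and a two‑sentence toy corpus, none of which are confirmed by the source. The class name `BigramLM` is hypothetical.

```python
import math
from collections import Counter

class BigramLM:
    """Character-level bigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + list(sentence) + ["</s>"]
            vocab.update(tokens)
            self.context_counts.update(tokens[:-1])  # every token that has a successor
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab)

    def log_prob(self, sentence):
        """Sum of log P(curr | prev) over the padded token sequence."""
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        total = 0.0
        for prev, curr in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(prev, curr)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total

lm = BigramLM(["尾板货车", "高栏货车"])
seen_score = lm.log_prob("尾板货车")      # every bigram was seen in training
scrambled_score = lm.log_prob("尾车货板")  # mostly unseen bigrams
```

During decoding, such a score is typically combined with the acoustic score so that hypotheses resembling in‑domain text are preferred.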
Hot‑word Enhancement
Domain‑specific terms such as “尾板” (tail lift) and “高栏” (high‑sided truck) are boosted by adding them to the language model with higher weights, raising the hot‑word recognition rate from 0.559 to 0.718.
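One simple way to picture hot‑word boosting is as an additive bonus applied when ranking candidate transcriptions. The sketch below approximates the idea at the n‑best rescoring stage rather than inside the language model as the text describes; the function name, the bonus values, and the candidate scores are all hypothetical.

```python
def pick_best(hypotheses, hotword_bonus):
    """Return the text of the highest-scoring hypothesis after adding a bonus
    for every occurrence of a hot word.

    hypotheses: list of (text, log_score) pairs from the decoder.
    hotword_bonus: mapping from hot word to the bonus added per occurrence.
    """
    def boosted(item):
        text, score = item
        bonus = sum(text.count(word) * value for word, value in hotword_bonus.items())
        return score + bonus

    return max(hypotheses, key=boosted)[0]

candidates = [
    ("我要尾班车", -4.0),  # acoustically likelier, but the wrong term
    ("我要尾板车", -4.5),  # contains the hot word 尾板
]
without_boost = pick_best(candidates, {})
with_boost = pick_best(candidates, {"尾板": 1.0, "高栏": 1.0})
```

The bonus has to be tuned: too small and near‑homophones still win, too large and hot words get hallucinated into unrelated audio.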
Punctuation Model
Initially we used PaddleSpeech models (ernie_linear_p3_wudao) for punctuation restoration, but switched to a lightweight LSTM model framed as a sequence labeling task (O, S‑Comma, S‑Period, S‑Question). The model achieves 0.88 precision for commas, 0.97 for periods, and 0.92 for question marks.
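The sequence‑labeling framing can be illustrated without the model itself: each token receives one label, and the label says which punctuation mark, if any, follows that token. The sketch below assumes that "follows the token" convention and full‑width Chinese punctuation, neither of which is stated in the source; it also shows how per‑label precision is computed. Function names are hypothetical.

```python
PUNCTUATION = {"O": "", "S-Comma": "，", "S-Period": "。", "S-Question": "？"}

def apply_punctuation(tokens, labels):
    """Rebuild text by appending each token's predicted punctuation mark."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return "".join(tok + PUNCTUATION[lab] for tok, lab in zip(tokens, labels))

def label_precision(predicted, gold, label):
    """Fraction of positions predicted as `label` whose gold tag is also `label`.
    Returns None when the label was never predicted."""
    gold_at_predictions = [g for p, g in zip(predicted, gold) if p == label]
    if not gold_at_predictions:
        return None
    return sum(g == label for g in gold_at_predictions) / len(gold_at_predictions)

restored = apply_punctuation(list("你好在吗"), ["O", "S-Comma", "O", "S-Question"])

predicted = ["O", "S-Comma", "S-Comma", "S-Period"]
gold = ["O", "S-Comma", "O", "S-Period"]
comma_precision = label_precision(predicted, gold, "S-Comma")
```

Framing the task this way keeps the output aligned one‑to‑one with the input tokens, so a small recurrent tagger is sufficient.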
ASR Task and Resource Scheduling
At HuoLaLa, the core problem is matching ASR tasks with available transcription resources. We designed two scheduling systems, one for task dispatch and one for resource (route) allocation, to ensure high throughput, fault tolerance, and efficient utilization.
Routes (model instances) are allocated per audio fragment, kept busy until transcription completes, and released thereafter. The scheduler enforces exclusive route usage, avoids idle routes during peak periods, and includes automatic fault‑recovery mechanisms.
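The allocate, hold, and release cycle above can be sketched as a small in‑memory pool. This is an illustration only: the class name `RoutePool`, the lease‑timeout approach to fault recovery, and the injectable clock are assumptions, and a production scheduler would presumably keep this state in shared storage rather than in process memory.

```python
import time

class RoutePool:
    """Hands out each route to at most one task at a time and reclaims routes
    whose lease has expired (for example, because a worker crashed)."""

    def __init__(self, route_ids, lease_seconds=60.0, clock=time.monotonic):
        self._idle = list(route_ids)
        self._busy = {}  # route_id -> (task_id, lease deadline)
        self._lease = lease_seconds
        self._clock = clock

    def acquire(self, task_id):
        """Return an idle route for the task, or None if all routes are busy."""
        self.reclaim_expired()
        if not self._idle:
            return None
        route = self._idle.pop(0)
        self._busy[route] = (task_id, self._clock() + self._lease)
        return route

    def release(self, route_id):
        """Mark a route idle again once transcription has finished."""
        if self._busy.pop(route_id, None) is not None:
            self._idle.append(route_id)

    def reclaim_expired(self):
        """Free routes past their deadline; return the affected task ids so
        the caller can requeue them."""
        now = self._clock()
        expired = [(r, task) for r, (task, deadline) in self._busy.items() if deadline <= now]
        for route, _ in expired:
            del self._busy[route]
            self._idle.append(route)
        return [task for _, task in expired]
```

A caller would request a route per audio fragment, run transcription, and call `release` in a `finally` block; tasks returned by `reclaim_expired` go back onto the dispatch queue.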
Business Effects
After optimization, the self‑built ASR reaches a CER of 0.314, compared with 0.282 for the third‑party service, and a hot‑word recognition rate of 0.888, compared with 0.691. The gain in domain‑term recognition enables downstream services such as accurate order‑cancellation responsibility detection.
Conclusion and Outlook
This article presents the end‑to‑end practice of ASR at HuoLaLa, covering VAD optimization, language‑model and hot‑word enhancements, punctuation modeling, and task‑resource scheduling. The custom ASR raises hot‑word recognition by 19.7 percentage points over the third‑party service (0.691 to 0.888) and reduces costs, with future work aimed at further performance gains and broader deployment.
References
ITU, Coding of Speech at 8 kbit/s Using Conjugate Structure Algebraic Code‑Excited Linear Prediction, 1996.
Y. D. Cho, K. Al‑Naimi, and A. Kondoz, “Improved statistical voice activity detection based on a smoothed statistical likelihood ratio,” Proc. IEEE ICASSP’01, 2001.
J. Kim et al., “Vowel based voice activity detection with LSTM recurrent neural network,” 8th Int. Conf. Signal Process. Syst., 2016.
Z. Yao et al., “WeNet: Production oriented streaming and non‑streaming end‑to‑end speech recognition toolkit,” Interspeech 2021.
B. Zhang et al., “Wenet 2.0: More productive end‑to‑end speech recognition toolkit,” arXiv:2203.15455, 2022.
M. Mohri, “Finite‑state transducers in language and speech processing,” Computational Linguistics, 1997.
H. Zhang et al., “PaddleSpeech: An Easy‑to‑Use All‑in‑One Speech Toolkit,” NAACL Demo 2022.
O. Tilk and T. Alumae, “LSTM for punctuation restoration in speech transcripts,” Interspeech 2015.
