Design and Optimization of a Kaldi‑Based Speech Recognition Backend at 58.com
This article details the evolution from the initial Kaldi‑based speech recognition architecture (version 1.0) at 58.com to a re‑engineered version 2.0. It covers the business background, service components, and identified shortcomings, then walks through performance, concurrency, GPU, I/O, GC, and dispatch optimizations that dramatically improved resource utilization, latency, and reliability for large‑scale voice processing.
Speech recognition converts audio signals into text, and two main approaches dominate today: traditional Kaldi‑based pipelines and modern end‑to‑end deep‑learning models such as ESPnet and WeNet. Kaldi offers a comprehensive framework but lacks native support for popular deep‑learning ecosystems, whereas end‑to‑end models generally achieve better recognition accuracy and are simpler to train and deploy.
58.com initially built a self‑developed speech recognition engine on Kaldi (architecture 1.0) to replace costly third‑party services. The early system comprised gateway, audio parsing, Kaldi decoding, silence detection, speaker separation, and post‑processing services, but suffered from high resource consumption, uneven utilization, latency, and limited reliability.
To address these issues, the team launched architecture 2.0, focusing on two major directions: (1) enhancing the Kaldi decoding service for concurrency and GPU support, and (2) refactoring backend services for better scalability. New components such as message scheduling, data reporting, and compensation services were added, while existing services were split and optimized.
Kaldi decoding optimizations included wrapping the decoder as a gRPC service, introducing a synchronized decoder pool for concurrent processing, and enabling CUDA‑accelerated decoding with careful resource binding and thread‑safe callbacks. The optimal number of decoders was determined empirically based on latency targets and hardware capabilities.
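The decoder-pool idea above can be illustrated with a minimal sketch. This is not 58.com's actual implementation; `FakeDecoder` is a hypothetical stand-in for a wrapped Kaldi decoder, and the pattern shown is simply a blocking queue that hands out non-thread-safe decoder instances for exclusive use, capping concurrency at the empirically chosen pool size:

```python
import queue


class FakeDecoder:
    """Hypothetical stand-in for a Kaldi decoder instance; the real
    service wraps Kaldi's C++ decoder behind a gRPC interface."""

    def decode(self, audio: bytes) -> str:
        return f"transcript({len(audio)} bytes)"


class DecoderPool:
    """Fixed-size pool of decoder instances. Kaldi decoders are not
    thread-safe, so each request checks one out for exclusive use
    and returns it when decoding finishes."""

    def __init__(self, decoder_factory, size: int):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(decoder_factory())

    def decode(self, audio: bytes, timeout=None) -> str:
        # Blocks when all decoders are busy, so concurrency never
        # exceeds the pool size chosen for the latency target.
        decoder = self._pool.get(timeout=timeout)
        try:
            return decoder.decode(audio)
        finally:
            self._pool.put(decoder)  # always return, even on error
```

As the article notes, the pool size is not fixed by the pattern itself: it is tuned empirically against latency targets and the GPU/CPU resources available to each machine.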
Backend service improvements covered multi‑level caching to decouple processing stages, extensive I/O reductions via caching and batch interfaces, migration from G1 to ZGC garbage collector to cut stop‑the‑world pauses, and load‑aware message dispatch that balances traffic across machines. Asynchronous handling of high‑cost modules further lowered response times.
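The load-aware dispatch described above can be sketched as a least-loaded selection policy. The worker fields and the specific policy here are assumptions for illustration, not 58.com's actual scheduler:

```python
def pick_worker(workers: list[dict]) -> dict:
    """Least-loaded dispatch: route the next message to the machine
    with the lowest in-flight load relative to its capacity, so
    traffic balances across heterogeneous machines."""
    return min(workers, key=lambda w: w["in_flight"] / w["capacity"])


# Example: gpu-2 has the most spare relative capacity, so it wins.
workers = [
    {"name": "gpu-1", "in_flight": 8, "capacity": 10},
    {"name": "gpu-2", "in_flight": 2, "capacity": 10},
    {"name": "gpu-3", "in_flight": 5, "capacity": 5},
]
chosen = pick_worker(workers)
```

A real message scheduler would refresh `in_flight` from the data-reporting service and fall back to round-robin when load reports are stale, but the core balancing decision reduces to a comparison like this one.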
Performance results after the migration showed GPU utilization rising from ~45% to ~75%, GPU resource consumption dropping by 62%, average latency decreasing by 88%, and the 99.9th percentile latency (TP999) improving by 98%.
The article concludes that the rapid rollout of architecture 1.0 met early business needs, and the systematic redesign to architecture 2.0 delivered substantial gains in efficiency, scalability, and reliability for large‑scale speech recognition at 58.com.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.