
Design and Optimization of a Kaldi‑Based Speech Recognition Backend at 58.com

This article traces the evolution of 58.com's Kaldi-based speech recognition backend from its initial 1.0 architecture to a re-engineered version 2.0. It covers the business background, the service components, the shortcomings identified in 1.0, and a series of performance, concurrency, GPU, I/O, GC, and dispatch optimizations that dramatically improved resource utilization, latency, and reliability for large-scale voice processing.

58 Tech

Speech recognition converts audio signals into text, and two main approaches dominate today: traditional Kaldi‑based pipelines and modern end‑to‑end deep‑learning models such as ESPnet and WeNet. Kaldi offers a comprehensive framework but lacks native support for popular deep‑learning ecosystems, while end‑to‑end models provide better performance and easier deployment.

58.com initially built a self‑developed speech recognition engine on Kaldi (architecture 1.0) to replace costly third‑party services. The early system comprised gateway, audio parsing, Kaldi decoding, silence detection, speaker separation, and post‑processing services, but suffered from high resource consumption, uneven utilization, latency, and limited reliability.

To address these issues, the team launched architecture 2.0, focusing on two major directions: (1) enhancing the Kaldi decoding service for concurrency and GPU support, and (2) refactoring backend services for better scalability. New components such as message scheduling, data reporting, and compensation services were added, while existing services were split and optimized.

Kaldi decoding optimizations included wrapping the decoder as a gRPC service, introducing a synchronized decoder pool for concurrent processing, and enabling CUDA‑accelerated decoding with careful resource binding and thread‑safe callbacks. The optimal number of decoders was determined empirically based on latency targets and hardware capabilities.
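The synchronized decoder pool described above can be sketched as a fixed-size pool backed by a blocking queue: each request borrows a decoder, runs one utterance, and returns it, so concurrency is capped at the pool size and excess requests queue up naturally. This is a minimal illustration, not 58.com's actual code; the `Decoder` class here is a hypothetical stand-in for the real Kaldi decoder wrapper.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of a synchronized decoder pool: N concurrent requests share
// N pre-built decoders, with blocking back-pressure when all are busy.
class DecoderPool {
    // Hypothetical stand-in for the real Kaldi decoder wrapper.
    static class Decoder {
        private final int id;
        Decoder(int id) { this.id = id; }
        String decode(byte[] audio) {
            // A real implementation would call into Kaldi (e.g. via gRPC).
            return "transcript-from-decoder-" + id;
        }
    }

    private final BlockingQueue<Decoder> idle;

    DecoderPool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(new Decoder(i)); // build decoders once, up front
        }
    }

    // Blocks while all decoders are busy, bounding concurrency at the
    // pool size chosen from latency targets and hardware capacity.
    String decode(byte[] audio) throws InterruptedException {
        Decoder d = idle.take();
        try {
            return d.decode(audio);
        } finally {
            idle.put(d); // always return the decoder, even on failure
        }
    }
}
```

The pool size plays the role of the empirically tuned decoder count mentioned in the article: too small wastes GPU capacity, too large inflates per-request latency.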

Backend service improvements covered multi‑level caching to decouple processing stages, extensive I/O reductions via caching and batch interfaces, migration from G1 to ZGC garbage collector to cut stop‑the‑world pauses, and load‑aware message dispatch that balances traffic across machines. Asynchronous handling of high‑cost modules further lowered response times.
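The load-aware dispatch idea can be illustrated with a small sketch: workers periodically report their in-flight task count, and the scheduler routes each new audio message to the least-loaded worker rather than round-robin. The class and method names below are assumptions for illustration, not the article's actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of load-aware dispatch: route each message to the worker
// currently reporting the fewest in-flight tasks.
class LoadAwareDispatcher {
    private final Map<String, Integer> inFlightByWorker = new ConcurrentHashMap<>();

    // Workers call this periodically (or piggybacked on heartbeats).
    void report(String worker, int inFlight) {
        inFlightByWorker.put(worker, inFlight);
    }

    // Pick the least-loaded worker for the next message.
    String pick() {
        return inFlightByWorker.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no workers registered"));
    }
}
```

Compared with static round-robin, this keeps slow or overloaded machines from accumulating a queue while idle GPUs sit unused, which is the imbalance the article's dispatch optimization targets.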

Performance results after the migration showed GPU utilization rising from ~45% to ~75%, GPU resource consumption dropping by 62%, average latency decreasing by 88%, and the 99.9th percentile latency (TP999) improving by 98%.

The article concludes that the rapid rollout of architecture 1.0 met early business needs, and the systematic redesign to architecture 2.0 delivered substantial gains in efficiency, scalability, and reliability for large‑scale speech recognition at 58.com.

Tags: performance optimization · backend architecture · AI · GPU · speech recognition · Kaldi · WeNet
Written by 58 Tech

Official tech channel of 58.com, a platform for tech innovation, sharing, and communication.