Design and Optimization of a Kaldi‑Based Speech Recognition Backend at 58.com
This article details the evolution from the initial Kaldi‑based speech recognition architecture (version 1.0) at 58.com to a re‑engineered version 2.0. It covers the business background, service components, and identified shortcomings, then walks through performance, concurrency, GPU, I/O, GC, and dispatch optimizations that dramatically improved resource utilization, latency, and reliability for large‑scale voice processing.
Speech recognition converts audio signals into text, and two main approaches dominate today: traditional Kaldi‑based pipelines and modern end‑to‑end deep‑learning models such as ESPnet and WeNet. Kaldi offers a comprehensive framework but lacks native support for popular deep‑learning ecosystems, whereas end‑to‑end models generally achieve better recognition accuracy and are simpler to train and deploy.
58.com initially built a self‑developed speech recognition engine on Kaldi (architecture 1.0) to replace costly third‑party services. The early system comprised gateway, audio parsing, Kaldi decoding, silence detection, speaker separation, and post‑processing services, but suffered from high resource consumption, uneven utilization, latency, and limited reliability.
To address these issues, the team launched architecture 2.0, focusing on two major directions: (1) enhancing the Kaldi decoding service for concurrency and GPU support, and (2) refactoring backend services for better scalability. New components such as message scheduling, data reporting, and compensation services were added, while existing services were split and optimized.
Kaldi decoding optimizations included wrapping the decoder as a gRPC service, introducing a synchronized decoder pool for concurrent processing, and enabling CUDA‑accelerated decoding with careful resource binding and thread‑safe callbacks. The optimal number of decoders was determined empirically based on latency targets and hardware capabilities.
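The decoder-pool idea above can be illustrated with a minimal sketch. This is not 58.com's actual implementation; `FakeDecoder` is a hypothetical stand-in for a wrapped Kaldi decoder, and the pattern shown is simply a blocking queue that hands out non-thread-safe decoder instances for exclusive use, capping concurrency at the empirically chosen pool size:

```python
import queue


class FakeDecoder:
    """Hypothetical stand-in for a Kaldi decoder instance; the real
    service wraps Kaldi's C++ decoder behind a gRPC interface."""

    def decode(self, audio: bytes) -> str:
        return f"transcript({len(audio)} bytes)"


class DecoderPool:
    """Fixed-size pool of decoder instances. Kaldi decoders are not
    thread-safe, so each request checks one out for exclusive use
    and returns it when decoding finishes."""

    def __init__(self, decoder_factory, size: int):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(decoder_factory())

    def decode(self, audio: bytes, timeout=None) -> str:
        # Blocks when all decoders are busy, so concurrency never
        # exceeds the pool size chosen for the latency target.
        decoder = self._pool.get(timeout=timeout)
        try:
            return decoder.decode(audio)
        finally:
            self._pool.put(decoder)  # always return, even on error
```

As the article notes, the pool size is not fixed by the pattern itself: it is tuned empirically against latency targets and the GPU/CPU resources available to each machine.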
Backend service improvements covered multi‑level caching to decouple processing stages, extensive I/O reductions via caching and batch interfaces, migration from G1 to ZGC garbage collector to cut stop‑the‑world pauses, and load‑aware message dispatch that balances traffic across machines. Asynchronous handling of high‑cost modules further lowered response times.
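The load-aware dispatch described above can be sketched as a least-loaded selection policy. The worker fields and the specific policy here are assumptions for illustration, not 58.com's actual scheduler:

```python
def pick_worker(workers: list[dict]) -> dict:
    """Least-loaded dispatch: route the next message to the machine
    with the lowest in-flight load relative to its capacity, so
    traffic balances across heterogeneous machines."""
    return min(workers, key=lambda w: w["in_flight"] / w["capacity"])


# Example: gpu-2 has the most spare relative capacity, so it wins.
workers = [
    {"name": "gpu-1", "in_flight": 8, "capacity": 10},
    {"name": "gpu-2", "in_flight": 2, "capacity": 10},
    {"name": "gpu-3", "in_flight": 5, "capacity": 5},
]
chosen = pick_worker(workers)
```

A real message scheduler would refresh `in_flight` from the data-reporting service and fall back to round-robin when load reports are stale, but the core balancing decision reduces to a comparison like this one.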
Performance results after the migration showed GPU utilization rising from ~45% to ~75%, GPU resource consumption dropping by 62%, average latency decreasing by 88%, and the 99.9th percentile latency (TP999) improving by 98%.
The article concludes that the rapid rollout of architecture 1.0 met early business needs, and the systematic redesign to architecture 2.0 delivered substantial gains in efficiency, scalability, and reliability for large‑scale speech recognition at 58.com.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.