Building a Self‑Developed Speech Recognition Engine at 58.com: From Team Formation to Production Deployment
This article details how a three‑person team at 58.com built a self‑developed speech recognition engine in less than a year, covering background, team formation, data annotation, model selection, engineering architecture, performance optimizations, deployment results, and future directions.
Most companies purchase third‑party speech recognition services, but the high cost and inflexibility of those services prompted 58.com to develop its own engine. Starting in November 2019 with two algorithm engineers and one backend engineer, the team built a system that eventually outperformed external vendors.
Background: Starting in 2018, 58.com deployed various voice‑based products (outbound robots, quality inspection, analysis platforms) using third‑party APIs for short‑phrase, real‑time, and file‑based recognition. As data volume grew into the millions of hours, external services became prohibitively expensive.
Team Building: Initial external recruitment proved difficult, so the team instead reallocated an NLP engineer and a backend engineer and hired a junior engineer with basic speech knowledge, forming a three‑person core team.
Data Annotation: High‑quality labeled audio (>98% accuracy) is essential. The team built a Wavesurfer.js‑based annotation platform, assembled a four‑person labeling crew, and outsourced large‑scale labeling to a third‑party vendor, completing 3,500 hours of annotated data by June 2020.
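A >98% accuracy bar on outsourced labels has to be enforced somehow. The article does not describe 58.com's QA process, but a common pattern is to spot‑check vendor transcripts against in‑house gold transcripts using character error rate; the sketch below (all names hypothetical) accepts an annotation batch only if character accuracy clears the threshold.

```python
# Hypothetical spot-check of outsourced transcripts against in-house gold
# labels: accept a batch only if aggregate character accuracy >= 98%.
# (Illustrative sketch; not 58.com's actual QA pipeline.)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_accuracy(gold: str, hyp: str) -> float:
    """1 - CER, clamped to [0, 1]."""
    if not gold:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(gold, hyp) / len(gold))

def batch_passes(pairs, threshold=0.98):
    """pairs: list of (gold_transcript, vendor_transcript) samples."""
    total_chars = sum(len(g) for g, _ in pairs)
    total_errors = sum(edit_distance(g, h) for g, h in pairs)
    return 1.0 - total_errors / total_chars >= threshold
```

Aggregating errors over the whole sample (rather than averaging per‑utterance accuracy) keeps short utterances from dominating the estimate.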
Algorithm Models: The engine uses the Kaldi framework with a Chain Model (HMM‑DNN) as the primary acoustic model, while also experimenting with ESPnet end‑to‑end models (Transformer + CTC). Language models, G2P conversion, and custom pronunciation dictionaries were built for domain‑specific vocabularies.
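Custom pronunciation dictionaries in Kaldi boil down to extending a `lexicon.txt` of "word phone‑sequence" lines with domain vocabulary. A minimal sketch of that merge step, assuming hand‑curated pronunciations stand in for real G2P output (the words and phone inventory below are illustrative, not from the article):

```python
# Sketch of extending a Kaldi-style pronunciation lexicon with
# domain-specific words. Entries and phones are illustrative; a real
# pipeline would generate pronunciations for OOV words via a G2P model.

# Base lexicon: word -> list of pronunciations (each a phone string).
base_lexicon = {
    "北京": ["b ei3 j ing1"],
    "招聘": ["zh ao1 p in4"],
}

# Domain words with curated pronunciations (stand-in for G2P output).
domain_entries = {
    "转转": ["zh uan3 zh uan4"],
    "招聘": ["zh ao1 p in4"],   # duplicate: must not be emitted twice
}

def merge_lexicons(base, extra):
    """Merge pronunciations, de-duplicating identical entries."""
    merged = {w: list(ps) for w, ps in base.items()}
    for word, prons in extra.items():
        for p in prons:
            if p not in merged.setdefault(word, []):
                merged[word].append(p)
    return merged

def to_lexicon_txt(lexicon):
    """Render in Kaldi lexicon.txt format: one 'word phones...' per line."""
    return "\n".join(f"{w} {p}" for w in sorted(lexicon) for p in lexicon[w])
```

De‑duplicating matters because Kaldi's lexicon preparation rejects or double‑counts repeated identical entries, and domain lists routinely overlap the base vocabulary.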
Engineering Architecture: Services are containerized with Docker and exposed via gRPC/HTTP. Two main services exist: a batch file‑recognition service (using nnet3‑latgen‑faster‑parallel on CPU and batched‑wav‑nnet3‑cuda on GPU) and a real‑time streaming service (online2‑wav‑nnet3‑latgen‑faster). Both share preprocessing, VAD, speaker diarization, decoding, and post‑processing modules.
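The shared-module design can be sketched as a simple stage pipeline: both services run preprocessing → VAD → diarization → decoding → post‑processing and differ only in the decode stage they plug in. The stage names and signatures below are illustrative; in the real services each stage wraps the Kaldi binaries named above behind gRPC/HTTP.

```python
# Illustrative sketch of the shared recognition pipeline:
# preprocessing -> VAD -> speaker diarization -> decode -> post-processing.
# Stages here are toy stand-ins for the real Kaldi-backed modules.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Utterance:
    audio: list                      # raw samples (placeholder)
    text: str = ""
    speaker: str = ""
    meta: dict = field(default_factory=dict)

Stage = Callable[[Utterance], Utterance]

def run_pipeline(utt: Utterance, stages: List[Stage]) -> Utterance:
    """Apply each stage in order; batch and streaming services share
    this structure and swap only the decode stage."""
    for stage in stages:
        utt = stage(utt)
    return utt

# Toy stages standing in for the real modules.
def preprocess(u):  u.meta["resampled"] = True; return u
def vad(u):         u.meta["segments"] = [(0.0, 1.0)]; return u
def diarize(u):     u.speaker = "spk1"; return u
def decode(u):      u.text = "hello "; return u
def postprocess(u): u.text = u.text.strip(); return u
```

Keeping every stage behind the same `Utterance -> Utterance` contract is what lets the batch and streaming services reuse the same preprocessing, VAD, and diarization code.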
VAD & Speaker Diarization Optimizations: Replaced the WebRTC VAD with a two‑layer LSTM VAD and added a ResNet‑based speaker embedding model, reducing the diarization error rate by 10%. For long recordings, implemented a segmented delayed‑prediction LSTM with cuDNN acceleration.
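The segmentation logic behind delayed prediction can be illustrated without the LSTM itself: cut the frame sequence into fixed‑size chunks with some right context, score each chunk, and keep only the frames whose predictions have seen `delay` frames of future context. This is a sketch of one plausible chunking scheme, with parameter names invented for illustration, not 58.com's actual configuration.

```python
# Sketch of segmented delayed prediction for long recordings: each chunk
# carries `delay` extra right-context frames, and only the frames that
# have seen that future context are kept ("keep_end"); the rest are
# re-scored as the start of the next chunk. The LSTM scoring each chunk
# is omitted -- any frame classifier fits here.

def segment_frames(n_frames: int, chunk: int, delay: int):
    """Yield (start, end, keep_end) triples over n_frames frames.
    Frames [start, keep_end) are kept from each chunk; [keep_end, end)
    is right context consumed again by the next chunk."""
    start = 0
    while start < n_frames:
        end = min(start + chunk + delay, n_frames)
        # At the tail of the recording there is no more future context,
        # so the final chunk keeps everything up to the last frame.
        keep_end = end if end == n_frames else min(start + chunk, n_frames)
        yield start, end, keep_end
        start = keep_end
```

Because each chunk is bounded, the LSTM's working state never grows with recording length, which is what makes hour‑long files tractable on GPU.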
Decoder Optimizations: Modified the Kaldi decoders for streaming, reduced the search space, switched the memory allocator to tcmalloc, and introduced fast intermediate‑result extraction, cutting per‑segment latency from 104 ms to 18 ms and overall streaming latency from 567 ms to 97 ms.
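The shape of intermediate‑result extraction is easy to show with a stub: rather than waiting for an endpoint to produce a transcript, the streaming loop asks the decoder for its current best partial hypothesis every few frames. The decoder below is a toy stand‑in; in the real engine this wraps the modified Kaldi online decoder, whose cheap best‑path extraction is what the latency numbers above measure.

```python
# Toy illustration of streaming decoding with periodic partial results.
# StubDecoder stands in for the real (modified) Kaldi online decoder.

class StubDecoder:
    """Stand-in decoder: accumulates one token per frame."""
    def __init__(self):
        self.tokens = []

    def advance(self, frame):
        """Consume one frame of input."""
        self.tokens.append(frame)

    def best_partial(self):
        """Cheaply extract the current best hypothesis."""
        return " ".join(self.tokens)

def stream_decode(frames, emit_every=4):
    """Feed frames one by one, yielding a partial result every
    `emit_every` frames plus a final result at end of stream."""
    dec = StubDecoder()
    partials = []
    for i, frame in enumerate(frames, 1):
        dec.advance(frame)
        if i % emit_every == 0:
            partials.append(dec.best_partial())
    return partials, dec.best_partial()
```

The cost of `best_partial()` is what dominates perceived latency in this loop, which is why making that extraction fast mattered more than raw decoding throughput.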
Deployment & Results: By July 2020 the self‑built engine replaced third‑party services in both batch and real‑time scenarios, handling millions of hours of audio with comparable or better accuracy and significantly lower cost.
Future Work: Continue improving multi‑scenario and multi‑dialect performance, explore speaker verification and speech synthesis, and expand the team to support broader voice technologies.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.