
Streaming Speech Recognition Engine: Architecture, Workflow, and Optimizations at 58.com

The article details the design, components, real‑time processing flow, and performance optimizations of 58.com’s streaming speech recognition engine, covering its SDK access layer, logical services, data storage, Kaldi‑based decoding, and the practical impact on voice‑driven applications.

58 Tech

Background

Speech is a crucial communication medium for 58.com users, generating massive amounts of audio data that can be converted to text via speech recognition. Streaming recognition enables real-time transcription, supporting interactive voice services for both B-end merchants and C-end users.

Overall Architecture

The engine consists of three layers: (1) an access layer with iOS/Android/Java SDKs establishing full-duplex connections; (2) a logic layer including the voice access service, silence detection, and real-time decoding, built on the open-source Kaldi framework with an ABTest service for model comparison; (3) a data layer storing configurations in MySQL, audio files in the proprietary WOS middleware, transcriptions in the KV store WTable, and analytics data in Hive.
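The data layer described above maps each kind of data to a dedicated store. A hypothetical configuration sketch (store names are from the article; the keys and comments are illustrative, not the actual 58.com configuration):

```
# Illustrative data-layer mapping (keys are hypothetical)
config_store:  MySQL    # engine and model configurations
audio_store:   WOS      # raw audio files (58.com's proprietary storage middleware)
transcripts:   WTable   # key-value store for recognition results
analytics:     Hive     # offline analysis of recognition logs
```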

Streaming Recognition Process

The workflow follows four stages: handshake and authentication; recognition start (model selection via ABTest); recognition in progress (voice activity detection triggers real-time decoding, with intermediate results returned to the client); and recognition end (final result returned, resources released).
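The four stages above can be sketched as a small state machine. This is a minimal illustration, not the actual protocol: all class, method, and state names are hypothetical, and the decoding step is stubbed out.

```python
from enum import Enum, auto

class RecognitionState(Enum):
    """Hypothetical states mirroring the four stages described above."""
    HANDSHAKE = auto()
    STARTED = auto()
    IN_PROGRESS = auto()
    ENDED = auto()

class StreamingSession:
    """Minimal sketch of one streaming recognition session."""

    def __init__(self):
        self.state = RecognitionState.HANDSHAKE
        self.partial_results = []

    def authenticate(self, token: str) -> None:
        # In the real service this would verify a signed key.
        if self.state is not RecognitionState.HANDSHAKE:
            raise RuntimeError("handshake already completed")
        self.state = RecognitionState.STARTED

    def feed_audio(self, chunk: bytes) -> str:
        # A real implementation would run VAD and lattice search here.
        if self.state is RecognitionState.STARTED:
            self.state = RecognitionState.IN_PROGRESS
        if self.state is not RecognitionState.IN_PROGRESS:
            raise RuntimeError("session not accepting audio")
        partial = f"partial:{len(chunk)} bytes"
        self.partial_results.append(partial)
        return partial

    def finish(self) -> str:
        # Final result returned, resources released.
        self.state = RecognitionState.ENDED
        return "final result"
```

Modeling the session explicitly makes it easy to reject out-of-order messages, e.g. audio arriving before the handshake completes.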

Access Layer SDK

The SDK handles authentication, connection establishment, event processing, and callback invocation for the different recognition states, enabling the client to receive start, progress, and end notifications.
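A callback-driven client of this kind might be sketched as follows. The event names and class are assumptions for illustration, not the real 58.com SDK API, and the event loop is simplified to a synchronous call.

```python
class SpeechSdkClient:
    """Illustrative SDK client: callbacks fire for start, progress (partial
    results), and end events. All names here are hypothetical."""

    def __init__(self):
        self._handlers = {}

    def on(self, event, handler):
        """Register a callback for 'start', 'partial', or 'end'."""
        self._handlers[event] = handler

    def _emit(self, event, *args):
        handler = self._handlers.get(event)
        if handler:
            handler(*args)

    def run(self, chunks):
        # Simplified synchronous loop; the real SDK receives these events
        # over a full-duplex connection with the voice access service.
        self._emit("start")
        for chunk in chunks:
            self._emit("partial", f"heard {len(chunk)} bytes")
        self._emit("end", "final transcript")
```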

Core Functions of the Engine

Core services include the voice access (interaction) service, silence detection, and real-time decoding. The voice access service manages protocol exchange with the SDK, forwards audio streams to the decoder, and performs post-processing such as punctuation restoration.

Voice Access (Interaction) Service

The service provides five key capabilities: (1) authentication via signed keys, (2) concurrency limiting per business unit, (3) event and callback handling for the various recognition states, (4) bidirectional data-stream handling, and (5) post-processing to add punctuation. It also invokes the ABTest and silence-detection services, with optional degradation paths under high load.
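Capability (2), per-business-unit concurrency limiting, can be sketched with one bounded semaphore per business unit. This is an assumed implementation strategy, not the article's actual mechanism; the class and limits are illustrative.

```python
import threading

class ConcurrencyLimiter:
    """Sketch of per-business-unit concurrency limits. Each unit gets its
    own bounded semaphore sized to its quota (names are hypothetical)."""

    def __init__(self, limits):
        # limits: mapping of business-unit name -> max concurrent sessions
        self._sems = {biz: threading.BoundedSemaphore(n) for biz, n in limits.items()}

    def try_acquire(self, biz: str) -> bool:
        """Non-blocking: returns False when the unit is at its limit or unknown,
        letting the caller reject the request or take a degradation path."""
        sem = self._sems.get(biz)
        if sem is None:
            return False
        return sem.acquire(blocking=False)

    def release(self, biz: str) -> None:
        self._sems[biz].release()
```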

Real-Time Decoding Service

The service implements low-latency transcription using Kaldi-trained acoustic and language models. At initialization, multiple decoder instances are loaded into a synchronized pool; each request borrows a decoder, extracts audio features, and performs a fast lattice search to produce intermediate and final text results.
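The synchronized decoder pool described above might look like the following sketch, using a thread-safe queue so that borrowing and returning decoders is safe under concurrency. The `Decoder` class is a stand-in for a real Kaldi decoder wrapper; all names are hypothetical.

```python
import queue

class Decoder:
    """Stand-in for a Kaldi decoder instance (hypothetical wrapper)."""

    def decode_chunk(self, audio: bytes) -> str:
        # A real decoder would extract features and run lattice search.
        return f"text for {len(audio)} bytes"

class DecoderPool:
    """Fixed pool of pre-loaded decoders; each request borrows one and
    returns it when the session ends. queue.Queue provides the locking."""

    def __init__(self, size: int):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(Decoder())  # loaded once at initialization

    def acquire(self, timeout=None) -> Decoder:
        # Blocks until a decoder is free (bounds concurrent decodes).
        return self._pool.get(timeout=timeout)

    def release(self, decoder: Decoder) -> None:
        self._pool.put(decoder)
```

Pre-loading the models once at startup avoids paying model-load cost per request; the pool size then caps how many decodes run at once.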

Performance Optimizations

To meet concurrency, latency, and accuracy requirements, the team increased the number of decoder instances, tuned the beam and lattice-beam parameters, reduced memory fragmentation, and optimized decoding paths. These changes cut average decoding time from 567 ms to 97 ms while preserving accuracy, and reduced intermediate-result latency from roughly 104 ms to roughly 18 ms regardless of audio chunk size.
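Kaldi's online decoders expose the beam settings mentioned above as command-line options. A hypothetical tuning fragment (the flag names are standard Kaldi decoder options, but the values here are illustrative, not the ones 58.com used):

```
--beam=11.0          # main pruning beam: smaller is faster but risks accuracy
--lattice-beam=4.0   # beam for lattice generation; controls lattice size
--max-active=5000    # cap on active decoding states, bounding per-frame cost
```

Tightening these beams trades search accuracy for speed, which is why such tuning is typically validated against a held-out test set before deployment.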

Conclusion

The streaming speech recognition engine has fully replaced third-party services in 58.com's voice bots and intelligent Q&A scenarios, delivering real-time, high-accuracy transcription. Future work will explore additional application types and further performance enhancements.

Tags: architecture, AI, streaming, speech recognition, Kaldi, real-time decoding
Written by 58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.