Practical Implementation of Voice Activity Detection (VAD) for Streaming and Offline Scenarios at 58.com
This article presents the design, training, deployment, and evaluation of a self‑developed Voice Activity Detection system used in both real‑time streaming dialogues and offline audio analysis at 58.com, detailing algorithm choices, smoothing strategies, engineering challenges, and future research directions.
AI‑driven voice applications have been widely deployed at 58.com, where Voice Activity Detection (VAD) acts as a valve that controls the flow of audio signals and determines subsequent system actions.
The VAD system serves two main scenarios: a streaming case for real‑time human‑machine conversations (e.g., voice robots for sales, service promotion, and content moderation) and an offline case for the "Lingxi" intelligent speech analysis platform that converts recorded audio to text and extracts tags via NLP.
At the algorithm level, the VAD comprises two modules. A decision module performs frame-level binary classification (speech vs. non-speech) with a neural network: a two-layer LSTM (hidden size 64) followed by two fully-connected layers (FC size 32, output dimension 2). A post-processing module then smooths the frame-level decisions with a sliding window parameterized by {N, T1, T2}; different window settings (e.g., {5, 4, 5} vs. {15, 10, 15}) affect the continuity of the detected speech segments.
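The article names the smoothing parameters {N, T1, T2} without giving pseudocode. A plausible reading is a hysteresis rule over the last N frame decisions: at least T1 speech frames flip the state to speech, and at least T2 non-speech frames flip it back. A minimal sketch under that assumption (the exact semantics in the production system may differ):

```python
from collections import deque

def smooth(frame_decisions, N, T1, T2):
    """Hysteresis smoothing over a sliding window of N frame decisions.

    Assumed semantics (the article only names the parameters): in the
    non-speech state, switch to speech once >= T1 of the last N frames
    are speech; in the speech state, switch back to non-speech once
    >= T2 of the last N frames are non-speech.
    """
    window = deque(maxlen=N)   # most recent raw frame decisions (0/1)
    in_speech = False
    smoothed = []
    for d in frame_decisions:
        window.append(d)
        speech_frames = sum(window)
        if not in_speech and speech_frames >= T1:
            in_speech = True
        elif in_speech and len(window) - speech_frames >= T2:
            in_speech = False
        smoothed.append(1 if in_speech else 0)
    return smoothed
```

Under this reading, {5, 4, 5} ends a segment as soon as five consecutive non-speech frames appear, while {15, 10, 15} bridges short pauses inside an utterance, which matches the continuity difference described above.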
Training data include 520 hours of 8 kHz telephone audio for the voice‑robot scenario and 440 hours of 16 kHz app‑side audio for the interview scenario. Features are 40‑dimensional MFCCs (25 ms frame length, 10 ms frame shift) stored in TFRecord format. The model is trained with cross‑entropy loss for 20 epochs.
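Given the stated framing (25 ms frame length, 10 ms shift), the number of MFCC frames per utterance follows directly. A small helper, assuming the common convention of dropping the final partial frame:

```python
def num_frames(n_samples, sample_rate, frame_ms=25, shift_ms=10):
    """Number of full analysis frames for the 25 ms / 10 ms framing above."""
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000       # samples per shift
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // shift
```

One second of audio yields 98 frames at either 8 kHz or 16 kHz, since both frame length and shift scale with the sample rate.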
For streaming deployment, the model is served on the internal wpai deep‑learning platform. Audio packets of 20 ms (640 bytes) are buffered, and a request is sent every five packets (≈100 ms) together with the previous LSTM hidden state. Overlapping samples are included to maintain feature continuity.
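The client-side buffering loop described above can be sketched as follows. Here `infer` is a hypothetical stand-in for the wpai service call: it takes the buffered audio plus the previous LSTM state and returns frame decisions with the updated state (the real service interface is not specified in the article):

```python
class StreamingVADClient:
    """Sketch of the streaming buffering loop: accumulate five 20 ms
    packets, then send one request carrying the previous LSTM state."""

    PACKET_BYTES = 640          # one 20 ms audio packet, per the article
    PACKETS_PER_REQUEST = 5     # ~100 ms of audio per request

    def __init__(self, infer):
        self.infer = infer      # hypothetical service call
        self.buffer = bytearray()
        self.state = None       # previous LSTM hidden/cell state

    def feed(self, packet):
        """Accept one packet; return frame decisions when a request fires."""
        assert len(packet) == self.PACKET_BYTES
        self.buffer.extend(packet)
        if len(self.buffer) == self.PACKET_BYTES * self.PACKETS_PER_REQUEST:
            decisions, self.state = self.infer(bytes(self.buffer), self.state)
            self.buffer.clear()
            return decisions
        return None             # still accumulating
```

Carrying `self.state` across requests is what lets the server-side LSTM treat the stream as one continuous sequence rather than independent 100 ms snippets.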
In the offline scenario, audio is split into 5-second chunks with a 1-second overlap, enabling batch inference. The model uses TensorFlow's CudnnCompatibleLSTMCell, which reduces inference time by 34.6 % compared with the standard LSTM implementation.
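The 5-second chunks with a 1-second overlap amount to slicing with a 4-second hop. A sketch of that splitting step (`chunk_audio` is an illustrative helper, not the platform's API):

```python
def chunk_audio(samples, sample_rate, chunk_s=5, overlap_s=1):
    """Split audio into chunk_s-second chunks overlapping by overlap_s seconds.

    The final chunk may be shorter than chunk_s seconds.
    """
    chunk = chunk_s * sample_rate
    hop = (chunk_s - overlap_s) * sample_rate   # 4 s hop for 5 s / 1 s overlap
    chunks = []
    start = 0
    while start < len(samples):
        chunks.append(samples[start:start + chunk])
        if start + chunk >= len(samples):
            break
        start += hop
    return chunks
```

The 1-second overlap gives the LSTM context at each chunk boundary, so decisions near the edges are not made from a cold start.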
Evaluation covers four metrics: false alarm rate, miss rate, diarization error rate (DER), and character error rate (CER). On these, the self‑developed VAD outperforms WebRTC VAD in both mono‑channel and stereo‑channel conditions, as illustrated in Tables 1 and 2.
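The article does not define the metrics, but false alarm and miss rates are conventionally computed from frame-level labels: a false alarm is a non-speech frame classified as speech, a miss is a speech frame classified as non-speech. A sketch under that common definition:

```python
def frame_error_rates(ref, hyp):
    """False-alarm and miss rates from frame-level labels (1 = speech).

    False alarm rate: fraction of reference non-speech frames marked speech.
    Miss rate: fraction of reference speech frames marked non-speech.
    """
    false_alarms = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    misses = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 0)
    nonspeech_total = sum(1 for r in ref if r == 0)
    speech_total = sum(1 for r in ref if r == 1)
    fa_rate = false_alarms / max(nonspeech_total, 1)
    miss_rate = misses / max(speech_total, 1)
    return fa_rate, miss_rate
```

The two rates pull in opposite directions: a more permissive VAD lowers misses (protecting downstream ASR, hence CER) at the cost of more false alarms.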
The paper concludes with future work directions: handling low‑SNR and reverberant environments, improving model generalization via domain adaptation, and jointly optimizing VAD with downstream modules such as speaker diarization and speech recognition.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.