Iterative Optimization of Voice Endpoint Detection for Voice Robots: From Dual‑Threshold to WebRTC VAD and VADNet
This article details the evolution of the voice endpoint detection (VAD) module in 58.com’s voice robot, comparing a dual‑threshold method, Google’s WebRTC VAD, and the deep‑learning based VADNet, and presents experimental results on precision, recall, F1 score, and online latency.
The voice robot developed by 58.com’s TEG AI Lab includes modules for automatic telephone dialing, multi‑turn voice interaction, and intelligent intent judgment; its voice endpoint detection (VAD) component identifies the start and end of human speech within noisy audio streams.
VAD is crucial for extracting speech segments for recognition; its performance directly impacts the smoothness of dialogue and user experience.
Dual‑Threshold VAD: Treats VAD as a binary classification task on short frames (20–25 ms). It uses the short‑time zero‑crossing rate (ZCR) to detect unvoiced sounds and short‑time energy (STE) to detect voiced sounds. When both ZCR and STE exceed their upper thresholds, a speech segment starts; when either falls below its lower threshold, the segment ends. The four thresholds are denoted zcr_high, zcr_low, energy_high, and energy_low.
The dual‑threshold approach yields high recall (≈97%) but low precision (≈36%), for an F1 of 52.28%.
| Method | Forward precision | Forward recall | Forward F1 |
| --- | --- | --- | --- |
| Dual‑threshold VAD | 35.73% | 97.37% | 52.28% |
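The frame‑level logic above can be sketched in a few lines of NumPy. The threshold values and the 8 kHz sample rate below are illustrative assumptions for the sketch, not the thresholds used in production:

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping frames of frame_len samples."""
    n = len(x) // frame_len
    return x[: n * frame_len].reshape(n, frame_len)

def dual_threshold_vad(x, sr=8000, frame_ms=20,
                       zcr_high=0.10, zcr_low=0.02,
                       energy_high=10.0, energy_low=1.0):
    """Label each frame speech/non-speech with the dual-threshold rule:
    a segment starts when both ZCR and STE exceed their upper thresholds,
    and ends when either drops below its lower threshold."""
    frames = frame_signal(np.asarray(x, dtype=np.float64), sr * frame_ms // 1000)
    ste = np.sum(frames ** 2, axis=1)  # short-time energy per frame
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero-crossing rate
    labels, in_speech = [], False
    for e, z in zip(ste, zcr):
        if not in_speech and e > energy_high and z > zcr_high:
            in_speech = True   # segment starts
        elif in_speech and (e < energy_low or z < zcr_low):
            in_speech = False  # segment ends
        labels.append(in_speech)
    return np.array(labels)
```

In practice the thresholds must be calibrated per channel (often on a leading stretch of assumed silence), which is one reason the method’s precision collapses on noisy telephone audio.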
WebRTC VAD: Uses Gaussian mixture models (GMMs) to model speech and noise in six frequency sub‑bands. For each frame it computes a per‑band log‑likelihood ratio and a weighted overall ratio; if any of these exceeds its threshold, the frame is classified as speech. The GMM parameters are updated online according to each frame’s classification.
| Method | Forward precision | Forward recall | Forward F1 |
| --- | --- | --- | --- |
| WebRTC VAD | 54.75% | 97.36% | 70.09% |
Compared with the dual‑threshold method, WebRTC VAD markedly improves precision while keeping recall essentially unchanged, raising the F1 by about 18 percentage points (52.28% → 70.09%).
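The per‑band likelihood‑ratio test can be illustrated with a simplified NumPy sketch. For brevity it uses a single Gaussian per band where WebRTC uses two‑component GMMs, and all means, variances, weights, and thresholds below are made‑up illustrative values:

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def gmm_vad_frame(band_energies, speech_models, noise_models,
                  weights, local_thr=3.0, global_thr=2.0):
    """Simplified WebRTC-style decision for one frame: compute a
    speech-vs-noise log-likelihood ratio in each sub-band; the frame is
    speech if any single band's ratio, or the weighted sum over all
    bands, exceeds its threshold."""
    llrs = np.array([
        log_gaussian(e, *s) - log_gaussian(e, *n)
        for e, s, n in zip(band_energies, speech_models, noise_models)
    ])
    return bool(np.any(llrs > local_thr) or np.dot(weights, llrs) > global_thr)

# Illustrative (mean, variance) models: speech energy centered at 10,
# noise at 2, identically in all six sub-bands.
speech_models = [(10.0, 4.0)] * 6
noise_models = [(2.0, 4.0)] * 6
weights = np.full(6, 1.0 / 6)
```

In the real implementation the means and variances are additionally re‑estimated after each frame, updating the noise models when the frame is judged non‑speech and the speech models otherwise, which is the online adaptation described above.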
VADNet Exploration : VADNet is a CRNN‑based deep‑learning model that takes either raw waveforms or acoustic features (MFCC, FBank) as input and outputs a binary speech/non‑speech decision. Experiments compared three input types:
| Input | Forward precision | Forward recall | Forward F1 | FEC | OVER | MSC | NDS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MFCC | 79.76% | 91.36% | 85.17% | 7.20% | 0.37% | 6.60% | 4.04% |
| FBank | 71.47% | 94.06% | 81.22% | 4.72% | 0.64% | 5.06% | 6.20% |
| Raw waveform | 80.87% | 91.85% | 86.01% | 6.81% | 0.35% | 6.47% | 3.71% |

(FEC = front‑end clipping, MSC = mid‑speech clipping, OVER = speech carried over into the following silence, NDS = noise detected as speech; lower is better for these four error rates.)
Raw waveform input achieved the highest F1, so subsequent experiments used waveform as input.
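A CRNN of this general shape can be sketched in PyTorch. The layer sizes, kernel widths, and the 8 kHz / 200 ms framing below are assumptions for illustration, not the actual VADNet architecture:

```python
import torch
import torch.nn as nn

class VADNetSketch(nn.Module):
    """Illustrative CRNN: stacked 1-D convolutions over the raw waveform
    feed a single recurrent layer (the "conv4 + 1-layer RNN" pattern).
    All layer sizes here are assumptions, not the production VADNet."""
    def __init__(self, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.rnn = nn.GRU(64, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)  # speech / non-speech logits

    def forward(self, wav):                   # wav: (batch, samples)
        z = self.conv(wav.unsqueeze(1))       # (batch, 64, T')
        out, _ = self.rnn(z.transpose(1, 2))  # run the RNN over conv time steps
        return self.fc(out[:, -1])            # one decision per input frame

model = VADNetSketch()
frame = torch.randn(1, 1600)  # one 200 ms frame at an assumed 8 kHz
logits = model(frame)         # shape (1, 2)
```

Feeding raw samples lets the first convolutional layers learn a filter bank directly, which is consistent with the waveform input edging out hand‑crafted MFCC and FBank features in the table above.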
Online deployment considerations include frame length and prediction latency. Various configurations were tested:
| Configuration | Frame length | Online inference time | Forward precision | Forward recall | Forward F1 |
| --- | --- | --- | --- | --- | --- |
| 500 ms frame + conv3 + 2‑layer RNN | 500 ms | 120 ms/frame | 80.87% | 91.85% | 86.01% |
| 200 ms frame + conv4 + 1‑layer RNN | 200 ms | 26 ms/frame | 75.78% | 92.98% | 83.51% |
| 200 ms frame + conv4 + no RNN | 200 ms | 16 ms/frame | 58.37% | 85.11% | 69.25% |
The production system currently uses a 200 ms frame length with a conv4 + 1‑layer RNN architecture, sliding a 200 ms window with a 100 ms hop.
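That online windowing can be sketched as a simple generator; the 8 kHz sample rate below is an assumption (typical for telephony audio):

```python
def sliding_frames(samples, sr=8000, win_ms=200, hop_ms=100):
    """Yield (start_index, window) pairs: a fixed 200 ms window advanced
    by a 100 ms hop, so each sample is covered by two consecutive VAD
    decisions (except at the stream edges)."""
    win = sr * win_ms // 1000
    hop = sr * hop_ms // 1000
    for start in range(0, len(samples) - win + 1, hop):
        yield start, samples[start:start + win]

# One second of audio yields nine overlapping 200 ms windows.
windows = list(sliding_frames([0.0] * 8000))
```

The 100 ms hop halves the worst‑case delay before an endpoint decision relative to jumping a full window at a time, at the cost of running inference twice as often.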
Overall comparison shows VADNet achieves the highest forward F1 (≈83.5 %) compared with WebRTC VAD (≈70 %) and dual‑threshold VAD (≈52 %). VADNet’s per‑frame inference time is about 26 ms, leaving room for further optimization.
Conclusion : The article reviews the iterative optimization of the voice endpoint detection module in the voice robot, covering dual‑threshold, WebRTC VAD, and VADNet methods, and outlines future work to improve feature extraction, segmentation strategies, latency, and F1 performance.
The VAD module currently supports the voice robot reliably, but further enhancements are planned to achieve faster and more accurate speech endpoint detection.
Department Introduction : 58.com TEG AI Lab focuses on applying AI technologies across the platform, delivering products such as intelligent customer service, voice robots, automated writing, speech analysis, and AI algorithm platforms.
58 Tech: the official tech channel of 58, a platform for tech innovation, sharing, and communication.