Iterative Optimization of Voice Endpoint Detection for Voice Robots: From Dual‑Threshold to WebRTC VAD and VADNet
This article details the evolution of the voice endpoint detection (VAD) module in 58.com’s voice robot, comparing a dual‑threshold method, Google’s WebRTC VAD, and the deep‑learning based VADNet, and presents experimental results on precision, recall, F1 score, and online latency.
The voice robot developed by 58.com’s TEG AI Lab includes modules for automatic telephone dialing, multi‑turn voice interaction, and intelligent intent judgment; its voice endpoint detection (VAD) component identifies the start and end of human speech within noisy audio streams.
VAD is crucial for extracting speech segments for recognition; its performance directly impacts the smoothness of dialogue and user experience.
Dual‑Threshold VAD: Treats VAD as a binary classification task on short frames (20–25 ms). It uses the short‑time zero‑crossing rate (ZCR) to detect unvoiced sounds and short‑time energy (STE) to detect voiced sounds. When both ZCR and STE exceed their upper thresholds, a speech segment starts; when either falls below its lower threshold, the segment ends. The four thresholds are denoted zcr_high, zcr_low, energy_high, and energy_low.
The dual‑threshold approach yields high recall (≈97%) but low precision (≈36%), for an F1 of 52.28%.
| Method | Forward precision | Forward recall | Forward F1 |
| --- | --- | --- | --- |
| Dual‑threshold VAD | 35.73% | 97.37% | 52.28% |
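The frame‑level logic above can be sketched in a few lines of NumPy. The threshold values and the 8 kHz sample rate below are illustrative assumptions for the sketch, not the thresholds used in production:

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping frames of frame_len samples."""
    n = len(x) // frame_len
    return x[: n * frame_len].reshape(n, frame_len)

def dual_threshold_vad(x, sr=8000, frame_ms=20,
                       zcr_high=0.10, zcr_low=0.02,
                       energy_high=10.0, energy_low=1.0):
    """Label each frame speech/non-speech with the dual-threshold rule:
    a segment starts when both ZCR and STE exceed their upper thresholds,
    and ends when either drops below its lower threshold."""
    frames = frame_signal(np.asarray(x, dtype=np.float64), sr * frame_ms // 1000)
    ste = np.sum(frames ** 2, axis=1)  # short-time energy per frame
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero-crossing rate
    labels, in_speech = [], False
    for e, z in zip(ste, zcr):
        if not in_speech and e > energy_high and z > zcr_high:
            in_speech = True   # segment starts
        elif in_speech and (e < energy_low or z < zcr_low):
            in_speech = False  # segment ends
        labels.append(in_speech)
    return np.array(labels)
```

In practice the thresholds must be calibrated per channel (often on a leading stretch of assumed silence), which is one reason the method’s precision collapses on noisy telephone audio.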
WebRTC VAD: Uses Gaussian mixture models (GMMs) to model speech and noise in six frequency sub‑bands. For each frame it computes a per‑band log‑likelihood ratio and a weighted overall ratio; if any of these exceeds its threshold, the frame is classified as speech. The GMM parameters are updated online according to each frame’s classification.
| Method | Forward precision | Forward recall | Forward F1 |
| --- | --- | --- | --- |
| WebRTC VAD | 54.75% | 97.36% | 70.09% |
Compared with the dual‑threshold method, WebRTC VAD markedly improves precision while keeping recall essentially unchanged, raising the F1 by about 18 percentage points (52.28% → 70.09%).
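The per‑band likelihood‑ratio test can be illustrated with a simplified NumPy sketch. For brevity it uses a single Gaussian per band where WebRTC uses two‑component GMMs, and all means, variances, weights, and thresholds below are made‑up illustrative values:

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def gmm_vad_frame(band_energies, speech_models, noise_models,
                  weights, local_thr=3.0, global_thr=2.0):
    """Simplified WebRTC-style decision for one frame: compute a
    speech-vs-noise log-likelihood ratio in each sub-band; the frame is
    speech if any single band's ratio, or the weighted sum over all
    bands, exceeds its threshold."""
    llrs = np.array([
        log_gaussian(e, *s) - log_gaussian(e, *n)
        for e, s, n in zip(band_energies, speech_models, noise_models)
    ])
    return bool(np.any(llrs > local_thr) or np.dot(weights, llrs) > global_thr)

# Illustrative (mean, variance) models: speech energy centered at 10,
# noise at 2, identically in all six sub-bands.
speech_models = [(10.0, 4.0)] * 6
noise_models = [(2.0, 4.0)] * 6
weights = np.full(6, 1.0 / 6)
```

In the real implementation the means and variances are additionally re‑estimated after each frame, updating the noise models when the frame is judged non‑speech and the speech models otherwise, which is the online adaptation described above.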
VADNet Exploration : VADNet is a CRNN‑based deep‑learning model that takes either raw waveforms or acoustic features (MFCC, FBank) as input and outputs a binary speech/non‑speech decision. Experiments compared three input types:
| Input | Forward precision | Forward recall | Forward F1 | FEC | OVER | MSC | NDS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MFCC | 79.76% | 91.36% | 85.17% | 7.20% | 0.37% | 6.60% | 4.04% |
| FBank | 71.47% | 94.06% | 81.22% | 4.72% | 0.64% | 5.06% | 6.20% |
| Raw waveform | 80.87% | 91.85% | 86.01% | 6.81% | 0.35% | 6.47% | 3.71% |

(FEC = front‑end clipping, MSC = mid‑speech clipping, OVER = speech carried over into the following silence, NDS = noise detected as speech; lower is better for these four error rates.)
Raw waveform input achieved the highest F1, so subsequent experiments used waveform as input.
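A CRNN of this general shape can be sketched in PyTorch. The layer sizes, kernel widths, and the 8 kHz / 200 ms framing below are assumptions for illustration, not the actual VADNet architecture:

```python
import torch
import torch.nn as nn

class VADNetSketch(nn.Module):
    """Illustrative CRNN: stacked 1-D convolutions over the raw waveform
    feed a single recurrent layer (the "conv4 + 1-layer RNN" pattern).
    All layer sizes here are assumptions, not the production VADNet."""
    def __init__(self, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.rnn = nn.GRU(64, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)  # speech / non-speech logits

    def forward(self, wav):                   # wav: (batch, samples)
        z = self.conv(wav.unsqueeze(1))       # (batch, 64, T')
        out, _ = self.rnn(z.transpose(1, 2))  # run the RNN over conv time steps
        return self.fc(out[:, -1])            # one decision per input frame

model = VADNetSketch()
frame = torch.randn(1, 1600)  # one 200 ms frame at an assumed 8 kHz
logits = model(frame)         # shape (1, 2)
```

Feeding raw samples lets the first convolutional layers learn a filter bank directly, which is consistent with the waveform input edging out hand‑crafted MFCC and FBank features in the table above.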
Online deployment considerations include frame length and prediction latency. Various configurations were tested:
| Configuration | Frame length | Online inference time | Forward precision | Forward recall | Forward F1 |
| --- | --- | --- | --- | --- | --- |
| 500 ms frame + conv3 + 2‑layer RNN | 500 ms | 120 ms/frame | 80.87% | 91.85% | 86.01% |
| 200 ms frame + conv4 + 1‑layer RNN | 200 ms | 26 ms/frame | 75.78% | 92.98% | 83.51% |
| 200 ms frame + conv4 + no RNN | 200 ms | 16 ms/frame | 58.37% | 85.11% | 69.25% |
The production system currently uses a 200 ms frame length with a conv4 + 1‑layer RNN architecture, sliding a 200 ms window with a 100 ms hop.
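That online windowing can be sketched as a simple generator; the 8 kHz sample rate below is an assumption (typical for telephony audio):

```python
def sliding_frames(samples, sr=8000, win_ms=200, hop_ms=100):
    """Yield (start_index, window) pairs: a fixed 200 ms window advanced
    by a 100 ms hop, so each sample is covered by two consecutive VAD
    decisions (except at the stream edges)."""
    win = sr * win_ms // 1000
    hop = sr * hop_ms // 1000
    for start in range(0, len(samples) - win + 1, hop):
        yield start, samples[start:start + win]

# One second of audio yields nine overlapping 200 ms windows.
windows = list(sliding_frames([0.0] * 8000))
```

The 100 ms hop halves the worst‑case delay before an endpoint decision relative to jumping a full window at a time, at the cost of running inference twice as often.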
Overall comparison shows VADNet achieves the highest forward F1 (≈83.5 %) compared with WebRTC VAD (≈70 %) and dual‑threshold VAD (≈52 %). VADNet’s per‑frame inference time is about 26 ms, leaving room for further optimization.
Conclusion : The article reviews the iterative optimization of the voice endpoint detection module in the voice robot, covering dual‑threshold, WebRTC VAD, and VADNet methods, and outlines future work to improve feature extraction, segmentation strategies, latency, and F1 performance.
The VAD module currently supports the voice robot reliably, but further enhancements are planned to achieve faster and more accurate speech endpoint detection.
Department Introduction : 58.com TEG AI Lab focuses on applying AI technologies across the platform, delivering products such as intelligent customer service, voice robots, automated writing, speech analysis, and AI algorithm platforms.
58 Tech: the official tech channel of 58, a platform for tech innovation, sharing, and communication.