
Voice Robot Sound Classification: Feature Extraction, VGGish Model, and Optimization Experiments

This article describes the end‑to‑end pipeline of a voice robot, covering speech framing, feature extraction (FBank, MFCC), the VGGish embedding network, various model architectures, experimental results on accuracy and recall, and future directions for improving sound‑type classification.

The voice robot developed by 58.com’s AI Lab performs automatic dialing, multi‑turn voice interaction, and intent detection, but noisy or unintelligible speech degrades ASR accuracy; therefore a sound‑type classification module filters out unclear audio to improve intent recognition.

After VAD segmentation, the audio is converted to text via ASR, then processed by NLU and a dialogue manager; unclear speech is identified by a classification model whose current accuracy is about 92% and recall around 77%.

Sound feature extraction begins with framing and windowing, followed by short‑time Fourier transform (STFT) to obtain magnitude spectra. A Mel filter bank is applied, logarithm taken to produce FBank features, and finally a discrete cosine transform (DCT) yields MFCC coefficients (typically 2‑13, discarding the 0th DC term).
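The steps above can be made concrete with a minimal NumPy/SciPy sketch. This is not the team's code; the frame length, hop size, and filter counts are illustrative values for 8 kHz telephone audio.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def fbank_and_mfcc(signal, sr=8000, frame_len=200, hop=80,
                   n_fft=256, n_mels=26, n_mfcc=12):
    # 1. Framing (25 ms frames, 10 ms hop) with a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. STFT -> per-frame power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. Triangular Mel filter bank, spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    # 4. Log of filter-bank energies -> FBank features
    fbank = np.log(power @ fb.T + 1e-10)
    # 5. DCT of FBank, keeping coefficients 1..n_mfcc (dropping the 0th,
    #    DC-like term) -> MFCC
    mfcc = dct(fbank, type=2, axis=1, norm='ortho')[:, 1:n_mfcc + 1]
    return fbank, mfcc
```

For a one-second 8 kHz signal this yields 98 frames, each with 26 FBank values and 12 MFCC coefficients.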

FBank features retain more high‑frequency detail than MFCC, leading to higher classification accuracy in the robot’s use case.

The VGGish model, a 128‑dimensional embedding network pretrained on YouTube audio, is used to encode each frame’s acoustic features. The initial architecture feeds FBank features into VGGish, then a Bi‑LSTM with attention and a softmax classifier.
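The Bi-LSTM itself is omitted here, but the attention-plus-softmax head over a sequence of 128-dimensional frame embeddings can be sketched in NumPy. The simple dot-product attention and the parameter names are assumptions for illustration, not the team's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_classify(embeddings, w_att, W_out, b_out):
    """embeddings: (T, 128) frame embeddings (VGGish or Bi-LSTM outputs).
    w_att: (128,) attention scoring vector.
    W_out: (128, n_classes), b_out: (n_classes,) softmax classifier."""
    scores = embeddings @ w_att              # one relevance score per frame
    alpha = softmax(scores)                  # attention weights over frames
    context = alpha @ embeddings             # weighted sum -> utterance vector
    return softmax(context @ W_out + b_out)  # class probabilities
```

Attention lets the classifier weight informative frames (e.g., clearly voiced segments) more heavily than silence or noise before the final decision.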

Experiments showed that FBank features clearly outperform MFCC for this task: switching the input to MFCC reduced accuracy by about 60%, because the DCT step discards the high‑frequency detail that is crucial for distinguishing clear from unclear speech.

To improve recall, the team added frame‑wise weighting (spectral arithmetic mean divided by geometric mean) and retrained the network with the weighted FBank features, achieving modest gains.
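This weighting can be sketched as follows: the per-frame weight is the arithmetic mean of the frame's power spectrum divided by its geometric mean, i.e. the reciprocal of spectral flatness, so structured (tonal) frames receive larger weights than flat, noise-like ones. The function name and the epsilon guard are illustrative.

```python
import numpy as np

def frame_weights(power_spec, eps=1e-10):
    """power_spec: (n_frames, n_bins) per-frame power spectra.
    Returns one weight per frame: arithmetic mean / geometric mean."""
    p = power_spec + eps                      # guard against log(0)
    am = p.mean(axis=1)                       # arithmetic mean per frame
    gm = np.exp(np.log(p).mean(axis=1))       # geometric mean per frame
    return am / gm                            # >= 1; larger for peaky spectra
```

A perfectly flat spectrum gets weight 1, while a frame whose energy is concentrated in a few bins gets a much larger weight; multiplying FBank frames by these weights before retraining emphasizes speech-like frames.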

Further attempts included removing the Bi‑LSTM and attention layers, training a VGGish‑only classifier, and fine‑tuning on in‑domain telephone audio; this raised recall by 38% compared to the generic pretrained model.

Re‑adding the Bi‑LSTM and attention layers yielded only a ~1% recall increase, indicating most discriminative information is captured in the early convolutional layers.

Adding a small amount of newly labeled in‑domain data (≈10% of the original training set) provided an additional 0.5 percentage‑point recall improvement.

Future work will explore combining FBank, MFCC, and delta features, enhancing VGGish with batch normalization, global pooling, and reduced fully‑connected layers, and further network‑level optimizations.

Tags: deep learning, MFCC, Speech Recognition, audio feature extraction, FBank, VGGish, voice classification
Written by

58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.
