
Intelligent Voice Quality Inspection System: Architecture, Core Technologies, and Business Cases

This article presents 58.com’s intelligent voice quality inspection system, detailing its overall architecture, speech separation, speaker role identification, NLP‑based tagging, model choices such as VGG, BERT, ALBERT and SPTM, and real‑world call‑center use cases that improve efficiency and reduce risk.

58 Tech

Guide

58.com employs thousands of sales and customer‑service agents who make millions of hours of phone calls each year. To standardize work and improve service quality, the company built an intelligent voice quality inspection system that converts speech to text via ASR and then applies NLP techniques for automatic quality assessment.

Guest Introduction

Chen Lu, senior AI Lab algorithm engineer at 58.com, joined in September 2018 and focuses on voice quality inspection and voice‑bot algorithms. She holds a master’s degree from Beijing University of Posts and Telecommunications and previously worked at JD.com on product‑review mining.

Background

Traditional voice quality inspection relies on human reviewers listening to a small sample of recordings, which suffers from low coverage, inconsistent standards, missed risk information, and high labor cost.

Our intelligent system captures recordings in real time, transcribes them, runs a quality‑inspection model, and displays results on a web platform where reviewers can perform a second check and feed back to supervisors. This machine‑plus‑human workflow offers full‑volume inspection, real‑time feedback, precise risk identification, and minimal manual review.

Overall Architecture

The core is the logic layer, which includes speaker role recognition, semantic tagging, and speech scoring. The access layer feeds raw audio to the base‑service layer for speech separation and ASR; the logic layer then performs quality inspection and sends results to the web platform for visualization and further annotation.
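The layered data flow described above can be sketched as a simple pipeline. This is a toy illustration only: the function names, stub transcripts, tags, and scores are all hypothetical placeholders, not 58.com's actual APIs.

```python
# Hypothetical sketch of the three-layer flow: access -> base services -> logic.
# All names and return values here are illustrative, not the real system.

def access_layer(raw_audio):
    """Capture the call recording and hand it to the base services."""
    return raw_audio

def base_service_layer(audio):
    """Speech separation + ASR: returns one transcript per speaker turn (stubbed)."""
    return ["agent: hello, this is 58.com", "customer: hi, I have a question"]

def logic_layer(transcripts):
    """Role recognition, semantic tagging, and scoring (toy rules)."""
    tags = ["opening greeting"] if any("hello" in t for t in transcripts) else []
    return {"tags": tags, "score": 90 if tags else 60}

def inspect(raw_audio):
    """End-to-end pipeline; results would go to the web platform for review."""
    return logic_layer(base_service_layer(access_layer(raw_audio)))
```

The real logic layer runs the models described in the sections below; the point here is only the one-directional flow from raw audio to reviewable results.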

Core Technologies

Speech Separation & Recognition

Because 58’s outbound calls are recorded on a single channel, both speakers’ voices are mixed into one audio track. We therefore first separate the two speakers with a diarization pipeline (VAD → vectorization → k‑means clustering) and then run ASR on each separated stream. Separation quality is measured by diarization error rate (DER), recognition quality by character error rate (CER).
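The CER metric mentioned here is the character-level edit distance between the ASR hypothesis and the reference transcript, normalized by reference length. A minimal sketch (production systems use optimized libraries):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over characters.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate: edit distance / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

DER is defined analogously over time rather than characters: the fraction of audio time attributed to the wrong speaker (plus missed and false-alarm speech).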

VAD uses Google’s webrtcvad, audio‑to‑vector conversion employs a 34‑layer VGG‑ResNet, and clustering is performed with k‑means.
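The clustering stage of this pipeline can be sketched as follows. This is a toy illustration, not the production code: in production the segment embeddings come from the 34‑layer VGG‑ResNet, whereas here they are hand-made 2‑D points, and `kmeans_2spk` is a hypothetical helper name.

```python
import numpy as np

def kmeans_2spk(embeddings, iters=20, seed=0):
    """Plain k-means with k=2: assign each voiced-segment embedding
    to one of two speakers."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), 2, replace=False)]
    for _ in range(iters):
        # assign each segment embedding to its nearest center
        d = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        # recompute centers; keep the old center if a cluster empties out
        for k in range(2):
            if (labels == k).any():
                centers[k] = embeddings[labels == k].mean(axis=0)
    return labels

# Toy segment embeddings: two well-separated speakers.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9], [0.05, 0.1]])
labels = kmeans_2spk(emb)
```

Each cluster's segments are then concatenated into one per-speaker stream before ASR.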

Speaker Role Identification

After transcription, we first perform gender detection (VGGish + Bi‑LSTM + Attention, 92% accuracy). If genders differ, the known gender of the agent resolves the roles; if genders are the same, we use a prior that the agent speaks more, followed by a sentence‑level correction model (two‑layer BERT) to fix mis‑assigned utterances.
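The decision logic above can be sketched as a small function. The gender labels and per-speaker talk times are assumed to come from the upstream models, and the sentence-level BERT correction step is omitted; `assign_roles` is an illustrative name, not the system's API.

```python
def assign_roles(spk_a, spk_b, agent_gender):
    """Each speaker is a dict with 'gender' and 'talk_time' (seconds).
    Returns the ('agent'/'customer') roles for (spk_a, spk_b)."""
    if spk_a["gender"] != spk_b["gender"]:
        # Genders differ: the speaker matching the agent's known gender
        # is the agent.
        return (("agent", "customer") if spk_a["gender"] == agent_gender
                else ("customer", "agent"))
    # Same gender: fall back to the prior that the agent speaks more.
    return (("agent", "customer") if spk_a["talk_time"] >= spk_b["talk_time"]
            else ("customer", "agent"))
```

The sentence-level correction model then revisits individual utterances whose content contradicts the assigned role (e.g., a "customer" utterance that reads like a sales pitch).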

Quality‑Inspection Algorithms

We treat the detection of non‑standard or risky utterances as a sentence‑level classification task. For the sales line, we first used TextCNN because it handles short, local patterns well and is robust to ASR errors. Later we introduced a custom Simple Pre‑trained Model (SPTM) and a two‑layer ALBERT, both achieving comparable accuracy to BERT‑Base with far lower latency.
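A toy forward pass illustrates why TextCNN suits this task: narrow convolutions respond to short local n‑gram patterns, and max‑over‑time pooling makes the score insensitive to where in a (possibly ASR-noisy) sentence the pattern occurs. The shapes and weights below are illustrative, not trained parameters.

```python
import numpy as np

def textcnn_forward(x, filters, w_out):
    """x: (seq_len, emb_dim) token embeddings.
    filters: list of (width, emb_dim) convolution kernels.
    w_out: (n_filters,) linear classifier weights.
    Returns a single risk logit."""
    feats = []
    for f in filters:
        width = f.shape[0]
        # valid 1-D convolution over the token axis
        conv = [np.sum(x[i:i + width] * f) for i in range(len(x) - width + 1)]
        # ReLU, then max-over-time pooling: keep the strongest n-gram response
        feats.append(max(np.maximum(conv, 0.0)))
    return float(np.dot(w_out, feats))

# Three tokens in a 2-D embedding space; one width-2 filter.
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
f = np.array([[1.0, 0.0], [0.0, 1.0]])
```

Here the filter fires most strongly on the first bigram, and pooling keeps that response regardless of the bigram's position in the sentence.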

Beyond sentence-level tags, the system also supports global tags that leverage multi-turn context to avoid false positives; for example, "push-back" behavior can only be judged reliably across a whole dialogue rather than from a single utterance.

Rule Mining

We extract N‑gram rules and discover new words using pointwise mutual information and entropy, which helps identify patterns strongly associated with risky or abusive utterances.
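The cohesion half of this idea can be sketched with pointwise mutual information over adjacent token pairs; a candidate phrase with high PMI sticks together more often than chance. This toy version skips the left/right entropy (boundary-freedom) check, and `pmi` is an illustrative helper name.

```python
import math
from collections import Counter

def pmi(corpus_tokens, a, b):
    """PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), where p(a, b) is the
    probability of the adjacent bigram (a, b) in the corpus."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni, n_bi = len(corpus_tokens), max(len(corpus_tokens) - 1, 1)
    p_ab = bigrams[(a, b)] / n_bi
    p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
    return math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")
```

Bigrams scoring high on both PMI and boundary entropy become candidate new words or rule patterns for the risk tags.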

Business Cases

1. **High‑Risk Sales Calls** – Tags such as “complaint”, “abuse”, and “over‑promise” are detected, enabling supervisors to quarantine offending agents or place numbers in a “silence” pool for 180 days.

2. **Call‑Center Risk Control** – Historical labeled calls are used to predict future high‑risk calls, reducing customer complaints through proactive protection.

3. **Customer‑Service Quality** – Mandatory tags (e.g., “opening greeting”, “responsible‑person confirmation”) are enforced, and prohibited tags (e.g., “push‑back”, “no closing”) are flagged.

Open‑Source Projects

We also released two AI Lab open‑source projects: qa_match, a deep‑learning‑based question‑answer matching tool that includes the SPTM model, and dl_inference, a high‑throughput inference engine supporting TensorFlow, PyTorch, and Caffe models and serving billions of requests per day.

Overall, the system processes millions of hours of call audio, achieves ~92% inspection accuracy, saves over a thousand person‑hours per day, and continuously improves through human‑in‑the‑loop feedback.

Tags: machine learning, AI, NLP, call center, speech processing, voice quality inspection
Written by 58 Tech, the official tech channel of 58 and a platform for tech innovation, sharing, and communication.