How 58.com Scales Voice Quality Inspection with AI-Powered Architecture

This article details the AI-driven intelligent voice quality inspection system built by 58.com, covering its background, multi‑layer architecture, speech recognition, role and tag identification, backend services, and the resulting efficiency gains for large‑scale call‑center operations.

ITPUB

Background

Traditional voice-quality inspection relies on human reviewers who listen to a small sample of call recordings. Each reviewer can cover only about 3 hours of audio per day, so both coverage and efficiency are limited. Advances in automatic speech recognition (ASR) and natural language processing (NLP) make large-scale, automated inspection feasible.

System Overview

58.com processes more than 100 million calls per year. The AI Lab built an end-to-end intelligent voice-quality inspection platform that ingests raw call audio, converts it to text, identifies speaker roles, detects predefined quality-issue tags, scores the conversation, and presents the results on a web management console for manual verification.

Overall Architecture

Foundation Layer: Provides core NLP (segmentation, clustering, classification, keyword and entity extraction) and ASR (speech-to-text, speaker separation) capabilities.

Data Layer: Real-time ingestion via Kafka and the proprietary WMB message bus; supports both live streams and offline recordings.

Logic Layer: Core processing pipeline that performs (1) speaker-role identification, (2) semantic-tag detection, (3) speech scoring, and (4) notification. Role identification first tries a gender model, which settles the roles when the two speakers are of opposite gender; otherwise a Transformer-based model and a TextCNN model decide, followed by rule-based corrections.

Editing/Operation Layer: Web UI for data labeling, model-effectiveness evaluation, and analytics.

Web Management Layer: Manual review, task assignment, statistics, and result visualization.
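The four Logic Layer steps can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: `Utterance`, `InspectionResult`, `run_pipeline`, the toy stage functions, and the penalty-based scoring rule are all hypothetical, since the article does not describe how the platform actually scores a call or decides when to notify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Utterance:
    speaker: str  # diarization label such as "spk0" (hypothetical format)
    text: str


@dataclass
class InspectionResult:
    roles: Dict[str, str] = field(default_factory=dict)
    tags: List[str] = field(default_factory=list)
    score: int = 100
    notified: bool = False


def run_pipeline(
    utterances: List[Utterance],
    identify_roles: Callable[[List[Utterance]], Dict[str, str]],
    detect_tags: Callable[[List[Utterance], Dict[str, str]], List[str]],
    penalties: Dict[str, int],
    notify: Callable[[InspectionResult], None],
    alert_below: int = 60,
) -> InspectionResult:
    """Run the four stages in order; every stage is injected as a callable."""
    result = InspectionResult()
    result.roles = identify_roles(utterances)             # (1) speaker roles
    result.tags = detect_tags(utterances, result.roles)   # (2) quality tags
    # (3) Assumed scoring rule: start at 100 and subtract a penalty per tag.
    deducted = sum(penalties.get(tag, 0) for tag in result.tags)
    result.score = max(0, 100 - deducted)
    if result.score < alert_below:                        # (4) notification
        notify(result)
        result.notified = True
    return result


# Toy stand-ins for the real models, for demonstration only.
def toy_roles(utterances: List[Utterance]) -> Dict[str, str]:
    return {"spk0": "agent", "spk1": "customer"}


def toy_tags(utterances: List[Utterance], roles: Dict[str, str]) -> List[str]:
    return [
        "over_promise"
        for u in utterances
        if roles.get(u.speaker) == "agent" and "guarantee" in u.text.lower()
    ]
```

Passing the stages in as callables keeps the orchestration independent of any particular model, which is one plausible way to support the model swaps and AB tests mentioned later in the article.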

Speech Recognition & Role Identification

In stereo recordings the two speakers are already separated into two channels. Most recordings, however, are mono, so a diarization step is needed to split the audio by speaker. Diarization Error Rate (DER) measures the quality of that split: the lower the DER, the more accurate the speaker separation.
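DER is conventionally defined as the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The article does not say how 58.com measures it, so the helper below is a minimal sketch of that standard definition; the function name and the example durations are made up for illustration.

```python
def diarization_error_rate(
    missed_s: float,
    false_alarm_s: float,
    confusion_s: float,
    reference_speech_s: float,
) -> float:
    """Standard DER: (missed + false alarm + confusion) / reference speech time.

    All arguments are durations in seconds.
    """
    if reference_speech_s <= 0:
        raise ValueError("reference_speech_s must be positive")
    return (missed_s + false_alarm_s + confusion_s) / reference_speech_s


# Illustrative numbers only: 600 s of reference speech with
# 12 s missed, 6 s false alarm, and 18 s attributed to the wrong speaker.
example_der = diarization_error_rate(12.0, 6.0, 18.0, 600.0)
```

Because false alarms count speech that is absent from the reference, DER can exceed 1.0 on very poor output, so it is an error ratio rather than a bounded percentage.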

After diarization, a two‑stage role identification is performed:

Gender-based detection: A gender classifier labels each utterance as male or female; if the two speakers are of opposite gender, the gender label directly determines the agent and customer roles.

General role model: When gender cannot settle the roles, a Transformer model and a TextCNN model predict the overall role of each speaker (agent vs. customer). Sentence-level corrections are then applied to fix isolated misclassifications caused by imperfect diarization.
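The two stages and the sentence-level correction can be sketched as follows. This is an illustration under several assumptions that are not in the article: the agent's gender is taken to be known in advance (`agent_gender`), `score_agent` is a hypothetical stand-in for the Transformer/TextCNN models that returns the probability that a sentence was spoken by an agent, and the correction simply flips an utterance whose sentence-level score strongly contradicts its speaker's role.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Utterance = Tuple[str, str]  # (diarization speaker label, text)


def identify_roles(
    utterances: List[Utterance],
    genders: Dict[str, Optional[str]],
    agent_gender: Optional[str],
    score_agent: Callable[[str], float],
    override_margin: float = 0.4,
) -> List[str]:
    """Return "agent" or "customer" for every utterance, in order."""
    speakers = sorted({spk for spk, _ in utterances})
    if len(speakers) != 2:
        raise ValueError("expected exactly two speakers")
    a, b = speakers
    probs = [score_agent(text) for _, text in utterances]

    ga, gb = genders.get(a), genders.get(b)
    if ga and gb and ga != gb and agent_gender in (ga, gb):
        # Stage 1: opposite genders, so gender alone decides the roles.
        agent = a if ga == agent_gender else b
    else:
        # Stage 2: the speaker with the higher mean agent probability wins.
        by_speaker = defaultdict(list)
        for (spk, _), p in zip(utterances, probs):
            by_speaker[spk].append(p)
        mean = {spk: sum(ps) / len(ps) for spk, ps in by_speaker.items()}
        agent = a if mean[a] >= mean[b] else b

    roles = []
    for (spk, _), p in zip(utterances, probs):
        role = "agent" if spk == agent else "customer"
        # Sentence-level correction: flip only on strong contrary evidence,
        # which targets sentences that diarization gave to the wrong speaker.
        if role == "agent" and p < 0.5 - override_margin:
            role = "customer"
        elif role == "customer" and p > 0.5 + override_margin:
            role = "agent"
        roles.append(role)
    return roles


def toy_score(text: str) -> float:
    """Keyword stand-in for the real role models (demonstration only)."""
    lowered = text.lower()
    if "how may i help" in lowered:
        return 0.95
    if "i want to" in lowered:
        return 0.05
    return 0.5
```

In this sketch a sentence is reassigned only when its score lies more than `override_margin` away from 0.5 on the opposite side, so neutral sentences keep the role of the speaker that diarization assigned them to.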

Tag Recognition

The system defines business‑specific quality tags, e.g.:

Customer intends to complain

Sales insulted customer

Over‑promise

Work falsification

Three deep‑learning architectures are combined with rule‑based post‑processing:

TextCNN
Transformer
BERT

On internal test sets the sales-tag detection accuracy exceeds 90% and the service-tag detection accuracy reaches 87%.
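The article does not say how the three models and the rules are combined. One common arrangement, shown here purely as an assumption, averages the per-tag probabilities of the models and then lets each rule add or remove tags. `detect_tags`, `complaint_rule`, the stub models, and the tag identifiers are hypothetical names.

```python
import re
from typing import Callable, Dict, List, Sequence, Set

Model = Callable[[str], Dict[str, float]]  # text -> {tag: probability}
Rule = Callable[[str, Set[str]], Set[str]]  # (text, tags) -> adjusted tags


def detect_tags(
    text: str,
    models: Sequence[Model],
    rules: Sequence[Rule] = (),
    threshold: float = 0.5,
) -> List[str]:
    """Average model probabilities per tag, threshold, then apply rules."""
    if not models:
        raise ValueError("at least one model is required")
    totals: Dict[str, float] = {}
    for model in models:
        for tag, prob in model(text).items():
            totals[tag] = totals.get(tag, 0.0) + prob
    # A tag that a model does not report counts as probability 0 for it.
    tags = {tag for tag, total in totals.items() if total / len(models) >= threshold}
    for rule in rules:
        tags = rule(text, tags)
    return sorted(tags)


def complaint_rule(text: str, tags: Set[str]) -> Set[str]:
    """Example rule: an explicit complaint phrase always sets the tag."""
    if re.search(r"\bfile a complaint\b", text, flags=re.IGNORECASE):
        return tags | {"complaint_intent"}
    return tags


# Stub models with fixed outputs, standing in for TextCNN, Transformer, BERT.
def stub_textcnn(text: str) -> Dict[str, float]:
    return {"over_promise": 0.9, "complaint_intent": 0.2}


def stub_transformer(text: str) -> Dict[str, float]:
    return {"over_promise": 0.6, "complaint_intent": 0.4}


def stub_bert(text: str) -> Dict[str, float]:
    return {"over_promise": 0.3, "complaint_intent": 0.3}
```

Running rules after the models lets business staff enforce unambiguous cases, such as an explicit complaint phrase, without retraining anything.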

Review System

The web management console displays inspection results per call. Reviewers can:

Navigate to a tag and jump to the corresponding audio snippet.

Play the snippet directly from the UI.

Add or correct tags manually.

This workflow makes review 2 to 3 times faster on average than purely manual QA.
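Jumping from a tag to its audio snippet only requires that each detected tag keep the timestamps of the utterance that triggered it. A minimal sketch, assuming a hypothetical `TagHit` record and a small padding around the utterance so the reviewer hears some context:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TagHit:
    tag: str
    start_s: float  # utterance start within the call, in seconds
    end_s: float    # utterance end within the call, in seconds


def snippet_window(
    hit: TagHit, call_duration_s: float, padding_s: float = 2.0
) -> Tuple[float, float]:
    """Return the (start, end) playback window, clamped to the call length."""
    start = max(0.0, hit.start_s - padding_s)
    end = min(call_duration_s, hit.end_s + padding_s)
    return start, end
```

The player can then seek to the returned start time instead of making the reviewer scrub through the whole recording.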

Backend Architecture

The backend is built on 58.com’s proprietary RPC framework SCF and monitored by WMonitor. Storage components are chosen according to data characteristics:

WOS – object storage for raw audio files.

Redis – fast key‑value cache.

WTable – KV store for ASR results and intermediate metadata.

WCS – private‑cloud search index (sub‑20 ms latency on tens of millions of records).

MySQL – relational store for final inspection results.

Micro‑service composition:

Data Service – ingests real‑time and offline call data, enriches with organizational info, and transfers raw audio to WOS.

Core Service – orchestrates the pipeline, fetches AB‑test configs, invokes ASR, speaker‑identification, and tag‑recognition services, and pushes notifications.

ASR Service – wraps third‑party transcription APIs, provides an asynchronous SCF interface, and stores transcripts in WTable.

Speaker‑Identification Service – runs the gender, Transformer, and TextCNN models.

Tag‑Recognition Service – runs TextCNN/Transformer/BERT models plus business rules.

Model inference is served by the online prediction platform WPAI. Results are synchronized to WCS for low-latency queries in the web UI.
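The asynchronous shape of the ASR Service can be illustrated with `asyncio`. SCF, WTable, and the third-party transcription APIs are not public, so everything here is a stand-in: `AsrService` is a hypothetical class, `fake_transcribe` replaces the third-party call, a plain dict plays the role of the key-value store, and the `wos://` URLs are invented.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class AsrService:
    """Accepts transcription jobs without blocking the caller."""

    def __init__(
        self,
        transcribe: Callable[[str], Awaitable[str]],
        store: Dict[str, str],
    ) -> None:
        self._transcribe = transcribe
        self._store = store  # stand-in for a KV store keyed by call ID
        self._tasks: Dict[str, asyncio.Task] = {}

    def submit(self, call_id: str, audio_url: str) -> None:
        """Schedule a job and return immediately (needs a running loop)."""
        self._tasks[call_id] = asyncio.create_task(self._run(call_id, audio_url))

    async def _run(self, call_id: str, audio_url: str) -> None:
        transcript = await self._transcribe(audio_url)
        self._store[call_id] = transcript

    async def wait_all(self) -> None:
        """Block until every submitted job has stored its transcript."""
        await asyncio.gather(*self._tasks.values())


async def fake_transcribe(audio_url: str) -> str:
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"transcript of {audio_url}"


async def demo() -> Dict[str, str]:
    store: Dict[str, str] = {}
    service = AsrService(fake_transcribe, store)
    service.submit("call-1", "wos://recordings/call-1.wav")
    service.submit("call-2", "wos://recordings/call-2.wav")
    await service.wait_all()
    return store


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Decoupling submission from completion lets the Core Service keep dispatching calls while slow transcriptions finish in the background.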

Results and Outlook

The platform now processes hundreds of thousands of recordings daily across 13 business lines, saving nearly 1,000 staff-hours and improving overall service quality. The same voice-analysis stack can be reused for C2B scenarios such as low-quality call filtering and customer-need mining, where dual-channel audio eliminates the need for diarization.

Future work focuses on:

Increasing role‑identification accuracy.

Improving tag‑recognition precision.

Providing a lightweight integration SDK for other business units.

