
Technical Overview of 58.com Intelligent Voice Analysis Platform

The article presents a comprehensive technical overview of 58.com’s intelligent voice analysis platform, detailing its business background, system architecture, speech and NLP technologies, speaker diarization methods, model performance, data labeling workflow, and practical applications in call‑center quality inspection and user profiling.

58 Tech

Background: 58.com's life-service platform connects millions of consumers (C-end users) with merchants (B-end users), and the resulting calls produce a large volume of recordings worth analyzing. To unlock this value, the company built an intelligent voice analysis platform, code-named "Lingxi", that transcribes speech with in-house speech recognition and then applies natural language understanding to tasks such as voice quality inspection and user profiling.

Overall Architecture: The platform consists of a foundational service layer providing speech processing (mono‑channel speaker separation and ASR) and NLP capabilities (text classification, clustering, sequence labeling). On top of this, business‑specific text mining modules (speaker role identification, quality‑check tagging, user‑profile tagging, summarization) are built, supported by a custom annotation system and a web‑based integration layer for API registration, tag customization, and data export.

Speech and Semantic Technology Implementation: The platform initially relied on third-party speech engines. Since October 2019 the team has developed its own ASR and speaker-separation models, whose word error rate is 2.4% to 15.1% lower than that of the third-party engines. The ASR service is deployed with Docker and exposed over gRPC; it decodes 20 hours of 8 kHz audio per CPU-hour and 240 hours per GPU-hour on a T4. The speaker diarization pipeline applies voice activity detection (VAD) to find voice segments, embeds each segment with a simplified ResNet-34, and groups the embeddings with K-means clustering, achieving a diarization error rate 6.6% lower than that of external vendors.
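
The clustering step of the diarization pipeline can be illustrated with a minimal sketch. It assumes VAD and the embedding network have already produced one fixed-length vector per voice segment. The toy 2-D vectors, the `kmeans` helper, and the farthest-point initialisation are illustrative assumptions, not the platform's actual code.

```python
import math

def kmeans(points, k, iters=20):
    """Plain K-means over tuples of floats; returns one cluster label per point."""
    # Deterministic farthest-point initialisation (an assumption made so the
    # sketch is reproducible; it is not taken from the article).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

# Hypothetical VAD segments: (start_sec, end_sec, embedding).
segments = [
    (0.0, 2.1, (0.9, 0.1)),
    (2.4, 5.0, (0.1, 0.8)),
    (5.3, 6.0, (0.8, 0.2)),
    (6.2, 9.5, (0.2, 0.9)),
]
labels = kmeans([emb for _, _, emb in segments], k=2)
for (start, end, _), speaker in zip(segments, labels):
    print(f"{start:4.1f}-{end:4.1f}s  speaker_{speaker}")
```

With two parties on a mono-channel call, k is fixed at 2 here; deciding which cluster is the agent and which is the customer is a separate step (speaker role identification).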

Model Details: The team trained two kinds of ASR model: a Kaldi chain model (hybrid HMM-DNN) and an ESPnet Transformer-CTC end-to-end model. The chain model improved on the third-party engines by 2.4% absolute, and the end-to-end model added a further 3.6%. The models are deployed in Docker containers behind gRPC interfaces for scalable inference.
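
To make the accuracy figures concrete, here is a minimal sketch of how an error rate of this kind is usually computed: the edit distance between reference and hypothesis tokens, divided by the reference length. The function name, sample sentences, and baseline rates are illustrative assumptions; the article does not describe the evaluation scripts actually used. For Chinese transcripts the same routine is typically applied per character rather than per word.

```python
def error_rate(reference, hypothesis):
    """Levenshtein distance between two token lists, divided by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(reference)

ref = "please confirm the rental price today".split()
hyp = "please confirm rental prize today".split()
wer = error_rate(ref, hyp)  # one deletion + one substitution over 6 words

# Absolute vs. relative improvement, using hypothetical error rates.
baseline, improved = 0.200, 0.176
absolute_gain = baseline - improved       # difference in percentage points
relative_gain = absolute_gain / baseline  # fraction of the baseline error removed
```

An "absolute" improvement, as quoted above, is the plain difference between two error rates, which is why it reads smaller than the corresponding relative figure.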

Applications: The platform powers a voice quality‑inspection system that automatically tags calls with issues such as profanity or false promises, dramatically reducing manual review effort. It also enables downstream AI services like complaint prediction, automatic call summarization, sales‑lead scoring, and richer user profiling based on spoken intent, supporting recommendation, governance, and agent‑performance evaluation.
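
As a rough illustration of what automatic tagging of a transcript looks like, the sketch below flags agent utterances with a keyword lookup. This is a deliberately simple stand-in: the platform itself is described as using NLP models such as text classification, and the tag names, trigger phrases, `tag_call` function, and agent-only rule are all hypothetical.

```python
# Hypothetical tag -> trigger phrases. A production system would use a
# trained text classifier rather than a fixed phrase list.
RULES = {
    "false_promise": ("guaranteed", "100% success"),
    "profanity": ("damn",),
}

def tag_call(utterances):
    """Return the sorted list of tags triggered by agent utterances."""
    tags = set()
    for speaker, text in utterances:
        if speaker != "agent":
            continue  # assumption: only the agent side is inspected
        lowered = text.lower()
        for tag, phrases in RULES.items():
            if any(phrase in lowered for phrase in phrases):
                tags.add(tag)
    return sorted(tags)

# A diarized, role-labelled transcript (made-up example).
call = [
    ("agent", "This listing is guaranteed to rent within a week."),
    ("customer", "Damn, that is fast."),
    ("agent", "Shall I book a viewing?"),
]
tags = tag_call(call)
print(tags)
```

The input shape shows why speaker separation and role identification come first: a tag like this is only meaningful once each utterance is attributed to the agent or the customer.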

Conclusion: The intelligent voice analysis platform opens new opportunities for 58.com’s AI capabilities, with future work focusing on improving recognition in colloquial and noisy environments, integrating multimodal pre‑training, and expanding vertical‑specific applications to create business value.

Natural Language Processing, Speech Recognition, AI Platform, speaker diarization, data labeling, voice analytics