
Voice Cloning Technology in AI Sales Assistant

This article introduces the AI sales assistant from 58.com, detailing its background, a few‑shot voice cloning approach using real dialogue data, multi‑accent naturalness optimization, deployment architecture, and future plans, while evaluating performance metrics and discussing challenges in speech synthesis quality and stability.

58 Tech

The AI Sales Assistant, developed by 58.com AI Lab, aims to improve sales efficiency by automating outbound calls and lead qualification using conversational AI technologies such as speech recognition, semantic understanding, and speech synthesis.

The presentation is divided into five parts: the background of the AI Sales Assistant, few‑shot voice cloning based on real dialogue data, multi‑accent naturalness optimization, deployment of the voice cloning service, and future plans.

For voice cloning, a pipeline processes real sales call recordings: VAD extracts speech segments, short clips (2‑10 s) are up‑sampled to 16 kHz, and an open‑source quality scoring model filters high‑quality samples. Selected clips are transcribed with ASR, manually corrected, loudness‑normalized, and aligned using MFA to obtain phoneme durations for model training.
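The sample-selection step in this pipeline amounts to filtering VAD segments by duration and quality score. A minimal sketch, assuming the upstream VAD and quality model have already produced `(duration, quality, path)` tuples; `select_clips` and the thresholds below are illustrative, not the actual 58.com implementation:

```python
def select_clips(segments, min_dur=2.0, max_dur=10.0, min_quality=0.8):
    """Keep clips between min_dur and max_dur seconds whose quality
    score passes the threshold; segments are (duration_s, quality, path)."""
    return [
        path
        for duration, quality, path in segments
        if min_dur <= duration <= max_dur and quality >= min_quality
    ]

segments = [
    (1.2, 0.95, "a.wav"),   # too short, dropped
    (5.0, 0.91, "b.wav"),   # kept
    (7.5, 0.60, "c.wav"),   # quality too low, dropped
    (9.9, 0.85, "d.wav"),   # kept
]
print(select_clips(segments))  # ['b.wav', 'd.wav']
```

The surviving clips then go on to ASR transcription, manual correction, loudness normalization, and MFA alignment as described above.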

Training uses a FastSpeech2 acoustic model with a speaker‑embedding layer for multi‑speaker support, replacing the unstable Tacotron2. The model is enhanced with a Conformer encoder for clearer articulation and a Multi‑Band MelGAN vocoder for stable waveform generation. Long‑sentence synthesis is handled by splitting text, synthesizing each fragment, applying duration masks, and concatenating the spectra before vocoding.
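The long-sentence strategy can be sketched as follows. Here `acoustic_model` and `vocoder` are stand-ins for the FastSpeech2 model and the Multi‑Band MelGAN vocoder, and the punctuation-based splitter is an assumed fragment boundary; the key point is that mel spectra are concatenated along the time axis before a single vocoder pass:

```python
import re

def split_text(text):
    # Split on sentence-final punctuation (CJK and ASCII), keeping
    # the punctuation attached to the preceding fragment.
    parts = re.split(r"(?<=[。！？.!?])", text)
    return [p for p in parts if p.strip()]

def synthesize_long(text, acoustic_model, vocoder):
    # Synthesize each fragment's mel spectrogram independently,
    # concatenate the frames along the time axis, then vocode once
    # over the joined spectrum to avoid seams in the waveform.
    mels = [acoustic_model(frag) for frag in split_text(text)]
    joined = [frame for mel in mels for frame in mel]
    return vocoder(joined)
```

For example, with stub models `acoustic_model = lambda t: [[len(t)]]` and `vocoder = lambda m: m`, the fragments are processed separately but emerge as one contiguous sequence of frames.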

Evaluation covers intelligibility, naturalness, and speaker‑similarity scores, showing that cloned voices achieve quality comparable to human recordings, although AB tests reveal a slight drop in conversion rate during real outbound calls.

To improve naturalness across accents and speaking styles, three modules are introduced: audio quality enhancement (denoising, reverberation removal, and avoiding silence‑trim artifacts), pronunciation and stability improvements (using gradient‑reversal layers and Conformer encoders), and text‑style transfer via a Prompt‑based T5 model (PromptCLUE) that rewrites formal scripts into colloquial, accent‑aware text.
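The text-style-transfer step feeds the model an instruction-style prompt wrapping the original script. The template below is a hypothetical illustration of that prompt construction, not the actual PromptCLUE prompt used in production:

```python
def build_rewrite_prompt(script, accent="Northeastern Mandarin"):
    # Hypothetical prompt template for a PromptCLUE-style T5 rewriter:
    # instruct the model to keep the meaning but shift register and accent.
    return (
        f"Rewrite the following sales script in a colloquial "
        f"{accent} style, keeping the meaning unchanged:\n{script}"
    )

prompt = build_rewrite_prompt("We offer premium listing services for landlords.")
```

The rewritten output can then be synthesized with the matching accented voice, so the script style and the voice style agree.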

The deployment architecture separates a front‑end text analysis service (normalization, tokenization, phoneme and prosody extraction) from a back‑end service that hosts the acoustic model and vocoder via TensorFlow Serving, achieving an average real‑time factor of ~0.02.
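The real-time factor quoted above is the ratio of synthesis time to the duration of audio produced; a minimal sketch of the metric (function name is ours):

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    # RTF = time spent synthesizing / duration of audio produced.
    # An RTF of 0.02 means 1 s of speech takes ~20 ms to generate,
    # i.e. synthesis runs ~50x faster than real time.
    return synthesis_seconds / audio_seconds

print(round(real_time_factor(0.1, 5.0), 3))  # 0.02
```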

Controllable generation is supported by scaling duration for speed control, modifying pitch via the front‑end phoneme module, and inserting explicit pause tokens to achieve precise silences.
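The speed-control mechanism above can be sketched as a uniform scaling of the predicted per-phoneme frame durations before they reach the decoder; the helper below is an illustrative sketch, not the production code:

```python
def scale_durations(phoneme_durations, speed=1.0):
    # speed > 1 shortens per-phoneme durations (faster speech);
    # speed < 1 lengthens them (slower speech). Each phoneme keeps
    # at least one frame so no phoneme is dropped entirely.
    return [max(1, round(d / speed)) for d in phoneme_durations]

print(scale_durations([4, 6, 8], speed=2.0))  # [2, 3, 4]
```

Pitch and pauses are handled analogously on the front-end side: pitch targets are modified at the phoneme level, and explicit pause tokens are inserted into the phoneme sequence to force silences of a chosen length.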

Future work focuses on further improving audio quality and consistency, accelerating voice cloning for new speakers, and exploring zero‑shot cloning techniques.

Tags: few-shot learning, speech synthesis, text-to-speech, voice cloning, AI sales assistant, multi-accent
Written by

58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.
