How Kuaishou Delivered Real‑Time Deep‑Learning Voice Conversion on PC

Kuaishou has become the first company to deploy a deep‑learning‑based real‑time voice‑conversion system on PC clients, delivering stable, natural‑sounding converted speech with sub‑200 ms latency. This article reviews industry methods, the technical challenges, and the four‑module architecture that made the deployment possible.

Kuaishou Tech

May 17, 2021

Industry Context

Live‑streaming platforms such as Kuaishou and AcFun rely on interactive features like voice changing to engage audiences. Traditional DSP‑based voice changers modify pitch and formants, resulting in inconsistent quality, audible artifacts, and limited control, especially for cross‑gender conversion.

Deep‑learning‑based voice conversion promises higher naturalness and stable target timbres, but it demands substantial computational resources and typically runs on servers, making real‑time client‑side deployment challenging.

Kuaishou's Technical Breakthrough

Facing the need for high‑quality, low‑latency voice conversion on PC clients, Kuaishou's audio‑video technology and multimedia understanding teams optimized existing deep‑learning models. They compressed model size, introduced streaming processing, and leveraged multi‑core parallelism to achieve sub‑200 ms end‑to‑end latency while preserving naturalness.
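
To make the latency target concrete, the sketch below splits an audio stream into fixed-size chunks and adds up a rough per-chunk latency budget. The sample rate, chunk size, lookahead, compute time, and I/O buffering figures are illustrative assumptions rather than Kuaishou's published numbers, and the function names are hypothetical.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE_HZ = 16_000  # assumed sample rate, not stated in the article


def stream_chunks(samples: Sequence[float], chunk_ms: int,
                  sample_rate_hz: int = SAMPLE_RATE_HZ) -> Iterator[List[float]]:
    """Yield fixed-size chunks so each can be converted as soon as it arrives.

    A trailing partial chunk is dropped in this simplified sketch.
    """
    chunk_len = sample_rate_hz * chunk_ms // 1000
    for start in range(0, len(samples) - chunk_len + 1, chunk_len):
        yield list(samples[start:start + chunk_len])


def latency_budget_ms(chunk_ms: float, lookahead_ms: float,
                      compute_ms: float, io_buffer_ms: float) -> float:
    """Rough end-to-end latency: wait for one chunk, wait for model lookahead,
    run the models, then pass through capture and playback buffers."""
    return chunk_ms + lookahead_ms + compute_ms + io_buffer_ms


if __name__ == "__main__":
    one_second = [0.0] * SAMPLE_RATE_HZ
    chunks = list(stream_chunks(one_second, chunk_ms=40))
    print(len(chunks), "chunks of", len(chunks[0]), "samples")
    # All four figures below are made-up example values.
    total = latency_budget_ms(chunk_ms=40, lookahead_ms=40,
                              compute_ms=30, io_buffer_ms=40)
    print("estimated latency (ms):", total)
```

The budget shows why chunk size matters: a smaller chunk cuts waiting time, but the models then run more often, so per-chunk compute has to shrink as well. That is where model compression and parallelism come in.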

System Architecture

The final system comprises four core modules:

Noise Reduction Model: A deep neural network denoises the broadcaster's input, improving robustness against environmental noise.

Phoneme‑Level Feature Encoder: Extracts deep bottleneck features from raw audio, with size and compute optimizations for on‑device execution.

Voice Conversion Model: Maps speaker‑independent bottleneck features to the target speaker's acoustic characteristics using a non‑autoregressive encoder‑decoder framework, supporting multiple output timbres.

Neural Vocoder: Converts high‑dimensional features into a waveform with high fidelity, a high sampling rate, and low complexity.
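
The data flow through the four modules can be sketched as a simple chain of stages. Every function below is a toy placeholder standing in for a neural model (a noise gate, pairwise averaging, a gain, and sample repetition), and `timbre_gain` is a hypothetical stand-in for selecting a target speaker. Only the order of the stages comes from the article.

```python
from dataclasses import dataclass
from typing import List

Frame = List[float]  # one chunk of audio samples or feature values


def denoise(frame: Frame) -> Frame:
    """Placeholder for the noise-reduction model: a crude noise gate."""
    return [x if abs(x) >= 0.01 else 0.0 for x in frame]


def encode_bottleneck(frame: Frame) -> Frame:
    """Placeholder for the phoneme-level encoder: average each pair of samples."""
    return [(frame[i] + frame[i + 1]) / 2 for i in range(0, len(frame) - 1, 2)]


def convert_voice(features: Frame, timbre_gain: float) -> Frame:
    """Placeholder for the conversion model: scale features per target timbre."""
    return [f * timbre_gain for f in features]


def vocode(features: Frame) -> Frame:
    """Placeholder for the neural vocoder: repeat each feature to get samples."""
    out: Frame = []
    for f in features:
        out.extend([f, f])
    return out


@dataclass
class VoiceChangerPipeline:
    timbre_gain: float = 1.0  # hypothetical stand-in for a target-speaker choice

    def process(self, frame: Frame) -> Frame:
        """Run one chunk through the four stages in order."""
        clean = denoise(frame)
        features = encode_bottleneck(clean)
        converted = convert_voice(features, self.timbre_gain)
        return vocode(converted)


if __name__ == "__main__":
    pipeline = VoiceChangerPipeline(timbre_gain=2.0)
    print(pipeline.process([0.5, 0.5, 0.001, 0.001]))
```

Keeping each stage behind a plain function boundary means a stage can be swapped or optimized without touching the others, and the chain can be fed one chunk at a time.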

Additional innovations include a multi‑core parallel processing architecture and a low‑latency jitter buffer to mitigate network jitter, further enhancing stability on client devices.
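
As a rough illustration of what a jitter buffer does, the sketch below reorders packets by sequence number and holds back a small number of them before playout starts. It is a generic textbook-style design, not Kuaishou's implementation. The class name, the `target_depth` parameter, and the skip-on-gap policy are all assumptions.

```python
import heapq
from typing import List, Optional, Tuple


class JitterBuffer:
    """Reorders packets by sequence number and releases them in order.

    target_depth is a hypothetical tuning knob: the number of packets held
    back before playout starts, trading added latency for smoothness.
    """

    def __init__(self, target_depth: int = 2) -> None:
        self.target_depth = target_depth
        self._heap: List[Tuple[int, bytes]] = []
        self._next_seq = 0
        self._started = False

    def push(self, seq: int, payload: bytes) -> None:
        """Store an arriving packet; packets older than the playout point are dropped."""
        if seq >= self._next_seq:
            heapq.heappush(self._heap, (seq, payload))

    def pop(self) -> Optional[bytes]:
        """Return the next payload in order, or None if playout has to wait
        or the expected packet is treated as lost."""
        if not self._started:
            if len(self._heap) < self.target_depth:
                return None  # still pre-filling
            self._started = True
        # Discard stale or duplicate entries behind the playout point.
        while self._heap and self._heap[0][0] < self._next_seq:
            heapq.heappop(self._heap)
        if not self._heap:
            return None  # underrun: nothing to play yet
        seq, payload = self._heap[0]
        if seq == self._next_seq:
            heapq.heappop(self._heap)
            self._next_seq += 1
            return payload
        # Gap: a later packet is waiting, so treat the expected one as lost.
        self._next_seq += 1
        return None


if __name__ == "__main__":
    buf = JitterBuffer(target_depth=2)
    for seq, data in [(1, b"b"), (0, b"a"), (2, b"c")]:
        buf.push(seq, data)
    print([buf.pop() for _ in range(3)])
```

A larger `target_depth` absorbs more arrival-time variation but adds directly to end-to-end latency, so a low-latency design keeps it as small as the observed jitter allows.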

Future Directions

Kuaishou plans to extend the system with bidirectional dialect‑Mandarin conversion, personalized voice styles, and broader device support, continuing to push AI‑enabled audio interaction in live streaming.

References

Conference papers: Ying Zhang et al., “One‑shot Voice Conversion Based on Speaker Aware Module,” ICASSP 2021; Ying Zhang et al., “Non‑parallel Sequence‑to‑Sequence Voice Conversion for Arbitrary Speakers,” ISCSLP 2021.

Patents: 2021KI0494CN (Live streaming voice conversion), 2020KI1910CN (One‑sentence arbitrary speaker voice conversion), among others.
