How Kuaishou Delivered Real‑Time Deep‑Learning Voice Conversion on PC
Kuaishou is the first company to deploy a deep‑learning‑based real‑time voice‑conversion system on PC clients, delivering stable, natural‑sounding converted speech with end‑to‑end latency under 200 ms. This article reviews industry methods, the technical challenges, and the four‑module architecture that made it possible.
Industry Context
Live‑streaming platforms such as Kuaishou and AcFun rely on interactive features like voice changing to engage audiences. Traditional DSP‑based voice changers modify pitch and formants, resulting in inconsistent quality, audible artifacts, and limited control, especially for cross‑gender conversion.
Deep‑learning‑based voice conversion promises higher naturalness and stable target timbres, but it demands substantial computational resources and typically runs on servers, making real‑time client‑side deployment challenging.
Kuaishou's Technical Breakthrough
Facing the need for high‑quality, low‑latency voice conversion on PC clients, Kuaishou's audio‑video technology and multimedia understanding teams optimized existing deep‑learning models. They compressed model size, introduced streaming processing, and leveraged multi‑core parallelism to achieve sub‑200 ms end‑to‑end latency while preserving naturalness.
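To make the sub‑200 ms target concrete, the following is a minimal latency‑budget sketch for a chunked streaming pipeline. Every number in it (sample rate, chunk length, look‑ahead, compute time, I/O buffering) is an illustrative assumption, not a figure published by Kuaishou, and `StreamConfig` and the helper functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamConfig:
    # All defaults are assumed values chosen only for illustration.
    sample_rate: int = 16000            # samples per second
    chunk_ms: int = 40                  # audio consumed per inference step
    lookahead_ms: int = 20              # future context the models wait for
    compute_ms_per_chunk: float = 25.0  # measured inference time per chunk


def algorithmic_latency_ms(cfg: StreamConfig) -> float:
    """Delay imposed by buffering alone, before any computation."""
    return cfg.chunk_ms + cfg.lookahead_ms


def end_to_end_latency_ms(cfg: StreamConfig, io_buffer_ms: float = 40.0) -> float:
    """Buffering delay plus compute time plus capture/playback buffering."""
    return algorithmic_latency_ms(cfg) + cfg.compute_ms_per_chunk + io_buffer_ms


def is_real_time(cfg: StreamConfig) -> bool:
    """A stream keeps up only if each chunk is processed faster than it arrives."""
    return cfg.compute_ms_per_chunk < cfg.chunk_ms


def split_into_chunks(samples: List[float], cfg: StreamConfig) -> List[List[float]]:
    """Cut a signal into fixed-size chunks, as a streaming front end would."""
    size = cfg.sample_rate * cfg.chunk_ms // 1000
    return [samples[i:i + size] for i in range(0, len(samples), size)]


cfg = StreamConfig()
budget = end_to_end_latency_ms(cfg)
chunks = split_into_chunks([0.0] * cfg.sample_rate, cfg)
```

Under these assumed values the budget is 40 + 20 + 25 + 40 = 125 ms, which leaves headroom below 200 ms. Shrinking the chunk lowers buffering delay but tightens the real‑time condition, which is one reason model compression and streaming are addressed together.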
System Architecture
The final system comprises four core modules:
Noise Reduction Model: A deep neural network denoises the broadcaster’s input, improving robustness against environmental noise.
Phoneme‑Level Feature Encoder: Extracts deep bottleneck features from raw audio, with size and compute optimizations for on‑device execution.
Voice Conversion Model: Maps speaker‑independent bottleneck features to target speaker acoustic characteristics using a non‑autoregressive encoder‑decoder framework, supporting multiple output timbres.
Neural Vocoder: Converts high‑dimensional features to waveform with high fidelity, high sampling rate, and low complexity.
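The data flow through the four modules can be sketched as a chain of per‑chunk stages. The stage bodies below are trivial stand‑ins (a noise gate, an identity function, a gain, and a clipper), not the neural networks the article describes. All function names and parameters are hypothetical and serve only to show how chunks pass from one module to the next.

```python
from typing import Callable, Iterable, Iterator, List

Chunk = List[float]
Stage = Callable[[Chunk], Chunk]


def denoise(chunk: Chunk, threshold: float = 0.01) -> Chunk:
    # Stand-in for the noise reduction model: a simple noise gate.
    return [0.0 if abs(x) < threshold else x for x in chunk]


def encode_bottleneck(chunk: Chunk) -> Chunk:
    # Stand-in for the phoneme-level feature encoder: identity mapping.
    return list(chunk)


def make_converter(timbre_gain: float) -> Stage:
    # Stand-in for the voice conversion model: one gain per target timbre.
    def convert(features: Chunk) -> Chunk:
        return [x * timbre_gain for x in features]
    return convert


def vocode(features: Chunk) -> Chunk:
    # Stand-in for the neural vocoder: clip to a valid waveform range.
    return [max(-1.0, min(1.0, x)) for x in features]


def run_pipeline(chunks: Iterable[Chunk], stages: List[Stage]) -> Iterator[Chunk]:
    """Apply each stage in order to every incoming chunk, one chunk at a time."""
    for chunk in chunks:
        out = chunk
        for stage in stages:
            out = stage(out)
        yield out


stages = [denoise, encode_bottleneck, make_converter(2.0), vocode]
result = list(run_pipeline([[0.001, 0.5, -0.9]], stages))
```

Because `run_pipeline` is a generator, output for a chunk is available as soon as that chunk has passed all four stages, which is the property a streaming deployment depends on. Switching the target timbre amounts to swapping the third stage.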
Additional innovations include a multi‑core parallel processing architecture and a low‑latency jitter buffer to mitigate network jitter, further enhancing stability on client devices.
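The article does not describe how the jitter buffer is built. The following is a generic, minimal sketch of the idea, assuming frames carry sequence numbers: the buffer waits until a small number of frames is queued, then releases them in order and reports a gap when a frame is missing. The class name, the `target_depth` parameter, and the drop‑late‑frames policy are assumptions made for illustration.

```python
from typing import Dict, Optional


class JitterBuffer:
    """Reorders frames by sequence number and smooths uneven arrival times."""

    def __init__(self, target_depth: int = 2) -> None:
        # target_depth trades latency for robustness: more queued frames
        # absorb more jitter but delay playback by the same amount.
        self.target_depth = target_depth
        self._frames: Dict[int, bytes] = {}
        self._next_seq = 0
        self._started = False

    def push(self, seq: int, frame: bytes) -> None:
        # Frames that arrive after their playback slot has passed are dropped.
        if seq >= self._next_seq:
            self._frames[seq] = frame

    def pop(self) -> Optional[bytes]:
        # Hold playback until the initial cushion of frames has built up.
        if not self._started:
            if len(self._frames) < self.target_depth:
                return None
            self._started = True
        # After start-up, None means the frame is missing and the caller
        # should conceal the gap (for example, by repeating the last frame).
        frame = self._frames.pop(self._next_seq, None)
        self._next_seq += 1
        return frame


buf = JitterBuffer(target_depth=2)
buf.push(1, b"frame-1")      # arrives out of order
first = buf.pop()            # cushion not yet built
buf.push(0, b"frame-0")
second = buf.pop()
third = buf.pop()
```

A deeper cushion tolerates more arrival‑time variation but adds directly to end‑to‑end delay, so in a low‑latency system the depth would have to be kept small.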
Future Directions
Kuaishou plans to extend the system with bidirectional dialect‑Mandarin conversion, personalized voice styles, and broader device support, continuing to push AI‑enabled audio interaction in live streaming.
References
Conference papers: Ying Zhang et al., “One‑shot Voice Conversion Based on Speaker Aware Module,” ICASSP 2021; Ying Zhang et al., “Non‑parallel Sequence‑to‑Sequence Voice Conversion for Arbitrary Speakers,” ISCSLP 2021.
Patents: 2021KI0494CN (Live streaming voice conversion), 2020KI1910CN (One‑sentence arbitrary speaker voice conversion), among others.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
