Deploy Alibaba Qwen3‑TTS on Ubuntu: 3‑Second Voice Cloning with 97 ms Latency

This guide walks through installing and running Alibaba's open‑source Qwen3‑TTS on Ubuntu, covering environment setup, GPU requirements, model selection, Python virtual‑environment creation, code examples for voice cloning and voice design, low‑latency streaming, Web UI launch, and common troubleshooting tips.


Qwen3‑TTS Core Highlights

Before deploying, it helps to know what the model offers:

Multilingual: supports Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.

Fast voice cloning: a 3‑second reference recording is enough to produce a closely matching synthetic voice.

Voice design: describe the desired timbre (e.g., “a deep, magnetic middle‑aged male voice”) and the model generates a matching voice.

Ultra‑low latency: end‑to‑end synthesis delay as low as 97 ms, suitable for real‑time dialogue.

Instruction control: natural‑language commands adjust speaking style, speed, and emotion.
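
To see how a latency figure like the one above compares with your own hardware, a small timing harness can help. The sketch below is an illustration only: measure_latency_ms and fake_synthesize are hypothetical names, and fake_synthesize is a stand-in that merely sleeps. It times a complete call, which is not necessarily the same measurement as the quoted 97 ms.

```python
import statistics
import time


def measure_latency_ms(synthesize, text, repeats=5):
    """Return the median wall-clock time of synthesize(text) in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        synthesize(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def fake_synthesize(text):
    # Hypothetical stand-in for a real TTS call: it only sleeps for 10 ms.
    time.sleep(0.01)
    return b""


if __name__ == "__main__":
    latency = measure_latency_ms(fake_synthesize, "hello")
    print(f"median latency: {latency:.1f} ms")
```

Swap fake_synthesize for a function that calls your synthesis pipeline, and run one warm-up call first so that model loading is not counted.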

Environment Preparation

The recommended setup is Ubuntu 22.04 or 24.04 with an NVIDIA GPU that has at least 12 GB of VRAM (24 GB for the best experience).

Minimum requirements:

OS: Ubuntu 20.04+

Python: 3.10+

CUDA: 11.8+

PyTorch: 2.0+
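
The minimums above can be checked from Python before going further. This is a minimal sketch: the helper names (parse_version, meets_minimum, report) are illustrative, and the CUDA version it reports is the one PyTorch was built against, which may differ from the driver's.

```python
import re
import sys

# Minimum versions taken from the requirements list above.
MIN_PYTHON = (3, 10)
MIN_CUDA = (11, 8)
MIN_TORCH = (2, 0)


def parse_version(text):
    """Turn a string such as '2.1.0+cu118' into the tuple (2, 1)."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(found, minimum):
    """True if the version string 'found' is at least 'minimum'."""
    return parse_version(found) >= minimum


def report():
    """Print what is installed next to the minimums listed above."""
    py = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {py}: {'ok' if meets_minimum(py, MIN_PYTHON) else 'too old'}")
    try:
        import torch  # optional, so the script also runs before installation
    except ImportError:
        print("PyTorch: not installed yet")
        return
    ok = meets_minimum(torch.__version__, MIN_TORCH)
    print(f"PyTorch {torch.__version__}: {'ok' if ok else 'too old'}")
    cuda = torch.version.cuda  # None for CPU-only builds
    if cuda is None or not torch.cuda.is_available():
        print("CUDA: no usable GPU detected")
        return
    print(f"CUDA {cuda}: {'ok' if meets_minimum(cuda, MIN_CUDA) else 'too old'}")
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0 VRAM: {vram_gb:.1f} GB")


if __name__ == "__main__":
    report()
```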

Deployment Steps

1. Create a Python virtual environment

Use conda to avoid polluting the system environment.

# Create a virtual environment named qwen-tts with Python 3.10
conda create -n qwen-tts python=3.10 -y

# Activate the environment
conda activate qwen-tts

2. Install Qwen3‑TTS

Install the released Python package with pip, or build from source for the latest development version.

pip install qwen-tts

To install from source:

git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
pip install -e .

Note: the installation will automatically download PyTorch and other dependencies; ensure a stable internet connection.
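
A quick way to confirm that the installation worked is to check that the modules can be found without importing them. This sketch assumes the import name is qwen_tts, as used in the examples in this article; is_installed is an illustrative helper.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("torch", "qwen_tts"):
        status = "found" if is_installed(name) else "MISSING"
        print(f"{name}: {status}")
```

If either line reports MISSING, check that the qwen-tts environment is the one currently activated.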

3. Model download

The qwen-tts library automatically fetches model weights from Hugging Face or ModelScope.

Available model variants:

Qwen3‑TTS-12Hz-1.7B-VoiceDesign : excels at generating voices from textual descriptions.

Qwen3‑TTS-12Hz-1.7B-Base : the base model, optimized for voice cloning.

0.6B version : lightweight, suitable for resource‑constrained devices.
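
Because the weights are fetched on first use, it can be convenient to download them ahead of time. The sketch below uses snapshot_download from the huggingface_hub package. The repository IDs are the two that appear in this article's examples, and the helper names and variant keys are illustrative.

```python
# Repository IDs as they appear in this article's examples.
MODEL_REPOS = {
    "base": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "voice_design": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
}


def resolve_repo(variant):
    """Map a short variant name to its repository ID."""
    if variant not in MODEL_REPOS:
        choices = ", ".join(sorted(MODEL_REPOS))
        raise ValueError(f"unknown variant {variant!r}; choose from: {choices}")
    return MODEL_REPOS[variant]


def prefetch(variant, cache_dir=None):
    """Download every file of one variant and return the local directory."""
    # Imported lazily so resolve_repo works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=resolve_repo(variant), cache_dir=cache_dir)


if __name__ == "__main__":
    print(resolve_repo("base"))
    # Uncomment to start the download, which is several gigabytes:
    # print(prefetch("base"))
```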

Code Demonstrations

After installation, the following scripts illustrate the main functionalities.

Scenario 1: Voice Cloning

Provide a reference recording of 3 to 10 seconds (e.g., my_voice.wav) and, optionally, its transcript.

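import scipy.io.wavfile
from qwen_tts.pipeline import QwenTTSPipeline

# API names follow the original article; check the repository README
# if your installed version differs.
# Initialize the pipeline with the Base model, the variant used for cloning
pipeline = QwenTTSPipeline(
    model_id="Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device="cuda"
)

# Reference recording and its transcript
ref_audio_path = "my_voice.wav"
# "This is a recording of mine, used for voice cloning."
ref_text = "这是我的一段录音,用于声音克隆。"
# "Hello everyone, this is my voice cloned with Qwen3-TTS. Does it sound like me?"
text_to_generate = "大家好,这是我用 Qwen3-TTS 克隆出来的声音,听起来像我吗?"

audio = pipeline.run(
    text=text_to_generate,
    ref_audio_path=ref_audio_path,
    ref_text=ref_text
)

# Save the waveform at the pipeline's sample rate
scipy.io.wavfile.write("cloned_output.wav", pipeline.sample_rate, audio)
print("Voice cloning finished; saved to cloned_output.wav")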
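
Since the reference recording is expected to be roughly 3 to 10 seconds long, it may be worth checking its duration before running the model. The following sketch uses only the standard library's wave module, so it handles uncompressed PCM WAV files only. The helper names and the write_silence test fixture are illustrative.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference(path, min_seconds=3.0, max_seconds=10.0):
    """Raise ValueError if the recording is outside the suggested range."""
    duration = wav_duration_seconds(path)
    if not min_seconds <= duration <= max_seconds:
        raise ValueError(
            f"{path} is {duration:.1f} s long; "
            f"expected {min_seconds:.0f} to {max_seconds:.0f} s"
        )
    return duration


def write_silence(path, seconds, rate=16000):
    """Write a mono 16-bit WAV file of silence (only used to try the check)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
    write_silence(demo_path, 5)
    print(f"duration: {check_reference(demo_path):.1f} s")
```

In practice you would call check_reference("my_voice.wav") before building the pipeline.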

Scenario 2: Voice Design

Generate a voice from a textual description without any reference audio.

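import scipy.io.wavfile
from qwen_tts.pipeline import QwenTTSPipeline

# Initialize the pipeline with the VoiceDesign model
pipeline = QwenTTSPipeline(
    model_id="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device="cuda"
)

# "A young woman's voice, gentle in tone, with a slight southern accent
# and a moderate pace."
voice_description = "一个年轻女性的声音,语气温柔,带有轻微的南方口音,语速适中。"
# "Life is more than the struggles in front of you; there are also poetry
# and distant fields."
text = "生活不止眼前的苟且,还有诗和远方的田野。"

audio = pipeline.run(
    text=text,
    voice_description=voice_description
)

# Save the waveform at the pipeline's sample rate
scipy.io.wavfile.write("designed_voice.wav", pipeline.sample_rate, audio)
print("Voice design finished; saved to designed_voice.wav")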

Launch Web UI Demo

If you prefer a graphical interface, the repository may ship a web demo (typically Gradio or Streamlit). The file names below may differ in your checkout.

# Assuming the source has been cloned
cd Qwen3-TTS
pip install -r requirements_web.txt  # install UI dependencies if present
python web_demo.py

Then open http://127.0.0.1:7860 in a browser to record, type text, and synthesize audio.
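
If the page does not load, a first step is to check whether anything is listening on that port. This is a small sketch with the standard library's socket module; is_port_open is an illustrative helper, and 7860 is simply the port from the URL above.

```python
import socket


def is_port_open(host="127.0.0.1", port=7860, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "reachable" if is_port_open() else "not reachable"
    print(f"Web UI on 127.0.0.1:7860 is {state}")
```

On a remote server the demo usually has to listen on an externally reachable address, or be forwarded over SSH, before a browser on another machine can open it.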

Common Issues and Tips

Insufficient VRAM: the 1.7B model needs roughly 6 to 8 GB of VRAM in FP16. If memory is limited, use the 0.6B version or a quantized model.

Network problems: if downloads from Hugging Face are slow, try the ModelScope mirror.

Inference speed: streaming is fast, but non‑streaming generation of long texts can still take noticeable time, so split long inputs into sentences.
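
The sentence splitting suggested above can be sketched with a regular expression that covers both Western and full-width CJK punctuation. This is a simple illustration rather than a full sentence segmenter: it will also split at the dot in decimals or abbreviations such as "3.5".

```python
import re

# A run of non-terminator characters followed by any closing punctuation.
# Terminators: Western . ! ? and full-width 。 ！ ？
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    pieces = (piece.strip() for piece in _SENTENCE.findall(text))
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    for sentence in split_sentences("Hello world. How are you? 你好。今天天气不错！"):
        print(sentence)
```

Each sentence can then be synthesized on its own and the resulting audio segments concatenated in order.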

Conclusion

Qwen3‑TTS lowers the barrier for high‑quality open‑source speech synthesis. Whether you need voiceovers for videos, intelligent assistants, or personal experiments, the model provides multilingual support, rapid voice cloning, and low‑latency streaming on Ubuntu systems with a GPU.

If you have an Ubuntu server and an idle GPU, follow the steps above to experience AI‑driven voice generation.

References

Qwen3‑TTS GitHub: https://github.com/QwenLM/Qwen3-TTS

Qwen Blog: https://qwen.ai/blog

Written by

Ubuntu
