Deploy Alibaba’s Qwen3‑TTS on Ubuntu and Clone Your Voice in 3 Seconds

This guide walks through installing the open‑source Qwen3‑TTS model on Ubuntu, covering environment setup, GPU requirements, package installation, model variants, and hands‑on Python scripts for ultra‑low‑latency voice cloning and text‑driven voice design.


Alibaba's Qwen team released the Qwen3‑TTS model, which supports ten languages, 3‑second voice cloning, voice design from text, and end‑to‑end synthesis latency as low as 97 ms, making it suitable for real‑time dialogue.

Core Highlights

Multilingual: supports Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.

Fast Voice Cloning: a 3‑second reference clip is enough to produce a high‑similarity clone (zero‑shot voice cloning).

Voice Design: generates a voice from a textual description.

Ultra‑low latency: end‑to‑end synthesis latency as low as 97 ms, suited to real‑time scenarios.

Instruction Control: natural‑language commands adjust tone, speed, and emotion.

Environment Setup

Deploy on Ubuntu 22.04 or 24.04 (20.04 is the minimum supported release) with an NVIDIA GPU that has at least 12 GB of memory; 24 GB is recommended. Required software: Python 3.10+, CUDA 11.8+, PyTorch 2.0+.
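The requirements above can be checked from Python before installing anything else. The following is a minimal sketch, not part of Qwen3‑TTS: `check_environment` is a hypothetical helper name, and it only reports on PyTorch and the GPU if PyTorch is already installed.

```python
import shutil
import sys


def check_environment(min_python=(3, 10)):
    """Collect basic facts about the local setup (hypothetical helper, not a Qwen3-TTS API)."""
    report = {
        "python_ok": sys.version_info >= min_python,
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "torch_version": None,
        "cuda_available": False,
        "vram_gb": None,
    }
    try:
        import torch  # optional: only importable once PyTorch is installed
    except ImportError:
        return report

    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Total memory of the first GPU, converted from bytes to GiB
        total_bytes = torch.cuda.get_device_properties(0).total_memory
        report["vram_gb"] = round(total_bytes / 1024**3, 1)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A card sold as 12 GB typically reports slightly less than 12 GiB here, so treat `vram_gb` as a rough figure rather than a strict pass/fail threshold.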

Deployment Steps

1. Create a Python virtual environment

# Create a conda environment named qwen-tts with Python 3.10
conda create -n qwen-tts python=3.10 -y

# Activate the environment
conda activate qwen-tts

2. Install Qwen3‑TTS

Install the released package:

pip install qwen-tts

Or install the latest development version from source:

git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
pip install -e .
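To confirm that either install route worked, the package metadata can be queried from inside the activated environment. This is a small sketch using only the standard library; it assumes the distribution is registered under the name `qwen-tts` (the name used in the pip command above), which may differ for a source install.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a pip distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# "qwen-tts" is the distribution name assumed from the pip command in this guide
print("qwen-tts:", installed_version("qwen-tts") or "not installed")
```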

3. Model download

The qwen-tts library automatically fetches model weights from Hugging Face or ModelScope. Available model variants:

Qwen3‑TTS‑12Hz‑1.7B‑VoiceDesign: optimized for voice design from text.

Qwen3‑TTS‑12Hz‑1.7B‑Base: baseline model for voice cloning.

0.6B version: lightweight model for resource‑constrained devices.
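When choosing between the 1.7B and 0.6B variants, a rough lower bound on GPU memory comes from the weights alone: parameter count times bytes per parameter. The sketch below assumes the nominal parameter counts in the variant names and 2 bytes per parameter (FP16 or BF16). It ignores activations, the audio codec, and CUDA overhead, so real usage is higher than these numbers.

```python
def weight_memory_gib(params_billions, bytes_per_param=2):
    """Memory taken by the weights alone: parameter count x bytes per parameter, in GiB."""
    return params_billions * 1e9 * bytes_per_param / 1024**3


# Nominal sizes taken from the variant names; actual parameter counts may differ slightly
for name, size in [("1.7B", 1.7), ("0.6B", 0.6)]:
    print(f"{name}: ~{weight_memory_gib(size):.1f} GiB of weights in FP16")
```

By this estimate the 1.7B weights take a little over 3 GiB in FP16 and the 0.6B weights a little over 1 GiB, which is why the smaller variant is the one to try first on a memory-limited GPU.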

Code Demo

Scenario 1: Voice Cloning

Provide a 3‑10 s reference audio (e.g., my_voice.wav) and generate speech that mimics the speaker.

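import scipy.io.wavfile
from qwen_tts.pipeline import QwenTTSPipeline

# Load the Base model, the variant used for voice cloning, onto the GPU
pipeline = QwenTTSPipeline(
    model_id="Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device="cuda"
)

# Reference clip (3-10 s) and the exact transcript of what is said in it
ref_audio_path = "my_voice.wav"
# "This is a recording of my voice, used for voice cloning."
ref_text = "这是我的一段录音,用于声音克隆。"
# Text to speak in the cloned voice:
# "Hello everyone, this is my voice cloned with Qwen3-TTS. Does it sound like me?"
text_to_generate = "大家好,这是我用 Qwen3-TTS 克隆出来的声音,听起来像我吗?"

audio = pipeline.run(
    text=text_to_generate,
    ref_audio_path=ref_audio_path,
    ref_text=ref_text
)

# Write the waveform at the pipeline's sample rate
scipy.io.wavfile.write("cloned_output.wav", pipeline.sample_rate, audio)
print("Voice cloning completed, saved as cloned_output.wav")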
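Before running the cloning script, it can help to confirm that the reference clip actually falls in the 3‑10 s range mentioned above. The following sketch uses only the standard‑library `wave` module, so it handles uncompressed PCM WAV files only; `check_reference_clip` is a hypothetical helper name, not part of qwen-tts.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_clip(path, min_seconds=3.0, max_seconds=10.0):
    """Return (ok, duration) for a reference clip against the 3-10 s guideline."""
    duration = wav_duration_seconds(path)
    return min_seconds <= duration <= max_seconds, duration


# Example call, with the file name used in this guide:
# ok, duration = check_reference_clip("my_voice.wav")
```

If the clip is too long, trimming it to a clean 3‑10 s segment and updating `ref_text` to match exactly what is spoken keeps the transcript and audio aligned.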

Scenario 2: Voice Design

Generate a voice from a textual description without a reference audio.

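import scipy.io.wavfile
from qwen_tts.pipeline import QwenTTSPipeline

# Load the VoiceDesign model, which builds a voice from a text description
pipeline = QwenTTSPipeline(
    model_id="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device="cuda"
)

# "A young woman's voice, gentle in tone, with a slight southern accent and a moderate pace."
voice_description = "一个年轻女性的声音,语气温柔,带有轻微的南方口音,语速适中。"
# "Life is more than the daily grind in front of you; there are also poetry and distant fields."
text = "生活不止眼前的苟且,还有诗和远方的田野。"

audio = pipeline.run(
    text=text,
    voice_description=voice_description
)

# Write the waveform at the pipeline's sample rate
scipy.io.wavfile.write("designed_voice.wav", pipeline.sample_rate, audio)
print("Voice design completed, saved as designed_voice.wav")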

Web UI Demo

If a graphical interface is preferred, the repository provides a Gradio/Streamlit demo. After installing web dependencies, run:

# Assuming the source has been cloned
cd Qwen3-TTS
pip install -r requirements_web.txt
python web_demo.py

Then open http://127.0.0.1:7860 in a browser to record, type text, and synthesize audio.

Common Issues

Insufficient VRAM: the 1.7B model in FP16 needs roughly 6‑8 GB of VRAM. Use the 0.6B version or a quantized model if memory is limited.

Network slowdown: if downloading from Hugging Face is slow, switch to ModelScope mirrors.

Inference speed: streaming generation is fast, but non‑streaming long‑text synthesis may take time; split long texts into sentences.
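The sentence splitting suggested for long texts can be sketched with a regular expression that handles both Chinese and Western punctuation. This is an illustrative helper, not part of qwen-tts: `split_sentences` is a hypothetical name, and the rule is deliberately simple (a Western period only ends a sentence when followed by whitespace, so "3.5" stays intact, but abbreviations such as "e.g." will still cause a split).

```python
import re

# A sentence is the shortest run of text that ends with:
#   - CJK end punctuation, optionally followed by closing quotes or brackets, or
#   - Western end punctuation followed by whitespace or end of line, or
#   - the end of the line itself.
_SENTENCE = re.compile(
    r".+?(?:[。！？；]+[”’」』）]*|[.!?;]+(?=\s|$)|$)",
    re.MULTILINE,
)


def split_sentences(text):
    """Split mixed Chinese/Western text into a list of trimmed sentences."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("Hello world. This is 3.5 GB! 你好。再见"):
        print(sentence)
```

Each returned sentence can then be synthesized in its own call and the resulting audio segments concatenated, which keeps every individual request short.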

Conclusion

Qwen3‑TTS lowers the barrier for high‑quality open‑source speech synthesis. It is suitable for video dubbing, intelligent assistants, or personal experiments on an Ubuntu server with a GPU.

References:

GitHub: https://github.com/QwenLM/Qwen3-TTS

Qwen Blog: https://qwen.ai/blog

Written by Ubuntu
