Open-Source Qwen3‑TTS: Sub‑100 ms Latency, Runs on 8 GB GPU, and ComfyUI Integration
Qwen3‑TTS, an open‑source text‑to‑speech model from Alibaba, delivers sub‑100 ms first‑packet latency and supports voice cloning, natural‑language voice design, and ten languages. It can be deployed locally on a GPU with as little as 8 GB of VRAM and integrates with ComfyUI for visual workflow building.
Overview
Qwen3‑TTS is an open‑source text‑to‑speech model that achieves 97 ms first‑packet latency, runs on a single GPU with ~8 GB VRAM, and supports streaming generation. It covers ten languages and provides three usage modes: voice cloning, voice design, and preset custom voices.
Three usage modes
Voice cloning
Provide a ~3‑second reference audio (local file, URL, base64 string, or NumPy array) and the model learns the speaker’s characteristics. Example code:
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load the base checkpoint used for voice cloning
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "your_reference.wav"
ref_text = "reference transcript"

wavs, sr = model.generate_voice_clone(
    text="What you want the AI to say",
    language="Chinese",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("output.wav", wavs[0], sr)
Voice design
No reference audio is required. Describe the desired voice style in natural language via the instruct parameter. Example:
# Load the voice-design checkpoint (no reference audio needed)
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_design(
    text="哥哥,你回来啦,人家等了你好久好久了,要抱抱!",
    language="Chinese",
    instruct="体现撒娇稚嫩的萝莉女声,音调偏高且起伏明显,营造出黏人、做作又刻意卖萌的听觉效果。",
)
sf.write("output.wav", wavs[0], sr)
Rich descriptors are supported, such as “用海盗那种粗犷的嗓音说话” (“speak in a pirate’s gruff voice”), “用特别愤怒的语气说” (“say it in an especially angry tone”), and “17岁男性,男高音,说话时有点紧张” (“17‑year‑old male, tenor, slightly nervous when speaking”).
Preset custom voices
The release ships several high‑quality preset voices (e.g., Vivian – Chinese female, Ryan – English male) that can be used without additional configuration.
Voice description guidelines
Five principles from the official API documentation:
Specific, not vague: use concrete terms such as “低沉” (“deep‑pitched”) or “语速快” (“fast speech rate”).
Multidimensional (combine gender, age, emotion, etc.).
Objective (describe acoustic features).
Original, not imitation (avoid requesting a specific celebrity’s voice).
Concise (each word adds meaning).
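To make the five principles concrete, here is a small illustrative helper (not part of the qwen-tts package; the function name and fields are invented for this sketch) that composes an instruct string from specific, objective, multidimensional attributes:

```python
def build_instruct(gender: str, age: str, timbre: str,
                   emotion: str = "", pacing: str = "") -> str:
    """Compose a concise, multidimensional voice description.

    Each field is a specific, objective acoustic attribute; empty
    fields are dropped so every word in the result adds meaning.
    """
    parts = [gender, age, timbre, emotion, pacing]
    return ", ".join(p for p in parts if p)

desc = build_instruct("male", "17 years old", "tenor", emotion="slightly nervous")
print(desc)  # -> male, 17 years old, tenor, slightly nervous
```

The resulting string can be passed as the instruct parameter of generate_voice_design.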
Technical highlights
Two tokenizers are provided:
Qwen‑TTS‑Tokenizer‑25Hz : single codebook, emphasizes semantic fidelity, suited for tasks requiring high semantic accuracy.
Qwen‑TTS‑Tokenizer‑12Hz : 16‑layer multi‑codebook, extreme bitrate compression, enables ultra‑low‑latency streaming (97 ms first‑packet latency) while preserving paralinguistic and acoustic‑environment information.
The open‑source release uses the 12 Hz tokenizer. The architecture adopts a dual‑track design that bypasses the traditional LM + DiT bottleneck, yielding higher synthesis quality and faster generation. Mixed streaming generation allows the same model to produce both streaming and non‑streaming outputs with the same 97 ms first‑packet latency.
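Back-of-the-envelope arithmetic on the two frame rates shows one reason the 12 Hz tokenizer suits streaming: the model has fewer codec frames to generate per second of audio (illustrative only; internal codec details beyond the stated frame rates are not covered here):

```python
def frames(seconds: float, frame_rate_hz: int) -> int:
    # Number of codec frames needed for a clip at a given tokenizer frame rate
    return int(seconds * frame_rate_hz)

# A 10-second utterance:
print(frames(10, 25))  # 25 Hz tokenizer -> 250 frames
print(frames(10, 12))  # 12 Hz tokenizer -> 120 frames, less than half the sequence length
```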
Performance comparison
Official benchmark latency (lower is better):
Qwen3‑TTS: 1.517 s total latency
Higgs‑Audio‑v2: 5.505 s
VoxCPM: 4.835 s
Training data consists of 5 million hours of speech covering ten languages.
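From the benchmark numbers above, the relative speedups work out as follows (simple arithmetic on the reported totals):

```python
# Total latencies in seconds, from the official benchmark above
qwen3_tts = 1.517
higgs_audio_v2 = 5.505
voxcpm = 4.835

print(round(higgs_audio_v2 / qwen3_tts, 2))  # -> 3.63 (times faster than Higgs-Audio-v2)
print(round(voxcpm / qwen3_tts, 2))          # -> 3.19 (times faster than VoxCPM)
```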
Local deployment
Environment setup:
# Create environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
# Install core package
pip install -U qwen-tts
# Optional: install FlashAttention‑2 to reduce VRAM usage
pip install -U flash-attn --no-build-isolation
Start the Web UI for each mode:
# Preset voice UI
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000
# Voice design UI
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --ip 0.0.0.0 --port 8000
# Voice clone UI
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
Access the UI at http://localhost:8000.
vLLM deployment (day‑0 support)
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni/examples/offline_inference/qwen3_tts
# Preset voice
python end2end.py --query-type CustomVoice
# Voice design
python end2end.py --query-type VoiceDesign
# Voice clone
python end2end.py --query-type Base --mode-tag icl
Only offline inference is currently supported.
ComfyUI integration
The community plugin ComfyUI‑Qwen‑TTS wraps the three modes into draggable nodes, enabling visual workflow construction without writing any code. Installation:
cd ComfyUI/custom_nodes
git clone https://github.com/flybirdxx/ComfyUI-Qwen-TTS.git
cd ComfyUI-Qwen-TTS
pip install torch torchaudio transformers librosa accelerate
After restarting ComfyUI, the new nodes appear in the node menu.
Practical tips
Voice‑clone node: use a reference audio of 5–15 seconds; shorter clips give unstable results, and longer clips add no extra benefit.
VRAM optimization: bf16 precision roughly halves memory usage with negligible quality loss.
Pre‑download model weights to ComfyUI/models/qwen-tts/ to avoid Hugging Face timeouts.
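The bf16 tip can be sanity-checked with parameter-memory arithmetic (weights only; activations, the KV cache, and the vocoder add overhead on top of this):

```python
def weight_gib(n_params: float, bytes_per_param: int) -> float:
    # Memory for model weights alone, in GiB
    return n_params * bytes_per_param / 2**30

n = 1.7e9  # parameter count of the 1.7B models
print(round(weight_gib(n, 4), 2))  # fp32 -> 6.33 GiB
print(round(weight_gib(n, 2), 2))  # bf16 -> 3.17 GiB, roughly half
```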
Typical workflow for video dubbing
Text node → Qwen3‑TTS VoiceDesign → Audio output
Text node → Qwen3‑TTS CustomVoice → Audio output
Text node → Qwen3‑TTS VoiceClone → Audio output
Model selection guide
Qwen3‑TTS‑12Hz‑1.7B‑CustomVoice – preset voices – ~8 GB VRAM.
Qwen3‑TTS‑12Hz‑1.7B‑VoiceDesign – natural‑language voice design – ~8 GB VRAM.
Qwen3‑TTS‑12Hz‑1.7B‑Base – voice cloning – ~8 GB VRAM.
Qwen3‑TTS‑12Hz‑0.6B‑CustomVoice – lightweight preset – ~4 GB VRAM.
Qwen3‑TTS‑12Hz‑0.6B‑Base – lightweight cloning – ~4 GB VRAM.
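The selection guide above can be captured in a small lookup helper (illustrative only; it uses exactly the model IDs listed, and the function itself is not part of any package):

```python
MODELS = {
    # (mode, lightweight?) -> Hugging Face model ID
    ("preset", False): "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("design", False): "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    ("clone",  False): "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    ("preset", True):  "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ("clone",  True):  "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
}

def pick_model(mode: str, vram_gb: float) -> str:
    # Prefer the 0.6B variants when under ~8 GB of VRAM; note there is
    # no 0.6B VoiceDesign checkpoint in the list above, so voice design
    # always falls back to the 1.7B model.
    key = (mode, vram_gb < 8)
    if key not in MODELS:
        key = (mode, False)
    return MODELS[key]

print(pick_model("clone", 6))  # -> Qwen/Qwen3-TTS-12Hz-0.6B-Base
```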
Advanced “design‑then‑clone” workflow
# Step 1: design a reference voice from a text description
design_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    ...
)
ref_instruct = "17岁男性,男高音,说话时会有点紧张"  # 17-year-old male, tenor, slightly nervous
ref_wavs, sr = design_model.generate_voice_design(
    text="参考文本",  # reference text
    instruct=ref_instruct,
)

# Step 2: build a reusable clone prompt from the designed audio
clone_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    ...
)
voice_clone_prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(ref_wavs[0], sr),
    ...
)

# Step 3: reuse the prompt for any new text
wavs, sr = clone_model.generate_voice_clone(
    text="新台词",  # the new line to speak
    voice_clone_prompt=voice_clone_prompt,
)
Resources
GitHub: https://github.com/QwenLM/Qwen3-TTS
Hugging Face collection: https://huggingface.co/collections/Qwen/qwen3-tts
ModelScope collection: https://modelscope.cn/collections/Qwen/Qwen3-TTS
Blog post: https://qwen.ai/blog?id=qwen3tts-0115
Paper (arXiv): https://arxiv.org/abs/2601.15621
Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS
ModelScope demo: https://modelscope.cn/studios/Qwen/Qwen3-TTS
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.