Open-Source Qwen3‑TTS: Sub‑100 ms Latency, Runs on 8 GB GPU, and ComfyUI Integration

Qwen3‑TTS, Alibaba's open‑source text‑to‑speech model, offers sub‑100 ms first‑packet latency; supports voice cloning, natural‑language voice design, and ten languages; runs locally on a GPU with as little as 8 GB of VRAM; and integrates with ComfyUI for visual workflow building.

Overview

Qwen3‑TTS is an open‑source text‑to‑speech model that achieves 97 ms first‑packet latency, runs on a single GPU with ~8 GB VRAM, and supports streaming generation. It covers ten languages and provides three usage modes: voice cloning, voice design, and preset custom voices.

Three usage modes

Voice cloning

Provide a ~3‑second reference audio (local file, URL, base64 string, or NumPy array) and the model learns the speaker’s characteristics. Example code:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "your_reference.wav"
ref_text = "reference transcript"

wavs, sr = model.generate_voice_clone(
    text="What you want the AI to say",
    language="Chinese",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("output.wav", wavs[0], sr)

Voice design

No reference audio is required. Describe the desired voice style in natural language via the instruct parameter. Example:

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_design(
    # Text: "Big brother, you're back! I've been waiting for you for so, so long -- I want a hug!"
    text="哥哥,你回来啦,人家等了你好久好久了,要抱抱!",
    language="Chinese",
    # Instruct: "A coquettish, childlike loli voice, high-pitched with pronounced
    # rises and falls, creating a clingy, affected, deliberately cutesy effect."
    instruct="体现撒娇稚嫩的萝莉女声,音调偏高且起伏明显,营造出黏人、做作又刻意卖萌的听觉效果。",
)
sf.write("output.wav", wavs[0], sr)

Rich descriptors are supported, such as "speak in a rough pirate voice" (用海盗那种粗犷的嗓音说话), "say it in a particularly angry tone" (用特别愤怒的语气说), and "17-year-old male, tenor, slightly nervous when speaking" (17岁男性,男高音,说话时有点紧张).

Preset custom voices

The release ships several high‑quality preset voices (e.g., Vivian – Chinese female, Ryan – English male) that can be used without additional configuration.
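
The preset-voice call is not shown in this article; the sketch below assumes a method analogous to the clone and design calls above, where generate_custom_voice and the speaker parameter are placeholder names to check against the repository README:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

# generate_custom_voice and speaker are assumed by analogy with the
# clone/design APIs above; consult the official README for exact names.
wavs, sr = model.generate_custom_voice(
    text="Hello, this uses a built-in preset voice.",
    language="English",
    speaker="Ryan",
)
sf.write("preset_output.wav", wavs[0], sr)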

Voice description guidelines

Five principles from the official API documentation:

Specific, not vague (e.g., "deep voice" (低沉), "fast speaking rate" (语速快)).

Multidimensional (combine gender, age, emotion, etc.).

Objective (describe acoustic features).

Original, not imitation (avoid requesting a specific celebrity’s voice).

Concise (each word adds meaning).
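
Putting the five principles together: with the VoiceDesign model loaded as in the earlier example, a well-formed instruct contrasts with a vague one as follows (the values are illustrative only):

# Vague -- violates the first principle and gives the model little to act on.
bad_instruct = "a nice voice"

# Specific, multidimensional, objective, original, and concise.
good_instruct = (
    "Female, around 25, warm mid-range timbre, "
    "moderate speaking rate, calm and reassuring tone."
)

wavs, sr = model.generate_voice_design(
    text="Welcome back. Let's pick up where we left off.",
    language="English",
    instruct=good_instruct,
)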

Technical highlights

Two tokenizers are provided:

Qwen‑TTS‑Tokenizer‑25Hz: single codebook, emphasizes semantic fidelity, suited to tasks requiring high semantic accuracy.

Qwen‑TTS‑Tokenizer‑12Hz: 16‑layer multi‑codebook, extreme bitrate compression, enables ultra‑low‑latency streaming (97 ms first‑packet latency) while preserving paralinguistic and acoustic‑environment information.

The open‑source release uses the 12 Hz tokenizer. The architecture adopts a dual‑track design that bypasses the traditional LM + DiT bottleneck, yielding higher synthesis quality and faster generation. Mixed streaming generation lets the same model produce both streaming and non‑streaming output, with the same 97 ms first‑packet latency in streaming mode.
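
The snippets in this article all use the non-streaming calls; the loop below is only a sketch of what a streaming consumer might look like, and the generate_stream method name and per-chunk format are assumptions, not a confirmed API:

import numpy as np

chunks = []
# Hypothetical streaming API -- each chunk would be a short PCM segment,
# with the first chunk arriving ~97 ms after the request per the benchmark.
for chunk in model.generate_stream(
        text="Streaming synthesis example.",
        language="English"):
    chunks.append(chunk)

audio = np.concatenate(chunks)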

Performance comparison

Official benchmark figures for total synthesis latency (lower is better):

Qwen3‑TTS: 1.517 s

Higgs‑Audio‑v2: 5.505 s

VoxCPM: 4.835 s

Training data consists of 5 million hours of speech covering ten languages.
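
To sanity-check these figures on your own hardware, you can time a full non-streaming call; note that this measures total latency rather than the 97 ms first-packet figure, and it assumes the model and reference variables from the voice-cloning example above:

import time

start = time.perf_counter()
wavs, sr = model.generate_voice_clone(
    text="A short benchmark sentence.",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
elapsed = time.perf_counter() - start
print(f"Total generation latency: {elapsed:.3f} s")  # compare against 1.517 s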

Local deployment

Environment setup:

# Create environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# Install core package
pip install -U qwen-tts

# Optional: install FlashAttention‑2 to reduce VRAM usage
pip install -U flash-attn --no-build-isolation

Start the Web UI for each mode:

# Preset voice UI
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000

# Voice design UI
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --ip 0.0.0.0 --port 8000

# Voice clone UI
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000

Access the UI at http://localhost:8000.

vLLM deployment (day‑0 support)

git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni/examples/offline_inference/qwen3_tts

# Preset voice
python end2end.py --query-type CustomVoice

# Voice design
python end2end.py --query-type VoiceDesign

# Voice clone
python end2end.py --query-type Base --mode-tag icl

Only offline inference is currently supported.

ComfyUI integration

A community plugin, ComfyUI‑Qwen‑TTS, wraps the three modes into draggable nodes, enabling visual workflow construction without writing code. Installation:

cd ComfyUI/custom_nodes
git clone https://github.com/flybirdxx/ComfyUI-Qwen-TTS.git
cd ComfyUI-Qwen-TTS
pip install torch torchaudio transformers librosa accelerate

After restarting ComfyUI the new nodes appear in the menu.

Practical tips

Voice‑clone node: use reference audio of 5–15 seconds; shorter clips are unstable, and longer clips give no extra benefit.

VRAM optimization: bf16 precision roughly halves memory usage with negligible quality loss.

Pre‑download model weights to ComfyUI/models/qwen-tts/ to avoid Hugging Face timeouts.
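
One way to pre-fetch the weights is with huggingface_hub; the target directory below assumes the plugin reads from ComfyUI/models/qwen-tts/, so adjust it to whatever ComfyUI-Qwen-TTS actually expects:

from huggingface_hub import snapshot_download

# Download once ahead of time so ComfyUI never hits a network timeout.
snapshot_download(
    repo_id="Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    local_dir="ComfyUI/models/qwen-tts/Qwen3-TTS-12Hz-1.7B-Base",
)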

Typical workflow for video dubbing

Text node → Qwen3‑TTS VoiceDesign → Audio output
Text node → Qwen3‑TTS CustomVoice → Audio output
Text node → Qwen3‑TTS VoiceClone → Audio output

Model selection guide

Qwen3‑TTS‑12Hz‑1.7B‑CustomVoice – preset voices – ~8 GB VRAM.

Qwen3‑TTS‑12Hz‑1.7B‑VoiceDesign – natural‑language voice design – ~8 GB VRAM.

Qwen3‑TTS‑12Hz‑1.7B‑Base – voice cloning – ~8 GB VRAM.

Qwen3‑TTS‑12Hz‑0.6B‑CustomVoice – lightweight preset – ~4 GB VRAM.

Qwen3‑TTS‑12Hz‑0.6B‑Base – lightweight cloning – ~4 GB VRAM.

Advanced “design‑then‑clone” workflow

# Step 1: design a reference voice
design_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    ...
)
# Instruct: "17-year-old male, tenor, slightly nervous when speaking"
ref_instruct = "17岁男性,男高音,说话时会有点紧张"
ref_wavs, sr = design_model.generate_voice_design(
    text="参考文本",  # "reference text"
    instruct=ref_instruct,
)

# Step 2: build a reusable clone prompt from the designed audio
clone_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    ...
)
voice_clone_prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(ref_wavs[0], sr), ...
)

# Step 3: reuse the prompt for any new text
wavs, sr = clone_model.generate_voice_clone(
    text="新台词",  # "new lines to speak"
    voice_clone_prompt=voice_clone_prompt,
)

Resources

GitHub: https://github.com/QwenLM/Qwen3-TTS

Hugging Face collection: https://huggingface.co/collections/Qwen/qwen3-tts

ModelScope collection: https://modelscope.cn/collections/Qwen/Qwen3-TTS

Blog post: https://qwen.ai/blog?id=qwen3tts-0115

Paper (arXiv): https://arxiv.org/abs/2601.15621

Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS

ModelScope demo: https://modelscope.cn/studios/Qwen/Qwen3-TTS
