8 Must‑Watch Open‑Source TTS Projects for 2026

This article reviews eight open‑source text‑to‑speech systems—from lightweight, CPU‑only models to multilingual, podcast‑focused engines—detailing their architectures, language coverage, benchmark scores, licensing, and practical use‑case recommendations.

Geek Labs

Qwen3‑TTS: Alibaba Tongyi Qianwen Team’s All‑Rounder

Qwen3‑TTS is an open‑source TTS series released by Alibaba Cloud’s Tongyi Qianwen team. It is built on the team’s in‑house Qwen3‑TTS‑Tokenizer‑12Hz acoustic encoder and a discrete‑codebook language‑model architecture for end‑to‑end speech modeling.

It supports ten languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) and comes in 0.6 B and 1.7 B parameter variants. Its key highlight is a dual‑track hybrid streaming architecture that handles both streaming and non‑streaming generation, with a minimum end‑to‑end latency of 97 ms: audio output can begin as soon as a single character has been received.
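To make the latency figure concrete, the sketch below shows one way to measure the time until a streaming engine delivers its first audio chunk. It is only an illustration: `fake_tts_stream` is a hypothetical stand‑in that sleeps and yields placeholder bytes, not the Qwen3‑TTS API, and the 10 ms delay is an arbitrary assumption.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(text: str, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine: one chunk per character."""
    for _ in text:
        time.sleep(chunk_delay_s)  # pretend the model is generating audio
        yield b"\x00" * 320        # placeholder bytes, not real speech


def first_chunk_latency(stream: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total number of chunks)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    if first is None:
        raise ValueError("stream produced no audio")
    return first, count


latency_s, n_chunks = first_chunk_latency(fake_tts_stream("hello"))
print(f"first chunk after {latency_s * 1000:.1f} ms, {n_chunks} chunks total")
```

With a real engine, the same helper could wrap whatever iterator the engine exposes; a non‑streaming engine would show a first‑chunk latency close to its total synthesis time.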

The model also accepts natural‑language instructions such as “use a deeper voice, slower, with a sad tone” and adjusts pitch, rhythm, and emotion accordingly.

The repository has passed 11.9 k GitHub stars, making it one of the most active open‑source TTS projects.

Qwen3‑TTS project homepage

dots.tts: Xiaohongshu HiLab’s Fully Continuous AR Solution

dots.tts is a 2 B‑parameter TTS system from Xiaohongshu HiLab. Its distinguishing feature is fully continuous modeling with no discrete tokens: the semantic encoder, LLM, and acoustic head all operate in a continuous space, and the system outputs 48 kHz audio.

On the Seed‑TTS‑Eval benchmark it reports the best average word error rate (WER, lower is better) among open‑source models: Chinese 0.94 %, English 1.30 %, difficult set 6.60 %, with speaker similarity scores of 81.0/77.1/79.5. On the MiniMax multilingual benchmark (24 languages) it attains an average speaker similarity of 83.9, ranking first.
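For readers unfamiliar with the metric, WER is the word‑level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words. The sketch below is a minimal implementation under the assumption that whitespace splitting is an adequate tokenizer; it is not the benchmark's own scoring script, and the example sentences are made up.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER = {wer:.2%}")  # one substitution in six words -> WER = 16.67%
```

In TTS evaluation the hypothesis is typically obtained by running a speech recognizer over the synthesized audio, so a low WER indicates intelligible, faithful speech.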

Three checkpoints are provided: a pretrained version, a self‑correcting alignment (SCA) version with the best cloning quality, and a MeanFlow‑distilled version that needs only four inference steps and reaches a first‑packet latency of 85 ms. All code is released under Apache 2.0, and fine‑tuning and distillation are supported.

dots.tts project homepage

Fish Speech: Established Multilingual TTS Mainstay

Fish Speech, developed by the Fish Audio team, is a large‑scale multilingual TTS system with 4 B parameters, trained on over 10 million hours of data and supporting roughly 50 languages.

It uses a dual autoregressive architecture (Slow AR + Fast AR) combined with GRPO reinforcement learning for alignment. It reports WERs of 0.54 % on Chinese and 0.99 % on English. Fine‑grained emotional control is possible via tags such as [laugh], [whispers], and [super happy].

Voice cloning needs only a 10–30 second reference sample, and production‑grade streaming inference is available through SGLang. The repository has 30.8 k stars. The license is CC‑BY‑NC‑SA‑4.0, a non‑commercial license, so check the terms carefully before any commercial use.

Fish Speech project homepage

SoulX‑Podcast: Designed for Podcast Scenarios

SoulX‑Podcast, from the Soul AI team, is a 1.7 B‑parameter system specialized for multi‑turn, multi‑speaker dialogue. It supports paralinguistic tags such as [laughter], [sigh], and [breath] to make generated conversations sound more natural.
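The bracketed tags mentioned above lend themselves to a quick pre‑flight check before a script is sent to a model. The sketch below separates plain text from tags and flags any tag outside the three named in this article. The function name, the regular expression, and the idea that unrecognized tags should be flagged are all assumptions for illustration; the project's actual input format may differ.

```python
import re
from typing import List, Tuple

KNOWN_TAGS = {"laughter", "sigh", "breath"}  # only the tags named in the article
TAG_PATTERN = re.compile(r"\[([a-z_ ]+)\]")  # assumed bracket syntax


def split_tags(line: str) -> Tuple[str, List[str], List[str]]:
    """Return (plain text, recognized tags, unrecognized tags) for one line."""
    found = TAG_PATTERN.findall(line)
    known = [tag for tag in found if tag in KNOWN_TAGS]
    unknown = [tag for tag in found if tag not in KNOWN_TAGS]
    # Remove the tags, then collapse the whitespace they leave behind.
    plain = re.sub(r"\s+", " ", TAG_PATTERN.sub(" ", line)).strip()
    return plain, known, unknown


text, known, unknown = split_tags(
    "That was close [sigh] but we made it [laughter] [cough]"
)
print(text)     # That was close but we made it
print(known)    # ['sigh', 'laughter']
print(unknown)  # ['cough']
```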

The engine handles Mandarin, English, and several Chinese dialects (Sichuanese, Henanese, Cantonese) and can perform zero‑shot voice cloning across dialects.

It can generate more than 90 minutes of continuous multi‑speaker dialogue with stable timbre and context‑aware prosody. The repository has 3.4 k stars and is Apache 2.0 licensed.

SoulX‑Podcast project homepage

Supertonic 3: Ultra‑Lightweight On‑Device Solution

Supertonic, from the Supertone team (now part of Hybe), is an on‑device TTS system. Version 3 has only 99 M parameters and supports 31 languages.

The core advantage is speed: on consumer hardware it achieves 167× real‑time speed via ONNX Runtime, without requiring a GPU. It runs on Python, Node.js, browsers (WebGPU), Java, C++, Swift, Flutter, and other platforms.
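A speed multiple is easiest to interpret as wall‑clock time. The arithmetic below converts the 167× figure quoted above into the time needed to synthesize one hour of audio, plus the equivalent real‑time factor (processing time divided by audio duration). The one‑hour workload is an arbitrary example, and actual throughput will depend on the hardware.

```python
def synthesis_seconds(audio_seconds: float, realtime_multiple: float) -> float:
    """Wall-clock seconds needed to synthesize audio at N-times real-time speed."""
    if realtime_multiple <= 0:
        raise ValueError("realtime_multiple must be positive")
    return audio_seconds / realtime_multiple


one_hour = 3600.0
elapsed = synthesis_seconds(one_hour, 167.0)
rtf = elapsed / one_hour  # real-time factor; below 1.0 means faster than playback

print(f"1 h of audio in about {elapsed:.1f} s")  # 1 h of audio in about 21.6 s
print(f"real-time factor = {rtf:.4f}")           # real-time factor = 0.0060
```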

v3 adds ten expressive tags and outputs 44.1 kHz high‑quality audio. The repository has 12.3 k Stars and is MIT licensed, suitable for privacy‑sensitive local deployment.

Supertonic project homepage

Voicebox: One‑Stop AI Voice Studio

Voicebox is a local‑first AI voice desktop application that integrates seven TTS engines (including Qwen3‑TTS, Kokoro, Chatterbox, LuxTTS, and HumeAI TADA) and supports 23 languages.

It is not a single model but a full “voice I/O stack”: it can synthesize speech, transcribe speech to text via Whisper, and polish text with a bundled local LLM. Features include voice cloning, multi‑track story editing, global transcription shortcuts, and MCP integration for AI coding assistants such as Claude Code.

It is built with Tauri (Rust) for native performance on macOS, Windows, and Linux. The project has 30 k stars and is MIT licensed.

Voicebox project homepage

Kokoro TTS: CLI Geek’s Favorite

Kokoro TTS is a command‑line text‑to‑speech tool based on an ONNX model with 82 M parameters, targeting terminal users.

It supports eight languages (including Mandarin), provides over 50 preset voices, and allows voice mixing (e.g., 60 % Sarah + 40 % Michael). Input formats include TXT, EPUB, PDF; it also supports streaming playback and chapter splitting.
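One plausible reading of voice mixing is a weighted average of per‑voice embedding vectors. The sketch below illustrates that idea with a 60/40 blend. Everything here is an assumption for illustration: the `name:weight` spec format, the function names, and the three‑dimensional toy vectors are hypothetical and are not taken from the tool's documentation.

```python
from typing import Dict, List


def parse_mix(spec: str) -> Dict[str, float]:
    """Parse a spec like 'sarah:60,michael:40' into weights that sum to 1."""
    weights: Dict[str, float] = {}
    for part in spec.split(","):
        name, _, raw = part.strip().partition(":")
        weights[name] = float(raw)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: weight / total for name, weight in weights.items()}


def blend(voices: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted element-wise average of the selected voice vectors."""
    dim = len(next(iter(voices.values())))
    mixed = [0.0] * dim
    for name, weight in weights.items():
        for i, value in enumerate(voices[name]):
            mixed[i] += weight * value
    return mixed


# Toy embeddings; real voice vectors would be much longer.
voices = {"sarah": [1.0, 0.0, 2.0], "michael": [0.0, 1.0, 4.0]}
weights = parse_mix("sarah:60,michael:40")
mixed = blend(voices, weights)
print(weights)
print([round(value, 3) for value in mixed])
```

Normalizing the weights means a spec such as `sarah:3,michael:2` produces the same blend as `sarah:60,michael:40`.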

Installation takes a single command, pip install kokoro-tts; once the model files are downloaded, the tool is ready to use. The repository has 1.6 k stars and is MIT licensed.

Kokoro TTS project homepage

Pocket TTS: Lightweight CPU‑Only Option

Pocket TTS, developed by France’s Kyutai Lab, is a 100 M‑parameter model optimized for CPU execution.

On a MacBook Air M4 it reaches roughly 6× real‑time speed using only two CPU cores, with about 200 ms of latency before the first audio is produced. It supports six languages (English, French, German, Portuguese, Italian, Spanish) and can run in browsers via WebAssembly.

The repository has 4.6 k Stars and is MIT licensed, making it suitable for edge devices, mobile, and browser scenarios.

Pocket TTS project homepage

Quick Project Overview

Qwen3‑TTS · 1.7 B params · 10 languages · Chinese support · 3 s voice cloning · 97 ms streaming · Apache 2.0 · ⭐ 11.9 k

dots.tts · 2 B params · 24+ languages · Chinese support · voice cloning · 85 ms streaming · Apache 2.0 · ⭐ ≈ 700

Fish Speech · 4 B params · ~50 languages · Chinese support · 10 s voice cloning · streaming · CC‑BY‑NC‑SA · ⭐ 30.8 k

SoulX‑Podcast · 1.7 B params · Mandarin, English, dialects · Chinese support · voice cloning · no streaming · Apache 2.0 · ⭐ 3.4 k

Supertonic 3 · 99 M params · 31 languages · no Chinese support · voice cloning · no streaming · MIT · ⭐ 12.3 k

Voicebox · Multi‑engine integration · 23 languages · Chinese support · voice cloning · streaming · MIT · ⭐ 30 k

Kokoro TTS · 82 M params · 8 languages · Chinese support · no cloning · streaming · MIT · ⭐ 1.6 k

Pocket TTS · 100 M params · 6 languages · no Chinese support · voice cloning · no streaming · MIT · ⭐ 4.6 k
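The overview above can also be treated as data and filtered by requirement. The sketch below transcribes the license, Chinese‑support, cloning, and streaming columns exactly as listed and shortlists projects that satisfy a set of constraints. The `Project` class, the `shortlist` helper, and the grouping of Apache 2.0 and MIT as "permissive" are illustrative choices rather than anything the projects define.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Project:
    name: str
    license: str
    chinese: bool
    cloning: bool
    streaming: bool


# Values transcribed from the overview list above.
PROJECTS = [
    Project("Qwen3-TTS", "Apache 2.0", True, True, True),
    Project("dots.tts", "Apache 2.0", True, True, True),
    Project("Fish Speech", "CC-BY-NC-SA", True, True, True),
    Project("SoulX-Podcast", "Apache 2.0", True, True, False),
    Project("Supertonic 3", "MIT", False, True, False),
    Project("Voicebox", "MIT", True, True, True),
    Project("Kokoro TTS", "MIT", True, False, True),
    Project("Pocket TTS", "MIT", False, True, False),
]

PERMISSIVE = {"Apache 2.0", "MIT"}  # licenses that allow commercial use


def shortlist(chinese: bool = False, cloning: bool = False,
              streaming: bool = False, permissive_only: bool = False) -> List[str]:
    """Names of projects meeting every requested requirement."""
    return [
        p.name for p in PROJECTS
        if (not chinese or p.chinese)
        and (not cloning or p.cloning)
        and (not streaming or p.streaming)
        and (not permissive_only or p.license in PERMISSIVE)
    ]


print(shortlist(chinese=True, cloning=True, streaming=True, permissive_only=True))
# ['Qwen3-TTS', 'dots.tts', 'Voicebox']
```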

Official Evaluation Data (Seed‑TTS‑Eval, WER; lower is better)

dots.tts (SCA) : Chinese 0.94 % / English 1.30 % / Difficult 6.60 % / Avg 2.95 %

Qwen3‑TTS (1.7B) : Chinese 1.22 % / English 1.23 % / Difficult 6.76 % / Avg 3.07 %

CosyVoice 3 : Chinese 1.12 % / English 2.22 % / Difficult 5.83 % / Avg 3.06 %

VoxCPM 2 : Chinese 0.97 % / English 1.84 % / Difficult 8.13 % / Avg 3.65 %

dots.tts achieves the best overall open‑source performance, while Qwen3‑TTS leads on the English metric.
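The Avg column is consistent with an unweighted mean of the three per‑set scores, which the short check below reproduces from the figures listed above. Treating the average as an equal‑weight mean is an inference from the numbers, not something the source states.

```python
from typing import Dict, Tuple

# (Chinese, English, Difficult) WER in percent, as listed above.
SCORES: Dict[str, Tuple[float, float, float]] = {
    "dots.tts (SCA)": (0.94, 1.30, 6.60),
    "Qwen3-TTS (1.7B)": (1.22, 1.23, 6.76),
    "CosyVoice 3": (1.12, 2.22, 5.83),
    "VoxCPM 2": (0.97, 1.84, 8.13),
}


def average(values: Tuple[float, ...]) -> float:
    """Unweighted mean of the per-set scores."""
    return sum(values) / len(values)


# Lower WER is better, so sort ascending.
ranked = sorted(SCORES, key=lambda name: average(SCORES[name]))
for name in ranked:
    print(f"{name}: {average(SCORES[name]):.2f} %")

best_english = min(SCORES, key=lambda name: SCORES[name][1])
print("lowest English WER:", best_english)
```

Sorted this way, dots.tts comes first at 2.95 %, with CosyVoice 3 (3.06 %) narrowly ahead of Qwen3‑TTS (3.07 %), while Qwen3‑TTS has the lowest English WER.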

How to Choose?

Prioritize audio quality and Chinese performance → dots.tts or Fish Speech. dots.tts tops the Seed‑TTS‑Eval benchmark; Fish Speech has the most mature ecosystem.

Need real‑time streaming → Qwen3‑TTS, with 97 ms end‑to‑end latency and audio that can start after a single input character.

Building podcasts or multi‑speaker dialogue → SoulX‑Podcast, the only project in this list designed for multi‑turn conversations, with support for dialects and paralinguistic tags.

Run locally on lightweight hardware → Supertonic 3, 99 M parameters, 31 languages, hardware‑agnostic.

Want an out‑of‑the‑box desktop app → Voicebox, which bundles seven engines with a single installer.

Command‑line power user / e‑book audio → Kokoro TTS, a one‑command setup.

Target edge devices or browsers → Pocket TTS, pure CPU execution with ~6× real‑time speed.

GitHub project links:

Qwen3‑TTS: https://github.com/QwenLM/Qwen3-TTS
dots.tts: https://github.com/rednote-hilab/dots.tts
Fish Speech: https://github.com/fishaudio/fish-speech
SoulX‑Podcast: https://github.com/Soul-AILab/SoulX-Podcast
Supertonic: https://github.com/supertone-inc/supertonic
Voicebox: https://github.com/jamiepine/voicebox
Kokoro TTS: https://github.com/nazdridoy/kokoro-tts
Pocket TTS: https://github.com/kyutai-labs/pocket-tts