8 Must‑Watch Open‑Source TTS Projects for 2026

This article reviews eight open‑source text‑to‑speech systems—from lightweight, CPU‑only models to multilingual, podcast‑focused engines—detailing their architectures, language coverage, benchmark scores, licensing, and practical use‑case recommendations.

Geek Labs

Jun 18, 2026

Qwen3‑TTS: Alibaba Tongyi Qianwen Team’s All‑Rounder

Qwen3‑TTS is an open‑source TTS series released by Alibaba Cloud’s Tongyi Qianwen team, built on the self‑developed Qwen3‑TTS‑Tokenizer‑12Hz acoustic encoder and a discrete‑codebook language‑model architecture for end‑to‑end speech modeling.

It supports ten languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) and comes in 0.6 B and 1.7 B parameter variants. Its key feature is a dual‑track hybrid streaming architecture that handles both streaming and non‑streaming generation, with a minimum end‑to‑end latency of 97 ms: the first audio packet can be emitted after a single input character.

The model also accepts natural‑language instruction control, e.g., “use a deeper voice, slower, with a sad tone,” and automatically adjusts pitch, rhythm, and emotion.

The repository has more than 11.9 k GitHub stars, making it one of the most popular open‑source TTS projects.

dots.tts: Xiaohongshu HiLab’s Fully Continuous AR Solution

dots.tts is a 2 B‑parameter TTS system from Xiaohongshu HiLab. Its distinguishing feature is fully continuous modeling without discrete tokens: the semantic encoder, LLM, and acoustic head all operate in a continuous space, and the system outputs 48 kHz audio.

On the Seed‑TTS‑Eval benchmark it achieves the best open‑source average results: Chinese WER 0.94 %, English WER 1.30 %, difficult‑set WER 6.60 %; speaker similarity scores of 81.0/77.1/79.5. On the MiniMax multilingual benchmark (24 languages) it attains an average speaker similarity of 83.9, ranking first.
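For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate is generally computed: the word‑level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words. It is not the Seed‑TTS‑Eval pipeline (which also involves an ASR model and text normalization not described in this article); the function name `word_error_rate` and the sample sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"
    hyp = "the quick brown fox jumped over lazy dog"
    # One substitution plus one deletion over nine reference words.
    print(f"WER = {word_error_rate(ref, hyp):.2%}")
```

A WER of 0.94 % therefore means fewer than one word error per hundred reference words, under whatever transcription setup the benchmark uses.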

Three checkpoints are provided: a pretrained version; a self‑correcting alignment (SCA) version with the best cloning quality; and a MeanFlow‑distilled version that needs only four inference steps and reaches a first‑packet latency of 85 ms. The code is released under Apache 2.0 and supports fine‑tuning and distillation.

Fish Speech: Established Multilingual TTS Benchmark

Fish Speech, developed by the Fish Audio team, is a large‑scale multilingual TTS system with 4 B parameters, trained on over 10 million hours of data and supporting roughly 50 languages.

It uses a dual autoregressive architecture (Slow AR + Fast AR) combined with GRPO reinforcement learning for alignment. It reports WERs of 0.54 % on Chinese and 0.99 % on English. Fine‑grained emotional control is possible via inline tags such as [laugh], [whispers], and [super happy].
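To make the inline‑tag convention concrete, here is a small sketch that splits a script into plain‑text and tag segments. It only illustrates the bracketed markup shown above; the helper `split_tags` is hypothetical and is not part of Fish Speech, which handles tags inside the model pipeline.

```python
import re
from typing import List, Tuple

# Matches one bracketed tag such as [laugh] or [super happy].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Split text into ordered ("text", ...) and ("tag", ...) segments."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        plain = text[pos:match.start()].strip()
        if plain:
            segments.append(("text", plain))
        segments.append(("tag", match.group(1)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


if __name__ == "__main__":
    for kind, value in split_tags("[whispers] Don't wake him. [laugh] Too late!"):
        print(kind, "->", value)
```

A pre‑processing step like this can be handy for validating scripts (for example, rejecting unknown tags) before sending them to a TTS engine.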

Voice cloning needs only a 10–30 second reference sample, and production‑grade streaming inference is available through SGLang. The repository has 30.8 k stars; the license is CC‑BY‑NC‑SA‑4.0, which restricts commercial use.

SoulX‑Podcast: Designed for Podcast Scenarios

SoulX‑Podcast, from the Soul AI team, is a 1.7 B‑parameter system specialized for multi‑turn, multi‑speaker dialogue. It supports paralinguistic tags such as [laughter], [sigh], and [breath] to make generated conversations sound more natural.

The engine handles Mandarin, English, and regional dialects (Sichuan, Henan, Cantonese) and can perform zero‑shot voice cloning across dialects.

It can continuously generate over 90 minutes of multi‑speaker dialogue with stable timbre and context‑aware prosody. GitHub shows 3.4 k Stars and the project is Apache 2.0 licensed.

Supertonic 3: Ultra‑Lightweight On‑Device Solution

Supertonic, from the Supertone team (now part of HYBE), is an on‑device TTS system. Version 3 has only 99 M parameters and supports 31 languages.

The core advantage is speed: on consumer hardware it achieves 167× real‑time speed via ONNX Runtime, without requiring a GPU. It runs on Python, Node.js, browsers (WebGPU), Java, C++, Swift, Flutter, and other platforms.
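As a rough worked example of what an "N× real‑time" figure means, the sketch below converts a speed‑up into wall‑clock synthesis time. It assumes the speed‑up is simply audio duration divided by synthesis time and ignores model loading and I/O; the function names are illustrative.

```python
def synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to synthesize audio at `speedup` x real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def real_time_factor(speedup: float) -> float:
    """RTF is the inverse of the speed-up: synthesis time per second of audio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


if __name__ == "__main__":
    one_hour = 3600.0
    for label, speedup in [("167x real time", 167.0), ("6x real time", 6.0)]:
        secs = synthesis_seconds(one_hour, speedup)
        print(f"{label}: 1 h of audio in about {secs:.1f} s "
              f"(RTF {real_time_factor(speedup):.4f})")
```

At 167× real time, an hour of audio takes roughly 22 seconds of compute; at 6× it takes about 10 minutes.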

v3 adds ten expressive tags and outputs 44.1 kHz high‑quality audio. The repository has 12.3 k Stars and is MIT licensed, suitable for privacy‑sensitive local deployment.

Voicebox: One‑Stop AI Voice Studio

Voicebox is a local‑first AI voice desktop application that integrates seven TTS engines (including Qwen3‑TTS, Kokoro, Chatterbox, LuxTTS, and HumeAI TADA) and supports 23 languages.

It is not a single model but a full "voice I/O stack": it can synthesize speech, transcribe speech to text via Whisper, and polish text with a bundled local LLM. Features include voice cloning, multi‑track story editing, global transcription shortcuts, and MCP integration for AI coding assistants such as Claude Code.

It is built with Tauri (Rust) for native performance on macOS, Windows, and Linux. The project has 30 k stars and is MIT licensed.

Kokoro TTS: CLI Geek’s Favorite

Kokoro TTS is a command‑line text‑to‑speech tool based on an ONNX model with 82 M parameters, targeting terminal users.

It supports eight languages (including Mandarin), provides over 50 preset voices, and allows voice mixing (e.g., 60 % Sarah + 40 % Michael). Input formats include TXT, EPUB, PDF; it also supports streaming playback and chapter splitting.
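One plausible way to think about a "60 % Sarah + 40 % Michael" blend is a weighted average of per‑voice style vectors. The sketch below is an assumption for illustration only: the article does not say how kokoro‑tts implements mixing, and the tiny three‑dimensional vectors and the `mix_voices` helper are made up.

```python
from typing import Dict, List


def mix_voices(embeddings: Dict[str, List[float]],
               weights: Dict[str, float]) -> List[float]:
    """Blend voice vectors by a weighted average; weights are normalized."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    lengths = {len(embeddings[name]) for name in weights}
    if len(lengths) != 1:
        raise ValueError("all voice vectors must have the same length")

    mixed = [0.0] * lengths.pop()
    for name, weight in weights.items():
        share = weight / total  # e.g. 60 / 100 -> 0.6
        for i, value in enumerate(embeddings[name]):
            mixed[i] += share * value
    return mixed


if __name__ == "__main__":
    # Toy vectors standing in for real voice embeddings (hypothetical values).
    voices = {
        "sarah": [1.0, 0.0, 0.5],
        "michael": [0.0, 1.0, 0.5],
    }
    blend = mix_voices(voices, {"sarah": 60, "michael": 40})
    print([round(x, 3) for x in blend])
```

Normalizing the weights means percentages, fractions, or arbitrary ratios all give the same blend.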

Installation requires a single command: pip install kokoro-tts. After downloading the model, it is ready to use. The repository has 1.6 k Stars and is MIT licensed.

Pocket TTS: Lightweight CPU‑Only Option

Pocket TTS, developed by France’s Kyutai Lab, is a 100 M‑parameter model optimized for CPU execution.

On a MacBook Air M4 it reaches roughly 6× real‑time speed using only two CPU cores, with an initial frame latency of about 200 ms. It supports six languages (English, French, German, Portuguese, Italian, Spanish) and can run in browsers via WebAssembly.

The repository has 4.6 k Stars and is MIT licensed, making it suitable for edge devices, mobile, and browser scenarios.

Quick Project Overview

Qwen3‑TTS · 1.7 B params · 10 languages · Chinese support · 3 s voice cloning · 97 ms streaming · Apache 2.0 · ⭐ 11.9 k

dots.tts · 2 B params · 24+ languages · Chinese support · voice cloning · 85 ms streaming · Apache 2.0 · ⭐ ≈ 700

Fish Speech · 4 B params · ~50 languages · Chinese support · 10 s voice cloning · streaming · CC‑BY‑NC‑SA · ⭐ 30.8 k

SoulX‑Podcast · 1.7 B params · Mandarin, English, dialects · Chinese support · voice cloning · no streaming · Apache 2.0 · ⭐ 3.4 k

Supertonic 3 · 99 M params · 31 languages · no Chinese support · voice cloning · no streaming · MIT · ⭐ 12.3 k

Voicebox · Multi‑engine integration · 23 languages · Chinese support · voice cloning · streaming · MIT · ⭐ 30 k

Kokoro TTS · 82 M params · 8 languages · Chinese support · no cloning · streaming · MIT · ⭐ 1.6 k

Pocket TTS · 100 M params · 6 languages · no Chinese support · voice cloning · no streaming · MIT · ⭐ 4.6 k
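The overview can also be treated as a small data set and filtered by requirement. The sketch below copies the Chinese support, cloning, streaming, and license columns from the list above; the `Project` class and `shortlist` helper are illustrative, and treating CC‑BY‑NC‑SA as the only non‑commercial license is an assumption based solely on this list. Project names use plain ASCII hyphens.

```python
from dataclasses import dataclass
from typing import List

# Licenses in the overview that restrict commercial use.
NON_COMMERCIAL = {"CC-BY-NC-SA"}


@dataclass(frozen=True)
class Project:
    name: str
    license: str
    chinese: bool
    cloning: bool
    streaming: bool


# Values transcribed from the Quick Project Overview.
PROJECTS = [
    Project("Qwen3-TTS", "Apache 2.0", True, True, True),
    Project("dots.tts", "Apache 2.0", True, True, True),
    Project("Fish Speech", "CC-BY-NC-SA", True, True, True),
    Project("SoulX-Podcast", "Apache 2.0", True, True, False),
    Project("Supertonic 3", "MIT", False, True, False),
    Project("Voicebox", "MIT", True, True, True),
    Project("Kokoro TTS", "MIT", True, False, True),
    Project("Pocket TTS", "MIT", False, True, False),
]


def shortlist(*, chinese: bool = False, cloning: bool = False,
              streaming: bool = False, commercial: bool = False) -> List[str]:
    """Return names of projects that meet every requested requirement."""
    result = []
    for p in PROJECTS:
        if chinese and not p.chinese:
            continue
        if cloning and not p.cloning:
            continue
        if streaming and not p.streaming:
            continue
        if commercial and p.license in NON_COMMERCIAL:
            continue
        result.append(p.name)
    return result


if __name__ == "__main__":
    print(shortlist(chinese=True, streaming=True, commercial=True))
```

For example, requiring Chinese support, streaming, and a commercially usable license leaves Qwen3‑TTS, dots.tts, Voicebox, and Kokoro TTS.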

Official Evaluation Data (Seed‑TTS‑Eval)

dots.tts (SCA) : Chinese 0.94 % / English 1.30 % / Difficult 6.60 % / Avg 2.95 %

Qwen3‑TTS (1.7B) : Chinese 1.22 % / English 1.23 % / Difficult 6.76 % / Avg 3.07 %

CosyVoice 3 : Chinese 1.12 % / English 2.22 % / Difficult 5.83 % / Avg 3.06 %

VoxCPM 2 : Chinese 0.97 % / English 1.84 % / Difficult 8.13 % / Avg 3.65 %

All figures are word error rates (WER), so lower is better. dots.tts (SCA) has the lowest average of the four models listed, Qwen3‑TTS leads on English, and CosyVoice 3 is strongest on the difficult set.
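The Avg column and the per‑column leaders can be re‑derived from the three per‑set scores. The sketch below copies the numbers from the list above and assumes Avg is an unweighted mean of the Chinese, English, and difficult‑set values, which reproduces the listed averages to two decimals.

```python
from typing import Dict, Tuple

# (Chinese, English, Difficult) WER in percent, copied from the list above.
SCORES: Dict[str, Tuple[float, float, float]] = {
    "dots.tts (SCA)": (0.94, 1.30, 6.60),
    "Qwen3-TTS (1.7B)": (1.22, 1.23, 6.76),
    "CosyVoice 3": (1.12, 2.22, 5.83),
    "VoxCPM 2": (0.97, 1.84, 8.13),
}


def average(scores: Tuple[float, ...]) -> float:
    """Unweighted mean of the per-set WER values."""
    return sum(scores) / len(scores)


def best_on(column: int) -> str:
    """Model with the lowest WER in the given column (0=zh, 1=en, 2=hard)."""
    return min(SCORES, key=lambda name: SCORES[name][column])


if __name__ == "__main__":
    for name in sorted(SCORES, key=lambda n: average(SCORES[n])):
        print(f"{name}: avg {average(SCORES[name]):.2f} %")
    print("best English:", best_on(1))
    print("best difficult set:", best_on(2))
```

Note that the gap between second and third place on average (CosyVoice 3 versus Qwen3‑TTS) is about 0.01 percentage points, so the ordering below the top spot should not be over‑interpreted.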

How to Choose?

Prioritize audio quality and Chinese performance → dots.tts or Fish Speech. dots.tts tops the Seed‑TTS‑Eval benchmark; Fish Speech has the most mature ecosystem.

Need real‑time streaming → Qwen3‑TTS, with 97 ms end‑to‑end latency and audio output that can begin after a single input character.

Building podcasts or multi‑speaker dialogue → SoulX‑Podcast, the only project in this list designed for multi‑turn conversations, with support for dialects and paralinguistic tags.

Run locally on lightweight hardware → Supertonic 3, with 99 M parameters, 31 languages, and no GPU required.

Want an out‑of‑the‑box desktop app → Voicebox, which bundles seven engines with a single installer.

Command‑line power user / e‑book audio → Kokoro TTS, a one‑command setup.

Target edge devices or browsers → Pocket TTS, pure CPU execution with ~6× real‑time speed.

GitHub project links:

Qwen3‑TTS: https://github.com/QwenLM/Qwen3-TTS

dots.tts: https://github.com/rednote-hilab/dots.tts

Fish Speech: https://github.com/fishaudio/fish-speech

SoulX‑Podcast: https://github.com/Soul-AILab/SoulX-Podcast

Supertonic: https://github.com/supertone-inc/supertonic

Voicebox: https://github.com/jamiepine/voicebox

Kokoro TTS: https://github.com/nazdridoy/kokoro-tts

Pocket TTS: https://github.com/kyutai-labs/pocket-tts
