Building an Open‑Source TTS Evaluation Framework with ZipVoice, OmniVoice & Latest Benchmarks
This guide explains why TTS evaluation requires a three‑metric “iron triangle” (WER/CER, speaker similarity, and naturalness), introduces community benchmarks such as Seed‑TTS‑eval, TTSDS2, TTS Arena and TTSD‑eval, and provides a concrete six‑stage pipeline and best‑practice checklist for reproducible, production‑ready assessment.
Why TTS Is Harder to Measure than ASR
ASR uses a single community‑agreed metric (WER/CER). TTS must evaluate intelligibility (WER/CER via ASR listening), speaker similarity (cosine similarity of speaker embeddings), naturalness (UTMOS or subjective MOS), and, for streaming scenarios, first‑packet latency and real‑time factor.
Three‑Metric Iron Triangle
Intelligibility: synthesize the wav, transcribe it with a fixed ASR model, then compare the transcript with the reference text (WER for English, CER for Chinese). For results to be comparable with published numbers, the ASR backend must match the one used in the benchmark papers.
Speaker similarity (SIM): extract speaker embeddings from reference and synthesized audio and compute cosine similarity. ZipVoice uses k2-fsa/TTS_eval_models with ECAPA‑TDNN + WavLM; OmniVoice documents the same combination.
Naturalness (UTMOS): a neural network predicts MOS without human listening, suitable for CI regression; occasional human MOS/CMOS calibration is still recommended. The TTSDS2 paper (arXiv:2506.19441) reports a Spearman correlation of ~0.67 with human MOS, better than older metrics.
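The scoring arithmetic behind the first two corners of the triangle can be sketched in a few lines. This is a minimal illustration, not the ZipVoice or Seed‑TTS‑eval implementation: the normalize function is a hypothetical stand-in for each benchmark's own text normalization, and in practice the transcripts come from the fixed ASR model and the embeddings from a speaker model rather than from hand-written values.

```python
import math
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    """Illustrative normalization only (assumption): lowercase, drop punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)


def cer(reference, hypothesis):
    """Character error rate: character-level edits, whitespace ignored."""
    ref = list(re.sub(r"\s+", "", normalize(reference)))
    hyp = list(re.sub(r"\s+", "", normalize(hypothesis)))
    return edit_distance(ref, hyp) / max(len(ref), 1)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Because both error rates divide by the reference length, a hypothesis with many insertions can score above 100 %, which is worth remembering when averaging over a test set.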
Community Benchmarks (2025‑2026)
Seed‑TTS‑eval: zero‑shot benchmark defined by ByteDance, providing English WER (Whisper), Chinese CER (Paraformer), and speaker SIM. Example: Qwen3‑TTS 1.7B achieves EN WER 1.5 % and ZH CER 1.33 %.
TTSDS / TTSDS2: open‑source Python package (ttsds v2.1.0, 2025) that weights intelligibility, speaker identity, and naturalness; multilingual, with test sets updated quarterly; leaderboard at https://ttsdsbenchmark.com.
TTS Arena: subjective Elo ranking based on blind A/B listening, reflecting perceived quality rather than WER. Notable models such as Kokoro 82M rank highly, sometimes diverging from objective scores.
TTSD‑eval: OpenMOSS repository at https://github.com/OpenMOSS/TTSD-eval for dialogue TTS, measuring ACC, SIM, and WER using MMS‑FA alignment and wespeaker embeddings. MOSS‑TTSD v1.0 reports ZH ACC 95.87 % and ZH WER 4.85 %.
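To see why an arena ranking can diverge from objective scores, it helps to look at how pairwise votes turn into ratings. The sketch below uses the textbook Elo update; the K‑factor of 32 and the 400‑point scale are conventional defaults chosen for illustration, and TTS Arena's actual rating procedure may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, a_wins, k=32.0):
    """Return the new (rating_a, rating_b) after one blind A/B vote."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

Only listener preference enters the update, so a model that sounds pleasant but occasionally drops words can still climb the ranking.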
Test‑Set Zoo
Four levels of test sets are recommended:
L1 – quick regression: 50 English and 50 Chinese samples from Seed‑TTS.
L2 – full zero‑shot standard: complete Seed‑TTS‑eval test set plus LibriSpeech‑PC.
L3 – multilingual extension: FLEURS / MiniMax Multilingual (supported by OmniVoice).
L4 – dialogue scenario: 100 multi‑turn conversations (30 s–720 s) from TTSD‑eval.
ZipVoice’s run_eval.sh script automatically downloads k2-fsa/TTS_eval_datasets and k2-fsa/TTS_eval_models, providing a plug‑and‑play evaluation asset bundle.
Six‑Stage Objective Pipeline
Stage 1 – Test set & manifest: create test.tsv with utt_id, reference text, reference audio path, speaker ID, and fixed sampling rate (16 kHz or 24 kHz).
Stage 2 – Batch synthesis: run inference with unified hyper‑parameters (seed, temperature, top_p, streaming flag) and output results/{testset}/*.wav matching utt_id.
Stages 3–5 – Metric computation:
SIM: zipvoice.eval.speaker_similarity.sim
WER/CER: hubert (LibriSpeech‑PC) or Seed‑TTS back‑ends
UTMOS: zipvoice.eval.mos.utmos
Stage 6 – Aggregation & ranking: store JSON/CSV with dataset, model_id, WER, SIM, UTMOS, date, git_hash; optionally compute TTSDS2 composite score and combine with subjective Arena results.
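The bookkeeping at both ends of the pipeline (manifest in, result record out) can be sketched as follows. The column names text, ref_wav, speaker_id, and sample_rate, along with the helper names, are hypothetical choices for illustration; only utt_id and the record fields are taken from the stages above.

```python
import csv
import io
import json
from datetime import date


def read_manifest(tsv_text):
    """Parse a tab-separated manifest that starts with a header row."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def missing_outputs(manifest_rows, wav_names):
    """Return utt_ids that have no matching <utt_id>.wav among the outputs."""
    produced = {name[:-4] for name in wav_names if name.endswith(".wav")}
    return [row["utt_id"] for row in manifest_rows if row["utt_id"] not in produced]


def make_record(dataset, model_id, wer, sim, utmos, git_hash, run_date=None):
    """Build one result row with the Stage 6 fields."""
    return {
        "dataset": dataset,
        "model_id": model_id,
        "WER": round(wer, 4),
        "SIM": round(sim, 4),
        "UTMOS": round(utmos, 4),
        "date": (run_date or date.today()).isoformat(),
        "git_hash": git_hash,
    }


manifest = (
    "utt_id\ttext\tref_wav\tspeaker_id\tsample_rate\n"
    "u1\thello\tref/u1.wav\tspk1\t24000\n"
    "u2\tworld\tref/u2.wav\tspk2\t24000\n"
)
rows = read_manifest(manifest)
print(missing_outputs(rows, ["u1.wav"]))
print(json.dumps(make_record("demo-set", "my-tts", 0.015, 0.71, 4.2, "abc1234")))
```

Checking for missing outputs before scoring keeps a silent synthesis failure from shrinking the test set and flattering the averages.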
Real‑Time Extra Metrics
For streaming models (e.g., MOSS‑TTS), report first‑packet audio latency (TTFA) and streaming RTF in addition to WER/SIM.
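Both numbers can be measured by timing any iterator of audio chunks. The sketch below assumes a hypothetical streaming interface in which the model yields sequences of mono samples; RTF is taken here as wall-clock synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time.

```python
import time


def measure_streaming(chunks, sample_rate, clock=time.perf_counter):
    """Time a stream of audio chunks and report TTFA and streaming RTF.

    chunks: iterator yielding sequences of samples (hypothetical interface).
    clock:  injectable timer, which makes the function easy to test.
    """
    start = clock()
    first_chunk_at = None
    total_samples = 0
    for chunk in chunks:
        if first_chunk_at is None and len(chunk) > 0:
            first_chunk_at = clock()  # first audible packet has arrived
        total_samples += len(chunk)
    end = clock()
    if first_chunk_at is None:
        raise ValueError("stream produced no audio")
    audio_seconds = total_samples / sample_rate
    return {
        "ttfa_s": first_chunk_at - start,
        "rtf": (end - start) / audio_seconds,
        "audio_s": audio_seconds,
    }
```

The timer should start when the request is issued, not when the model finishes loading, otherwise TTFA values from different runs are not comparable.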
Pitfalls & Best‑Practice Rules
Never change the ASR listening model when comparing against Seed‑TTS official tables.
Do not mix speaker embeddings from different networks (WavLM vs ECAPA) in a single SIM report.
Always report WER/CER alongside UTMOS; “good‑sound‑only” scores are misleading for production.
Report Chinese results as CER and English as WER to match Seed‑TTS‑eval column names.
Fix reference audio slice length and apply normalization/trim‑silence for zero‑shot tests.
Separate streaming and offline leaderboards.
Seven‑Step Implementation Checklist
Fork ZipVoice run_eval.sh as a team template; lock Hugging Face dataset and evaluation model versions.
Wrap your own TTS as an inference module that outputs wav files and a manifest matching test.tsv.
Run a smoke test on 20 English and 20 Chinese Seed‑TTS samples to verify all three metrics.
Configure CI: PR triggers L1; release triggers L2 full run and JSON archiving.
Each quarter, sample 20–50 utterances for human MOS/CMOS rating to calibrate UTMOS bias.
For dialogue products, add TTSD‑eval; optionally rank overall quality with TTSDS2.
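The quarterly calibration step in the checklist comes down to two numbers: how well UTMOS ranks utterances the way listeners do, and how far its scores sit above or below theirs. A minimal pure-Python sketch, with hypothetical helper names and tie-aware ranks for the Spearman correlation:

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_bias(predicted, human):
    """Average of (predicted - human); positive means UTMOS over-scores."""
    return sum(p - h for p, h in zip(predicted, human)) / len(human)
```

With only 20 to 50 utterances the correlation estimate is noisy, so it is better read as a drift alarm between quarters than as a precise figure.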
Summary
The three corners of the TTS evaluation triangle (WER/CER, SIM, and UTMOS/MOS) must be measured together. Seed‑TTS‑eval is the de facto zero‑shot standard; TTSDS2 provides a comprehensive objective score; TTS Arena captures subjective quality. For dialogue TTS, add ACC from TTSD‑eval. The recommended engineering stack is k2-fsa/TTS_eval_datasets, k2-fsa/TTS_eval_models, and the ZipVoice evaluation scripts, used with the six‑stage pipeline and the seven‑step checklist to ensure reproducibility and production readiness.
