Inside Artificial Analysis: Independent AI Voice Benchmarks for ASR, TTS, and Speech‑to‑Speech

Artificial Analysis provides an independent, reproducible benchmarking platform for voice AI, offering objective WER scores for ASR, Elo‑based blind‑listening scores for TTS, and three‑dimensional metrics for end‑to‑end speech dialogue, together with detailed methodology, top‑model rankings, and practical guidance for developers to choose the most suitable model and provider for their scenarios.

Weekly Large Model Application

Jun 23, 2026

Overview

Artificial Analysis (artificialanalysis.ai) is an independent third‑party benchmarking platform for AI voice models. It evaluates three capability lines separately: speech‑to‑text (ASR), text‑to‑speech (TTS), and end‑to‑end speech dialogue (speech‑to‑speech). ASR and TTS models can be compared on their own or combined into a pipeline, and the whole pipeline can be replaced by a single end‑to‑end model.

Evaluation Dimensions

ASR: word error rate (AA‑WER), streaming latency, price.

TTS: blind‑listening Elo score, synthesis speed, price.

Speech‑to‑Speech: audio‑question → audio‑answer correctness (reasoning), dialogue fluency, first‑word latency.

ASR Evaluation (AA‑WER v2)

The core metric is AA‑WER, defined as (substitutions + insertions + deletions) ÷ total reference words. Test data comprise roughly 8 hours of English audio weighted across three datasets:

AA‑AgentTalk – 50 % – voice‑agent commands (customer service, smart‑home, etc.).

VoxPopuli‑Cleaned‑AA – 25 % – European Parliament speeches with multiple accents.

Earnings22‑Cleaned‑AA – 25 % – earnings‑call recordings containing technical terminology and overlapping speakers.
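
The definition and weighting above can be sketched in a few lines of Python. This is a minimal illustration, not Artificial Analysis's implementation: the text normalization (lowercasing and whitespace splitting) and the choice to combine per‑dataset scores as a weighted average are assumptions, since the article only states that the datasets are weighted 50/25/25. The function names are hypothetical.

```python
from typing import Mapping, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum substitutions + insertions + deletions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: total edits divided by the number of reference words."""
    # Assumption: naive normalization; real benchmarks normalize more carefully.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Dataset weights as stated in the article.
WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def weighted_wer(per_dataset_wer: Mapping[str, float]) -> float:
    """Assumption: the overall score is a weighted average of per-dataset WER."""
    return sum(weight * per_dataset_wer[name] for name, weight in WEIGHTS.items())


if __name__ == "__main__":
    # One dropped word and one substituted word against a 5-word reference.
    print(wer("turn on the kitchen lights", "turn on kitchen light"))
```

Because insertions count as errors, WER can exceed 100 % when a model hallucinates many extra words.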

Two leaderboards are provided: offline batch processing (/speech-to-text/batch) and streaming transcription (/speech-to-text/streaming). Streaming measures both “audio‑end → final transcript” latency and “audio‑end → first partial” latency.

Top‑5 ASR Accuracy (offline)

1️⃣ Fun‑Realtime‑ASR‑preview (Alibaba Cloud) – AA‑WER 1.7 %.

2️⃣ Scribe v2 (ElevenLabs) – AA‑WER 2.2 %.

3️⃣ MAI‑Transcribe‑1.5 (Microsoft Azure) – AA‑WER 2.4 %.

4️⃣ MAI‑Transcribe‑1 (Microsoft Azure) – AA‑WER 2.6 %.

5️⃣ Gemini 3 Pro (High) (Google) – AA‑WER 2.7 %.

The leaderboard covers 57 models. The 1.7 % error rate of Fun‑Realtime‑ASR‑preview, roughly 17 word errors per 1,000 reference words, is comparable to professional transcription quality.

Speed and Cost (ASR)

Accuracy leader: Fun‑Realtime‑ASR‑preview – AA‑WER 1.7 %.

Speed leader: Parakeet TDT 0.6B V3 (Together AI) – 996.8× real‑time (≈ 997 s of audio transcribed per second of processing).

Price leader: Modulate STT Batch VFast – $0.417 / 1000 min.

Open‑source best: Voxtral Small (Mistral) – AA‑WER 2.8 %.

Selection guidance: choose the model that aligns with the primary priority – highest accuracy, fastest throughput, or lowest cost.
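
To make the speed and price figures concrete, the sketch below converts them into processing time and cost for a given workload. The helper names are hypothetical, and the arithmetic assumes the speed factor and the per‑minute price scale linearly with audio length. Note that the two figures belong to different models, so no single model delivers both at once.

```python
def processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Time to transcribe audio at a given multiple of real time."""
    return audio_seconds / speed_factor


def batch_cost_usd(audio_minutes: float, usd_per_1000_min: float) -> float:
    """Cost of transcribing audio at a price quoted per 1,000 minutes."""
    return audio_minutes / 1000 * usd_per_1000_min


if __name__ == "__main__":
    # One hour of audio at the speed leader's 996.8x real-time factor.
    print(f"{processing_seconds(3600, 996.8):.2f} s")   # 3.61 s
    # 100 hours (6,000 minutes) at the price leader's $0.417 per 1,000 minutes.
    print(f"${batch_cost_usd(6000, 0.417):.2f}")        # $2.50
```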

TTS Evaluation (Speech Arena + Elo)

Quality is measured via a blind‑listening arena similar to Chatbot Arena. Users hear two audio samples of the same text and vote for the more natural one; an Elo algorithm converts votes into a ranking. Each model is evaluated on eight voice‑style combinations spanning male/female voices and American/British accents, at a uniform 22.05 kHz sample rate.
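
How pairwise votes become ratings can be sketched with the classic Elo update. This is an illustration only: the article does not say which Elo variant Artificial Analysis uses, so the 400‑point scale, the K‑factor of 32, the starting rating of 1000, and the one‑vote‑at‑a‑time update are all assumptions, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, a_won: bool,
           k: float = 32.0) -> Tuple[float, float]:
    """Return both ratings after one blind-listening vote."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate(votes: Iterable[Tuple[str, str]], start: float = 1000.0) -> Dict[str, float]:
    """Apply (winner, loser) votes in order, starting every model at `start`."""
    ratings: Dict[str, float] = {}
    for winner, loser in votes:
        r_w = ratings.get(winner, start)
        r_l = ratings.get(loser, start)
        ratings[winner], ratings[loser] = update(r_w, r_l, a_won=True)
    return ratings


if __name__ == "__main__":
    # Hypothetical votes between two placeholder models.
    print(rate([("model_x", "model_y"), ("model_x", "model_y"), ("model_y", "model_x")]))
```

Under these assumptions, an upset win over a higher‑rated model moves both ratings more than an expected win, which is why a small rating gap translates into a near coin‑flip preference.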

Top‑5 TTS Quality (Elo)

1️⃣ Fun‑Realtime‑TTS (Alibaba Cloud) – Elo 1225–1226.

2️⃣ Gemini 3.1 Flash TTS (Google) – Elo 1215–1220.

3️⃣ Realtime TTS‑2 Research Preview (Inworld) – Elo 1208–1213.

4️⃣ Sonic 3.5 (Cartesia) – Elo 1197–1203.

5️⃣ xAI Text to Speech (xAI) – Elo 1196–1200.

The Elo spread among the top five is only about 25 to 30 points (1225–1226 for first place versus 1196–1200 for fifth), indicating intense competition. Fun‑Realtime‑TTS costs about $27.6 per million characters and supports real‑time synthesis, voice cloning, multilingual output, and dialects.

Open‑Source, Speed, and Price (TTS)

Open‑source quality leader: Fish Audio S2 Pro – Elo 1118.

Speed leader: Polly Standard (Amazon) – 1205 characters / second.

Price leader: Kokoro 82M v1.0 (Replicate) – $0.65 / million characters.

Best quality‑price frontier: Fun‑Realtime‑TTS and Gemini 3.1 Flash TTS.

Leaderboards can be filtered by scenario (knowledge sharing, assistants, entertainment, customer service) and accent.

End‑to‑End Speech Dialogue (Speech‑to‑Speech) Evaluation

For end‑to‑end models that replace the cascade architecture (ASR + LLM + TTS), Artificial Analysis provides a dedicated /speech-to-speech leaderboard that measures “audio‑in → audio‑out” performance directly.

Core Metrics

Speech Reasoning (Big Bench Audio): audio‑question → audio‑answer correctness.

Dialogue Fluency (Full Duplex Bench): turn‑taking, pauses, interruptions, back‑channeling.

First‑Word Latency (Big Bench Audio subset): time from end of input to model’s first output.

Top Models (combined metrics)

Fun‑Realtime‑Audiochat (Alibaba Cloud) – Reasoning 97.6 %, Fluency 97.8 %, Latency 1.39 s.

Step‑Audio R1.1 (StepFun) – Reasoning 97.6 %, Latency 1.51 s (fluency not reported).

GPT‑Realtime‑2 (High) (OpenAI) – Reasoning 97 %, Fluency 95.3 %, Latency 2.33 s.

Gemini 3.1 Flash Live (High) (Google) – Reasoning 97 %, Latency 2.98 s (fluency not reported).

Gemini 2.5 Flash Native Audio (Google) – Reasoning 69 %, Latency 0.63 s.

Grok Voice Agent (xAI) – Reasoning 93 %, Fluency 71.6 %, Latency 0.78 s.

Key insight: high reasoning accuracy does not guarantee smooth dialogue or low latency. Model choice should be driven by whether “smart”, “fast”, or “human‑like” behavior is most important for the target use case.
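
One way to act on this insight is to treat the most important dimension as a hard constraint and optimize the rest. The sketch below uses the figures listed above; the selection rule itself (a latency budget, an optional fluency floor, then highest reasoning score) is an illustrative assumption, not something the platform prescribes, and all names in the code are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VoiceModel:
    name: str
    reasoning_pct: float
    latency_s: float
    fluency_pct: Optional[float] = None  # None where the article lists no fluency score


# Figures as quoted in the article's model list.
MODELS: List[VoiceModel] = [
    VoiceModel("Fun-Realtime-Audiochat", 97.6, 1.39, 97.8),
    VoiceModel("Step-Audio R1.1", 97.6, 1.51),
    VoiceModel("GPT-Realtime-2 (High)", 97.0, 2.33, 95.3),
    VoiceModel("Gemini 3.1 Flash Live (High)", 97.0, 2.98),
    VoiceModel("Gemini 2.5 Flash Native Audio", 69.0, 0.63),
    VoiceModel("Grok Voice Agent", 93.0, 0.78, 71.6),
]


def best_within_latency(models: List[VoiceModel], max_latency_s: float,
                        min_fluency_pct: Optional[float] = None) -> Optional[VoiceModel]:
    """Highest-reasoning model that fits the latency budget.

    If a fluency floor is given, models without a reported fluency score are
    excluded, because the floor cannot be verified for them.
    """
    eligible = [m for m in models if m.latency_s <= max_latency_s]
    if min_fluency_pct is not None:
        eligible = [m for m in eligible
                    if m.fluency_pct is not None and m.fluency_pct >= min_fluency_pct]
    if not eligible:
        return None
    # Ties on reasoning are broken in favor of lower latency.
    return max(eligible, key=lambda m: (m.reasoning_pct, -m.latency_s))


if __name__ == "__main__":
    print(best_within_latency(MODELS, max_latency_s=1.0))
    print(best_within_latency(MODELS, max_latency_s=3.0, min_fluency_pct=90.0))
```

With a one‑second budget the pick is Grok Voice Agent; relaxing the budget to three seconds and requiring at least 90 % fluency selects Fun‑Realtime‑Audiochat.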

Case Study – Alibaba Fun‑Realtime Series "Grand Slam"

ASR (listen): Fun‑Realtime‑ASR‑preview – World #1 – AA‑WER 1.7 %.

TTS (speak): Fun‑Realtime‑TTS – World #1 – Elo 1226.

Speech‑to‑Speech (chat): Fun‑Realtime‑Audiochat – Reasoning #1 & Fluency #1 – 97.6 % / 97.8 %.

These rankings are specific to the Artificial Analysis platform, its metrics, and the snapshot in time; other leaderboards (e.g., HuggingFace TTS Arena) may show different results.

How Developers Can Use the Platform

Identify the target scenario (real‑time customer service, offline transcription, dubbing, multilingual support, etc.).

Open the relevant leaderboard and locate Pareto‑optimal points on the “accuracy vs price” or “quality vs speed” scatter plots.

Click a model’s Details to compare API providers – the same model from different vendors can differ by up to 5× in speed or cost.

Watch the Changelog – rankings update almost weekly, so conclusions have limited shelf‑life.

For Chinese or dialect use‑cases, note that AA benchmarks are English‑centric; additional local testing is required.
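
The Pareto‑optimal points mentioned in the second step can also be computed from exported leaderboard numbers rather than read off a scatter plot. The sketch below is a minimal example, assuming a "higher is better" quality metric and a "lower is better" cost metric; it reuses the speech‑to‑speech reasoning and latency figures quoted earlier in this article as sample data. The function name and tuple layout are hypothetical.

```python
from typing import List, Tuple

# (name, quality: higher is better, cost: lower is better)
Point = Tuple[str, float, float]


def pareto_front(points: List[Point]) -> List[str]:
    """Names of points that no other point beats or matches on both axes
    while being strictly better on at least one."""
    front = []
    for name, quality, cost in points:
        dominated = any(
            q >= quality and c <= cost and (q > quality or c < cost)
            for _, q, c in points
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    # Reasoning accuracy (%) versus first-word latency (s), as quoted above.
    speech_to_speech = [
        ("Fun-Realtime-Audiochat", 97.6, 1.39),
        ("Step-Audio R1.1", 97.6, 1.51),
        ("GPT-Realtime-2 (High)", 97.0, 2.33),
        ("Gemini 3.1 Flash Live (High)", 97.0, 2.98),
        ("Gemini 2.5 Flash Native Audio", 69.0, 0.63),
        ("Grok Voice Agent", 93.0, 0.78),
    ]
    print(pareto_front(speech_to_speech))
```

On these figures the frontier contains three models: Fun‑Realtime‑Audiochat, Gemini 2.5 Flash Native Audio, and Grok Voice Agent. Every other listed model is matched or beaten on both axes by one of them, so the real choice is between points on the frontier.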

Key Takeaways

Artificial Analysis provides objective ASR evaluation (AA‑WER) and subjective TTS evaluation (Elo), plus a three‑dimensional assessment for end‑to‑end speech dialogue.

Current ASR leader: Fun‑Realtime‑ASR‑preview with 1.7 % WER.

Current TTS leader: Fun‑Realtime‑TTS with Elo 1226.

End‑to‑end dialogue leader on reasoning and fluency: Fun‑Realtime‑Audiochat (97.6 % / 97.8 %).

Rankings evolve rapidly; model selection must be based on concrete scenario testing rather than static “world‑first” claims.

Reference URLs

Home page: https://artificialanalysis.ai/

ASR leaderboard: https://artificialanalysis.ai/speech-to-text

ASR streaming: https://artificialanalysis.ai/speech-to-text/streaming

TTS leaderboard: https://artificialanalysis.ai/text-to-speech/leaderboard

TTS blind‑listening arena: https://artificialanalysis.ai/text-to-speech/arena

Speech‑to‑Speech leaderboard: https://artificialanalysis.ai/speech-to-speech

ASR methodology details: https://artificialanalysis.ai/speech-to-text/methodology

