OpenAI Unveils New STT and TTS Models: gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts – Performance, Pricing, and Demo
OpenAI announced three new speech models—two STT models (gpt-4o-transcribe and its lightweight gpt-4o-mini-transcribe) and one TTS model (gpt-4o-mini-tts)—showcasing strong accuracy on multilingual benchmarks, competitive pricing, and a quick‑start API demo for developers.
OpenAI surprised the community with a late‑night live stream announcing three new speech models: the high‑performance STT model gpt-4o-transcribe, its smaller counterpart gpt-4o-mini-transcribe, and the new TTS model gpt-4o-mini-tts. All three are available via API.
The STT models work like Whisper, converting audio to text, and automatically apply noise reduction and speaker filtering. On the multilingual FLEURS benchmark they achieve a lower Word Error Rate (WER) than OpenAI's previous models, with strong results across most languages; Chinese remains a weak spot.
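To make the API shape concrete, here is a sketch of calling the transcription endpoint with only the Python standard library (the official openai SDK wraps the same request); the endpoint and field names follow OpenAI's audio API docs, while "meeting.mp3" is just a placeholder filename:

```python
import io
import json
import os
import urllib.request
import uuid

def build_multipart(path: str, model: str) -> tuple[bytes, str]:
    """Encode the audio file and model name as multipart/form-data."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    # Plain text field carrying the model name.
    buf.write(
        f"--{boundary}\r\nContent-Disposition: form-data; "
        f'name="model"\r\n\r\n{model}\r\n'.encode()
    )
    # File field carrying the raw audio bytes.
    buf.write(
        f"--{boundary}\r\nContent-Disposition: form-data; "
        f'name="file"; filename="{os.path.basename(path)}"\r\n'
        f"Content-Type: audio/mpeg\r\n\r\n".encode()
    )
    with open(path, "rb") as f:
        buf.write(f.read())
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), boundary

# Only send the request when a key and an audio file are actually present.
key = os.environ.get("OPENAI_API_KEY")
if key and os.path.exists("meeting.mp3"):
    body, boundary = build_multipart("meeting.mp3", "gpt-4o-mini-transcribe")
    req = urllib.request.Request(
        "https://api.openai.com/v1/audio/transcriptions",
        data=body,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["text"])
```

Swapping the model string to gpt-4o-transcribe upgrades to the larger model without any other change.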
The TTS model gpt-4o-mini-tts produces natural‑sounding English speech and, although its Chinese output is still rough, demonstrates expressive audio generation with a configurable voice and a free‑form style ("vibe") prompt. A demo site (https://www.openai.fm/) lets users try the model for free.
Pricing is competitive: gpt-4o-transcribe costs about $0.006 per minute (≈ ¥0.04), gpt-4o-mini-transcribe $0.003 per minute (≈ ¥0.02), and gpt-4o-mini-tts $0.015 per minute (≈ ¥0.1), undercutting many competing services such as ElevenLabs or MiniMax.
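As a quick sanity check on those per‑minute rates, a tiny helper (prices are the USD figures quoted above):

```python
# Per-minute USD prices as quoted in the announcement.
PRICES_PER_MIN = {
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-4o-mini-tts": 0.015,
}

def cost(model: str, minutes: float) -> float:
    """Estimated cost in USD for a given number of audio minutes."""
    return PRICES_PER_MIN[model] * minutes

# Transcribing one hour of audio with the mini STT model:
print(f"${cost('gpt-4o-mini-transcribe', 60):.2f}")  # $0.18
```

So an hour of transcription with the mini model costs less than a fifth of a dollar, which is where the comparison with competing services comes from.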
One of the demo site's "vibe" presets, for example, is written as a free‑form style prompt:

Voice: High-energy, upbeat, and encouraging, projecting enthusiasm and motivation.
Punctuation: Short, punchy sentences with strategic pauses to maintain excitement and clarity.
Delivery: Fast-paced and dynamic, with rising intonation to build momentum and keep engagement high.
Phrasing: Action-oriented and direct, using motivational cues to push participants forward.
Tone: Positive, energetic, and empowering, creating an atmosphere of encouragement and achievement.

Developers can integrate the models with around ten lines of code using the OpenAI audio API (https://platform.openai.com/docs/guides/audio). The article concludes by recommending gpt-4o-mini-transcribe for cost‑effective English STT and suggesting alternative Chinese TTS services such as MiniMax for better Mandarin quality.
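A vibe prompt like the one above maps onto the speech endpoint's instructions field. A minimal stdlib sketch (the openai SDK wraps the same endpoint; the voice name "coral" is one of the built-in voices, and the output path is a placeholder):

```python
import json
import os
import urllib.request

# Request body for the /v1/audio/speech endpoint. "instructions" takes
# a free-form style prompt like the vibe preset shown above.
body = {
    "model": "gpt-4o-mini-tts",
    "voice": "coral",
    "input": "Great work, everyone! Let's keep that momentum going!",
    "instructions": (
        "Voice: high-energy, upbeat, and encouraging. "
        "Delivery: fast-paced and dynamic, with rising intonation."
    ),
}
payload = json.dumps(body).encode("utf-8")

# Only send the request when an API key is configured.
key = os.environ.get("OPENAI_API_KEY")
if key:
    req = urllib.request.Request(
        "https://api.openai.com/v1/audio/speech",
        data=payload,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
    )
    # The endpoint streams back raw audio bytes (MP3 by default).
    with urllib.request.urlopen(req) as resp, open("speech.mp3", "wb") as f:
        f.write(resp.read())
```

Changing only the instructions string reshapes the delivery, which is the point of the vibe presets on the demo site.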
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.