OpenAI Unveils New STT and TTS Models: gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts – Performance, Pricing, and Demo
OpenAI announced three new speech models—two STT models (gpt-4o-transcribe and its lightweight gpt-4o-mini-transcribe) and one TTS model (gpt-4o-mini-tts)—showcasing strong accuracy on multilingual benchmarks, competitive pricing, and a quick‑start API demo for developers.
OpenAI surprised the community with a late‑night live stream announcing three new speech models: the high‑performance STT model gpt-4o-transcribe, its smaller counterpart gpt-4o-mini-transcribe, and the new TTS model gpt-4o-mini-tts. All three are available via API.
The STT models work like Whisper, converting audio to text, and automatically apply noise reduction and speaker filtering. On the multilingual FLEURS benchmark they achieve a lower Word Error Rate (WER) than OpenAI's previous models, with strong results across most languages; Chinese remains a weak spot.
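To make the API shape concrete, here is a sketch of calling the transcription endpoint with only the Python standard library (the official openai SDK wraps the same request); the endpoint and field names follow OpenAI's audio API docs, while "meeting.mp3" is just a placeholder filename:

```python
import io
import json
import os
import urllib.request
import uuid

def build_multipart(path: str, model: str) -> tuple[bytes, str]:
    """Encode the audio file and model name as multipart/form-data."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    # Plain text field carrying the model name.
    buf.write(
        f"--{boundary}\r\nContent-Disposition: form-data; "
        f'name="model"\r\n\r\n{model}\r\n'.encode()
    )
    # File field carrying the raw audio bytes.
    buf.write(
        f"--{boundary}\r\nContent-Disposition: form-data; "
        f'name="file"; filename="{os.path.basename(path)}"\r\n'
        f"Content-Type: audio/mpeg\r\n\r\n".encode()
    )
    with open(path, "rb") as f:
        buf.write(f.read())
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), boundary

# Only send the request when a key and an audio file are actually present.
key = os.environ.get("OPENAI_API_KEY")
if key and os.path.exists("meeting.mp3"):
    body, boundary = build_multipart("meeting.mp3", "gpt-4o-mini-transcribe")
    req = urllib.request.Request(
        "https://api.openai.com/v1/audio/transcriptions",
        data=body,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["text"])
```

Swapping the model string to gpt-4o-transcribe upgrades to the larger model without any other change.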
The TTS model gpt-4o-mini-tts produces natural‑sounding English speech and, although its Chinese output is still rough, demonstrates expressive audio generation with a configurable voice and a free‑form style ("vibe") prompt. A demo site (https://www.openai.fm/) lets users try the model for free.
Pricing is competitive: gpt-4o-transcribe costs about $0.006 per minute (≈ ¥0.04), gpt-4o-mini-transcribe $0.003 per minute (≈ ¥0.02), and gpt-4o-mini-tts $0.015 per minute (≈ ¥0.1), undercutting many competing services such as ElevenLabs or MiniMax.
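As a quick sanity check on those per‑minute rates, a tiny helper (prices are the USD figures quoted above):

```python
# Per-minute USD prices as quoted in the announcement.
PRICES_PER_MIN = {
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-4o-mini-tts": 0.015,
}

def cost(model: str, minutes: float) -> float:
    """Estimated cost in USD for a given number of audio minutes."""
    return PRICES_PER_MIN[model] * minutes

# Transcribing one hour of audio with the mini STT model:
print(f"${cost('gpt-4o-mini-transcribe', 60):.2f}")  # $0.18
```

So an hour of transcription with the mini model costs less than a fifth of a dollar, which is where the comparison with competing services comes from.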
One of the demo site's "vibe" presets, for example, is written as a free‑form style prompt:

Voice: High-energy, upbeat, and encouraging, projecting enthusiasm and motivation.
Punctuation: Short, punchy sentences with strategic pauses to maintain excitement and clarity.
Delivery: Fast-paced and dynamic, with rising intonation to build momentum and keep engagement high.
Phrasing: Action-oriented and direct, using motivational cues to push participants forward.
Tone: Positive, energetic, and empowering, creating an atmosphere of encouragement and achievement.

Developers can integrate the models with around ten lines of code using the OpenAI audio API (https://platform.openai.com/docs/guides/audio). The article concludes by recommending gpt-4o-mini-transcribe for cost‑effective English STT and suggesting alternative Chinese TTS services such as MiniMax for better Mandarin quality.
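A vibe prompt like the one above maps onto the speech endpoint's instructions field. A minimal stdlib sketch (the openai SDK wraps the same endpoint; the voice name "coral" is one of the built-in voices, and the output path is a placeholder):

```python
import json
import os
import urllib.request

# Request body for the /v1/audio/speech endpoint. "instructions" takes
# a free-form style prompt like the vibe preset shown above.
body = {
    "model": "gpt-4o-mini-tts",
    "voice": "coral",
    "input": "Great work, everyone! Let's keep that momentum going!",
    "instructions": (
        "Voice: high-energy, upbeat, and encouraging. "
        "Delivery: fast-paced and dynamic, with rising intonation."
    ),
}
payload = json.dumps(body).encode("utf-8")

# Only send the request when an API key is configured.
key = os.environ.get("OPENAI_API_KEY")
if key:
    req = urllib.request.Request(
        "https://api.openai.com/v1/audio/speech",
        data=payload,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
    )
    # The endpoint streams back raw audio bytes (MP3 by default).
    with urllib.request.urlopen(req) as resp, open("speech.mp3", "wb") as f:
        f.write(resp.read())
```

Changing only the instructions string reshapes the delivery, which is the point of the vibe presets on the demo site.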
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.