Artificial Intelligence 10 min read

What Makes OpenAI’s New GPT‑4o a Game‑Changing Multimodal AI?

OpenAI’s latest flagship model GPT‑4o combines text, audio, image and video processing in a single, faster, cheaper multimodal system that delivers near‑human response times, expanded API access, and new safety measures, reshaping how developers and users interact with AI.

21CTO

May 14, 2024

What Makes OpenAI’s New GPT‑4o a Game‑Changing Multimodal AI?

OpenAI unveiled GPT‑4o, its newest flagship multimodal model whose "o" stands for "omni," enabling the system to accept and generate text, audio, image, and video inputs.

The model delivers GPT‑4‑level intelligence to all users, including free accounts, and introduces a macOS desktop app for Plus users with broader rollout planned.

Key Technical Highlights

Multimodal Capability : Handles text, audio, and images, processing them end‑to‑end within a single neural network.

Real‑time Audio Response : Responds to audio in as little as 232 ms (average 320 ms), matching human conversational latency.

Speed and Cost Efficiency : Generates text twice as fast as GPT‑4 Turbo, costs 50 % less, and offers five‑fold higher rate limits via the API.

Token Compression : New tokenizer reduces token count across languages, improving throughput.

Advanced Vision : Interprets images, answers visual questions, and understands object relationships, useful for healthcare, retail, and security.

Multilingual Improvements : Significantly better performance on non‑English languages.

Safety and Availability

Text and image features are immediately available to free and Plus ChatGPT users with limits five times higher than previous versions; voice mode will enter an alpha test for Plus users in coming weeks. API users can access text and visual capabilities, with audio/video initially limited to a small set of partners.

OpenAI acknowledges new risks from real‑time audio and visual inputs and is restricting certain voice outputs to specific synthetic voices to mitigate impersonation abuse.

Compatibility and Integration

API access allows developers to embed GPT‑4o’s capabilities into applications.

Supported on OpenAI Playground, ChatGPT web UI, and upcoming macOS desktop client.

Comparison with Competitors

Benchmark tests show GPT‑4o outperforming GPT‑4T, Claude 3 Opus, Gemini Pro 1.5, Gemini Ultra 1.0, and Llama 3 400B on text, math, and coding evaluations.

User Benefits

More natural, multimodal interaction.

Reduced costs and faster responses.

Versatile tool for customer service, content creation, and data analysis.

Future Outlook

OpenAI plans to expand voice and video capabilities, integrate with Apple devices, and continue refining safety measures. CEO Sam Altman described the new modes as the best computer interface he’s experienced, likening it to the AI in the movie "Her," while noting that hallucinations remain a challenge.

Author: 校长 References: https://blog.samaltman.com/gpt-4o https://www.cmswire.com/digital-marketing/openais-gpt4o-smarter-faster-and-it-speaks/ https://woy.ai/p/GPT4o https://www.theregister.com/2024/05/13/openai_gpt4o/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Multimodal AI OpenAI AI model vision-language Audio Processing GPT-4o

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.