
Microsoft Announces Multimodal GPT-4: A New ‘iPhone Moment’ for AI

Microsoft Germany's CTO announced the imminent release of a multimodal GPT‑4, highlighting its ability to process text, images and video, while executives liken the breakthrough to an “iPhone moment” for AI, emphasizing new capabilities, industry disruption, and responsible data use.

21CTO

Mar 11, 2023


Multimodal GPT-4 Announcement

On March 9, Microsoft Germany CTO Andreas Braun announced at the “AI in Focus – Digital Kickoff” event that GPT-4 would be released the following week with multimodal capabilities. The announcement followed the earlier launch of Kosmos-1 and ongoing fine-tuning with OpenAI.

Fortune reported that the beta version of GPT-4 at OpenAI is a stronger large language model whose gains come from improving other aspects rather than from dramatically increasing the parameter count, and that the company is also developing a text-to-video AI model.

In January, OpenAI CEO Sam Altman dismissed rumors of a 100‑trillion‑parameter GPT-4, suggesting the next model will seek enhancements beyond sheer size.

Shift to Multimodal, Disruptive Impact

Braun described multimodal GPT-4 as a “game‑changing” model that can understand and generate across languages and modalities, including video, making AI “comprehensive.” The move to multimodal inputs and outputs is expected to be highly disruptive, building on earlier work such as DALL‑E 2 and CLIP.

Microsoft’s own multimodal model, Kosmos-1, can process text and images simultaneously, and the company plans to add audio and video in the future.

On March 8, Microsoft unveiled Visual ChatGPT, which integrates various visual foundation models so that users can:

Send and receive both language and images.

Address complex visual questions or editing instructions requiring multiple AI models and steps.

Provide feedback and request corrections.

Researchers noted in an arXiv pre‑print that while ChatGPT excels at language interaction, it cannot yet process or generate images, whereas visual models like Visual Transformers or Stable Diffusion are specialized “experts” with fixed input‑output formats.

A New “iPhone Moment” for AI

During the event, Braun and Microsoft Germany CEO Marianne Janik called the multimodal breakthrough an “iPhone moment.” They emphasized that AI will create value rather than replace jobs, and that companies should build internal capability centers to train staff and pool project ideas.

Janik stressed the need for experts to realize AI’s value, the emergence of new roles, and Microsoft’s policy of not using customer data to train its models, while also promoting democratization of AI through Azure, Outlook, and Teams.

The capabilities of the upcoming GPT-4 remain eagerly anticipated.


