Artificial Intelligence 4 min read

How Visual ChatGPT Adds Image Interaction to ChatGPT – A Deep Dive

Microsoft's open‑source Visual ChatGPT extends ChatGPT with the ability to send and receive images. This article covers its multimodal architecture, its demo scenarios, and the visual models it uses, and points to the arXiv paper and the project's rapid rise in popularity on GitHub.

Programmer DD

Mar 19, 2023

Overview

Microsoft recently open‑sourced Visual ChatGPT, a multimodal extension of ChatGPT that can send and receive images during a conversation.

Why it matters

ChatGPT excels at text but cannot exchange images on its own. Visual ChatGPT adds that ability, much like being able to send custom emoji in a chat, which opens up both fun and practical applications.

Architecture

ChatGPT (or any other LLM) acts as a general interface: it handles the conversation with the user and delegates visual tasks to specialized Visual Foundation Models (VFMs). The repository provides diagrams of the system architecture.
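
The delegation pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's actual code: the names VisualTool, Dispatcher, caption_image, and edit_image are hypothetical, and the two tool functions are stand-ins that return strings instead of running real models.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VisualTool:
    """A named wrapper around one visual model (hypothetical structure)."""
    name: str
    description: str            # text an LLM could read to decide when to use the tool
    run: Callable[[str], str]   # takes a text input, returns text or an image path


def caption_image(image_path: str) -> str:
    # Stand-in for a real captioning model: no image is actually read.
    return f"a caption for {image_path}"


def edit_image(args: str) -> str:
    # Stand-in for a real editing model. Input format (an assumption):
    # "<image path>,<instruction>". Only a new file name is produced.
    image_path, _instruction = args.split(",", 1)
    root, ext = os.path.splitext(image_path)
    return f"{root}_edited{ext}"


class Dispatcher:
    """Routes a tool call chosen by the language model to the matching tool."""

    def __init__(self, tools: List[VisualTool]) -> None:
        self.tools: Dict[str, VisualTool] = {tool.name: tool for tool in tools}

    def call(self, tool_name: str, tool_input: str) -> str:
        if tool_name not in self.tools:
            raise KeyError(f"unknown tool: {tool_name}")
        return self.tools[tool_name].run(tool_input)


dispatcher = Dispatcher([
    VisualTool("caption", "Describe what an image shows.", caption_image),
    VisualTool("edit", "Modify an image according to an instruction.", edit_image),
])

print(dispatcher.call("caption", "cat.png"))
print(dispatcher.call("edit", "cat.png,make the cat blue"))
```

In a real system, the language model would choose the tool name and input from the conversation; here they are passed in by hand to keep the sketch self-contained.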

Demo scenarios

The demo showcases three kinds of interaction: receiving an image from the user, modifying an image according to text instructions and sending the result back, and answering questions about an image's content. For each request, the system decides whether it needs to invoke a Visual Foundation Model.
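
To make the per-request decision concrete, here is a small Python sketch. It is an assumption-laden stand-in: in Visual ChatGPT the language model makes this choice, whereas the hypothetical choose_tool function below uses a simple keyword heuristic, and the tool names and the EDIT_WORDS list are invented for illustration.

```python
from typing import Optional

# Hypothetical trigger words for an edit request (illustrative only).
EDIT_WORDS = ("replace", "remove", "change", "make", "turn")


def choose_tool(message: str, image_path: Optional[str]) -> Optional[str]:
    """Return a hypothetical tool name, or None when a text reply is enough."""
    if image_path is None:
        return None                  # no image involved: plain text reply
    text = message.lower().strip()
    if not text:
        return "describe"            # user only uploaded an image
    if text.endswith("?"):
        return "answer_question"     # question about the image
    if any(word in text.split() for word in EDIT_WORDS):
        return "edit"                # instruction to modify the image
    return None


print(choose_tool("Hello there", None))
print(choose_tool("", "cat.png"))
print(choose_tool("What color is the cat?", "cat.png"))
print(choose_tool("Replace the cat with a dog", "cat.png"))
```

The three non-None branches correspond to the three demo interactions; a keyword rule like this is far cruder than a language model's judgment and is only meant to show where the decision sits in the flow.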

Image models and resource usage

The repository lists the visual models used by Visual ChatGPT and their GPU memory consumption.
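
Because each visual model occupies GPU memory, it can help to check in advance which models fit on which device. The Python sketch below shows one simple way to do that, a first-fit placement that handles the largest models first. The function assign_models is hypothetical, and the model names and memory figures in the usage example are made-up placeholders, not the values listed in the repository.

```python
from typing import Dict, List


def assign_models(requirements_mb: Dict[str, int],
                  gpu_capacity_mb: List[int]) -> Dict[str, int]:
    """Place each model on the first GPU with enough free memory.

    Models are considered from largest to smallest. Raises ValueError
    if some model does not fit on any GPU.
    """
    free = list(gpu_capacity_mb)          # remaining memory per GPU
    placement: Dict[str, int] = {}
    ordered = sorted(requirements_mb.items(), key=lambda kv: kv[1], reverse=True)
    for name, need in ordered:
        for gpu_index, available in enumerate(free):
            if need <= available:
                free[gpu_index] -= need
                placement[name] = gpu_index
                break
        else:
            raise ValueError(f"no GPU has {need} MB free for {name}")
    return placement


# Placeholder numbers for illustration only.
example_requirements = {"model_a": 3000, "model_b": 1500, "model_c": 2500}
example_placement = assign_models(example_requirements, [4000, 4000])
print(example_placement)
```

To use it with real figures, replace the placeholder requirements with the memory consumption the repository reports for each model you intend to load.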
