
Understanding the GIT Image‑to‑Text Model: Architecture, Examples, and Performance Comparison

This article introduces the GIT image-to-text (image captioning) model, explains its Transformer-based architecture, walks through several example outputs, discusses training details, compares its performance with Flamingo and CoCa, and highlights its applicability to tasks such as VQA, video captioning, and image classification.

DataFunSummit

Guest Speaker: Wang Jianfeng, PhD, Microsoft (Principal Researcher, Cloud & AI).

Introduction: The GIT model is an image‑to‑text (image captioning) system built on a Transformer with self‑attention, capable of generating natural language descriptions directly from images without separate OCR.

Example Demonstrations:

Example 1 shows a cartoon scene where the model correctly identifies the characters and their dialogue.

Example 2 is a screenshot of a mobile phone; the model generates a complete sentence describing the time and date, proving it can read text without OCR.

Example 3 depicts a supermarket price tag; the model accurately extracts the price ($14.88) and currency.

Example 4 uses a manually created image with irregular text; the model still recognises the background and the two lines of text.

Example 5 contains artistic‑style lettering; most characters are correctly recognised except a single mis‑identified letter in the word “Markov”.

Example 6 shows text wrapped around a coin; the model successfully reads the non‑standard layout, confirming it does not rely on pre‑processing image rectification.

Model Architecture: The system consists of an Image Encoder and a Text Decoder. The Image Encoder can be a CNN (producing an N×M grid of feature maps) or a Transformer-based encoder (producing token sequences). The Text Decoder is a standard Transformer; cross-attention to the image features was also evaluated, but simply concatenating the image tokens with the text tokens under self-attention performed better for this multimodal task.

The Image Encoder employed is the Florence/CoSwin encoder, a contrastively pre-trained model similar to CLIP but with a modified contrastive objective to reduce false positives.

The Text Decoder is randomly initialised; initialising it from a pre-trained language model (e.g., BERT) did not improve results, likely because the textual output in vision-language tasks is short.
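One concrete way to wire image tokens and caption tokens into a single decoder, as described above, is an attention mask in which image tokens attend bidirectionally to one another while text tokens attend to all image tokens and only to preceding text tokens. A minimal NumPy sketch (the function name and sizes are illustrative, not from the talk):

```python
import numpy as np

def git_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean mask: True where attention is allowed.

    Image tokens (the first n_image positions) attend to every
    image token; text tokens attend to every image token plus
    text tokens at or before their own position (causal).
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_image, :n_image] = True                      # image -> image, bidirectional
    mask[n_image:, :n_image] = True                      # text  -> image
    mask[n_image:, n_image:] = np.tril(                  # text  -> earlier text
        np.ones((n_text, n_text), dtype=bool))
    return mask

mask = git_attention_mask(n_image=3, n_text=2)
assert not mask[:3, 3:].any()  # image tokens never peek at text tokens
```

Because generation is causal only over the text positions, the same mask serves both training (teacher forcing) and step-by-step decoding.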

Training uses a token‑by‑token generation objective with Cross‑Entropy loss. Despite being designed for image captioning, the model also works for VQA (question‑answer pairs) and video captioning (encoding six frames with temporal embeddings).
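The token-by-token objective above is ordinary teacher-forced language modeling: at each position the decoder's logits are scored against the next ground-truth caption token. A small self-contained sketch of that loss (NumPy stand-in for a framework's cross-entropy):

```python
import numpy as np

def caption_loss(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Mean per-token cross-entropy for teacher-forced captioning.

    logits:     (seq_len, vocab_size) unnormalized scores; the row
                at position t predicts target_ids[t].
    target_ids: (seq_len,) ground-truth caption token ids.
    """
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each ground-truth token
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    return float(nll.mean())

# uniform logits over a 10-token vocabulary -> loss is ln(10)
loss = caption_loss(np.zeros((4, 10)), np.array([1, 2, 3, 4]))
```

For VQA the question tokens are fed as a prefix and only the answer tokens contribute to the loss; for video, frame features (with temporal embeddings) replace the single image's tokens, while the text-side objective stays identical.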

Performance Comparison: Compared with models such as Flamingo and CoCa, GIT is much smaller yet achieves competitive or superior results on several tasks, especially on the TextCaps benchmark, where it surpasses human performance. It also performs well on image classification without needing a predefined class vocabulary, and on scene-text recognition, with an average accuracy of 92.9%.
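Vocabulary-free classification works because the decoder simply generates the class name as text instead of picking an index from a fixed label set. A toy sketch of that idea, with a stub `step_fn` standing in for the real decoder (all names here are ours, for illustration only):

```python
import numpy as np

def generate_label(step_fn, bos_id: int, eos_id: int, max_len: int = 8):
    """Greedy decoding: the predicted class is whatever token
    sequence the captioning decoder emits, so no fixed label
    set is required. step_fn(prefix) returns next-token logits."""
    tokens = [bos_id]
    for _ in range(max_len):
        next_id = int(np.argmax(step_fn(tokens)))
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the BOS token

# stub decoder that deterministically "spells out" tokens 5, 7, then EOS (0)
def stub(prefix):
    script = {1: 5, 2: 7, 3: 0}
    logits = np.zeros(10)
    logits[script[len(prefix)]] = 1.0
    return logits

assert generate_label(stub, bos_id=1, eos_id=0) == [5, 7]
```

Decoding the emitted ids back through the tokenizer yields the class name directly, which is what lets the same captioning model classify images whose labels were never enumerated in advance.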

Conclusions: GIT demonstrates strong image‑to‑text generation, OCR‑like text recognition, and versatility across vision‑language tasks, establishing new SOTA on 12 benchmarks and offering a vocabulary‑free classification approach.

multimodal AI, Transformer, model comparison, Vision-Language, image captioning, GIT model
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
