
Understanding the GIT Image‑to‑Text Model: Architecture, Examples, and Performance Comparison

This article introduces the GIT image-to-text (image captioning) model, explains its Transformer-based architecture, walks through several example outputs, discusses training details, compares its performance with Flamingo and CoCa, and highlights its applicability to tasks such as VQA, video captioning, and image classification.

DataFunSummit

Guest Speaker: Wang Jianfeng, PhD, Microsoft (Principal Researcher, Cloud & AI).

Introduction: The GIT model is an image‑to‑text (image captioning) system built on a Transformer with self‑attention, capable of generating natural language descriptions directly from images without separate OCR.

Example Demonstrations:

Example 1 shows a cartoon scene where the model correctly identifies the characters and their dialogue.

Example 2 is a screenshot of a mobile phone; the model generates a complete sentence describing the time and date, proving it can read text without OCR.

Example 3 depicts a supermarket price tag; the model accurately extracts the price ($14.88) and currency.

Example 4 uses a manually created image with irregular text; the model still recognises the background and the two lines of text.

Example 5 contains artistic‑style lettering; most characters are correctly recognised except a single mis‑identified letter in the word “Markov”.

Example 6 shows text wrapped around a coin; the model successfully reads the non‑standard layout, confirming it does not rely on pre‑processing image rectification.

Model Architecture: The system consists of an Image Encoder and a Text Decoder. The Image Encoder can be a CNN (producing an N×M grid of feature maps) or a Transformer-based encoder (producing token sequences). The Text Decoder is a standard Transformer; cross-attention to the image features was also evaluated, but simply concatenating the image tokens with the text tokens under self-attention performed better for this multimodal task.

The Image Encoder employed is the Florence/CoSwin encoder, a contrastively pre-trained model similar to CLIP but with a modified contrastive objective to reduce false positives.

The Text Decoder is randomly initialised; initialising it from a pre-trained language model (e.g., BERT) did not improve results, likely because the textual output in vision-language tasks is short.
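One concrete way to wire image tokens and caption tokens into a single decoder, as described above, is an attention mask in which image tokens attend bidirectionally to one another while text tokens attend to all image tokens and only to preceding text tokens. A minimal NumPy sketch (the function name and sizes are illustrative, not from the talk):

```python
import numpy as np

def git_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean mask: True where attention is allowed.

    Image tokens (the first n_image positions) attend to every
    image token; text tokens attend to every image token plus
    text tokens at or before their own position (causal).
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_image, :n_image] = True                      # image -> image, bidirectional
    mask[n_image:, :n_image] = True                      # text  -> image
    mask[n_image:, n_image:] = np.tril(                  # text  -> earlier text
        np.ones((n_text, n_text), dtype=bool))
    return mask

mask = git_attention_mask(n_image=3, n_text=2)
assert not mask[:3, 3:].any()  # image tokens never peek at text tokens
```

Because generation is causal only over the text positions, the same mask serves both training (teacher forcing) and step-by-step decoding.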

Training uses a token‑by‑token generation objective with Cross‑Entropy loss. Despite being designed for image captioning, the model also works for VQA (question‑answer pairs) and video captioning (encoding six frames with temporal embeddings).
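The token-by-token objective above is ordinary teacher-forced language modeling: at each position the decoder's logits are scored against the next ground-truth caption token. A small self-contained sketch of that loss (NumPy stand-in for a framework's cross-entropy):

```python
import numpy as np

def caption_loss(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Mean per-token cross-entropy for teacher-forced captioning.

    logits:     (seq_len, vocab_size) unnormalized scores; the row
                at position t predicts target_ids[t].
    target_ids: (seq_len,) ground-truth caption token ids.
    """
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each ground-truth token
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    return float(nll.mean())

# uniform logits over a 10-token vocabulary -> loss is ln(10)
loss = caption_loss(np.zeros((4, 10)), np.array([1, 2, 3, 4]))
```

For VQA the question tokens are fed as a prefix and only the answer tokens contribute to the loss; for video, frame features (with temporal embeddings) replace the single image's tokens, while the text-side objective stays identical.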

Performance Comparison: Compared with models such as Flamingo and CoCa, GIT is much smaller yet achieves competitive or superior results on several tasks, especially on the TextCaps benchmark, where it surpasses human performance. It also performs well on image classification without needing a predefined class vocabulary, and on scene-text recognition, with an average accuracy of 92.9%.
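Vocabulary-free classification works because the decoder simply generates the class name as text instead of picking an index from a fixed label set. A toy sketch of that idea, with a stub `step_fn` standing in for the real decoder (all names here are ours, for illustration only):

```python
import numpy as np

def generate_label(step_fn, bos_id: int, eos_id: int, max_len: int = 8):
    """Greedy decoding: the predicted class is whatever token
    sequence the captioning decoder emits, so no fixed label
    set is required. step_fn(prefix) returns next-token logits."""
    tokens = [bos_id]
    for _ in range(max_len):
        next_id = int(np.argmax(step_fn(tokens)))
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the BOS token

# stub decoder that deterministically "spells out" tokens 5, 7, then EOS (0)
def stub(prefix):
    script = {1: 5, 2: 7, 3: 0}
    logits = np.zeros(10)
    logits[script[len(prefix)]] = 1.0
    return logits

assert generate_label(stub, bos_id=1, eos_id=0) == [5, 7]
```

Decoding the emitted ids back through the tokenizer yields the class name directly, which is what lets the same captioning model classify images whose labels were never enumerated in advance.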

Conclusions: GIT demonstrates strong image‑to‑text generation, OCR‑like text recognition, and versatility across vision‑language tasks, establishing new SOTA on 12 benchmarks and offering a vocabulary‑free classification approach.

multimodal AI, Transformer, model comparison, Vision-Language, image captioning, GIT model
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
