Vision Banana: Turning Image Generation Models into Generalist Vision Learners

Vision Banana shows that large-scale image-generation models can be instruction-tuned to perform zero-shot visual-understanding tasks such as semantic segmentation, instance segmentation, depth estimation, and surface-normal estimation, matching or surpassing specialist state-of-the-art results while preserving their original generative capabilities.

CodeTrend

Jun 12, 2026

Overview

The paper argues that the long‑standing divide between generative ("painting") and discriminative ("understanding") models in computer vision is unnecessary. By lightly instruction‑tuning the image‑generation model Nano Banana Pro (NBP), Vision Banana demonstrates that a single model can excel at multiple visual‑understanding tasks without architectural changes.

Background and Motivation

Traditional visual representation learning relies on supervised classification, contrastive learning, self‑distillation, or auto‑encoding, all of which produce task‑specific discriminative features. Recent generative models (e.g., FLUX, Gemini Imagen) have shown remarkable ability to synthesize high‑fidelity images with precise semantic control, suggesting they already encode rich visual knowledge.

Two observations motivate Vision Banana:

State‑of‑the‑art generators can produce visualizations that resemble outputs of vision tasks, but lack precise formatting for quantitative evaluation.

Fine‑tuning specialist models on specific tasks yields high performance but sacrifices generality.

Vision Banana adopts a third path: lightweight instruction tuning.

Methodology

Base model: Nano Banana Pro, a Google DeepMind diffusion model trained on massive image data and known for strong generative abilities.

Instruction-tuning strategy: Mix a tiny fraction of task-specific data into the original training mix and train jointly. This preserves generation quality while teaching the model to output task results as RGB images.

Key advantages:

Unified weights – the same model handles all tasks by changing the prompt.

Data efficiency – only a small amount of task data is needed.

Capability retention – generation quality is not degraded.

Unified output format: Every task is expressed as an RGB image that can be deterministically decoded back to the original task output (semantic mask, depth map, normal map, etc.).

RGB Bijection Schemes

Depth estimation uses a two-step encoding: a power transform f(d, λ, c) = 1 − (1 − d/(λ·c))^(λ+1) with λ = −3 and c = 10/3, which simplifies to f(d) = 1 − (1 + d/10)^(−2), followed by interpolation along the RGB cube edges. Both steps are monotonic, so the mapping is reversible.
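A minimal sketch of this depth encoding, under stated assumptions. The power transform and its constants come from the formula above; the waypoint order is an assumption read off the rainbow palette named in the depth prompt later in the article, and the equal-length segments, rounding, and nearest-segment decoding are illustrative choices rather than the paper's confirmed procedure. All function names are hypothetical.

```python
import math

LAM = -3.0          # lambda from the article
C = 10.0 / 3.0      # c from the article

# Assumed waypoint order: black, red, yellow, green, cyan, blue, violet, white.
# Consecutive colors differ in exactly one channel, so each segment is an
# edge of the RGB cube.
WAYPOINTS = [
    (0, 0, 0), (255, 0, 0), (255, 255, 0), (0, 255, 0),
    (0, 255, 255), (0, 0, 255), (255, 0, 255), (255, 255, 255),
]


def depth_to_unit(d, lam=LAM, c=C):
    """Power transform f(d) = 1 - (1 - d/(lam*c))**(lam+1); maps [0, inf) to [0, 1)."""
    return 1.0 - (1.0 - d / (lam * c)) ** (lam + 1.0)


def unit_to_depth(t, lam=LAM, c=C):
    """Algebraic inverse of depth_to_unit, valid for 0 <= t < 1."""
    return lam * c * (1.0 - (1.0 - t) ** (1.0 / (lam + 1.0)))


def unit_to_rgb(t):
    """Walk a fraction t along the waypoint path (segments weighted equally)."""
    t = min(max(t, 0.0), 1.0)
    n = len(WAYPOINTS) - 1
    i = min(int(t * n), n - 1)
    frac = t * n - i
    a, b = WAYPOINTS[i], WAYPOINTS[i + 1]
    return tuple(round(a[k] + (b[k] - a[k]) * frac) for k in range(3))


def rgb_to_unit(rgb):
    """Project a color onto the nearest path segment and return its path fraction."""
    n = len(WAYPOINTS) - 1
    best_t, best_dist = 0.0, float("inf")
    for i in range(n):
        a, b = WAYPOINTS[i], WAYPOINTS[i + 1]
        ab = [b[k] - a[k] for k in range(3)]
        ap = [rgb[k] - a[k] for k in range(3)]
        frac = sum(x * y for x, y in zip(ab, ap)) / sum(x * x for x in ab)
        frac = min(max(frac, 0.0), 1.0)
        dist = sum((ap[k] - ab[k] * frac) ** 2 for k in range(3))
        if dist < best_dist:
            best_t, best_dist = (i + frac) / n, dist
    return best_t


color = unit_to_rgb(depth_to_unit(5.0))        # encode a 5 m depth as a color
recovered = unit_to_depth(rgb_to_unit(color))  # decode the color back to meters
print(color, recovered)
```

Projecting onto the nearest segment, rather than requiring an exact match, lets the decoder tolerate colors that a generator paints slightly off the path. Because the transform flattens as depth grows, one 8-bit step covers a larger distance far from the camera than close to it.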

Surface normals map directly: R = trunc((1‑x)/2·255), G = trunc((1+y)/2·255), B = trunc((1+z)/2·255).
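The normal mapping above can be sketched as follows. The encoder follows the stated formulas literally; the decoder, including the renormalization step and both function names, is an assumption about how one would invert it, not something the article specifies.

```python
import math


def encode_normal(x, y, z):
    """Map a unit normal to 8-bit RGB using the article's formulas (note the flipped x)."""
    return (
        math.trunc((1 - x) / 2 * 255),
        math.trunc((1 + y) / 2 * 255),
        math.trunc((1 + z) / 2 * 255),
    )


def decode_normal(r, g, b):
    """Invert the mapping and renormalize to absorb truncation error (assumed step)."""
    x = 1 - 2 * r / 255
    y = 2 * g / 255 - 1
    z = 2 * b / 255 - 1
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0:
        return (0.0, 0.0, 0.0)
    return (x / norm, y / norm, z / norm)


rgb = encode_normal(0.0, 0.0, 1.0)
print(rgb, decode_normal(*rgb))
```

Truncation means the round trip is only approximate: a component of exactly 0 encodes to 127 and decodes to a value slightly off zero.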

Segmentation encodes each class with a user‑specified color; instance segmentation assigns unique colors automatically and extracts masks via color clustering.
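One plausible way to turn a generated semantic-segmentation image back into labels is nearest-color matching against the prompt's palette. This is a sketch under that assumption; the article does not state the exact decoding rule, and `decode_semantic` is a hypothetical helper. The palette values are the ones from the article's prompt example.

```python
def decode_semantic(pixels, palette):
    """Assign each RGB pixel the class whose palette color is closest (squared distance).

    pixels: list of rows, each a list of (r, g, b) tuples.
    palette: dict mapping class name to its (r, g, b) color.
    """
    def nearest(rgb):
        return min(
            palette,
            key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, palette[name])),
        )

    return [[nearest(p) for p in row] for row in pixels]


palette = {
    "cat ears": (255, 165, 0),
    "exit sign": (0, 0, 255),
    "background": (125, 0, 125),
}
# Slightly off-palette colors, as a generator might produce them.
image = [[(250, 160, 10), (120, 5, 130)], [(3, 2, 250), (125, 0, 125)]]
labels = decode_semantic(image, palette)
print(labels)
```

Nearest-color matching tolerates small color drift in the generated image, which exact equality would not.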

2D Understanding Results (Zero‑Shot)

Semantic segmentation (Cityscapes): Vision Banana achieves 69.9 mIoU, surpassing SAM 3 (65.2 mIoU) and outperforming all zero-shot baselines.

Instance segmentation (SA-Co/Gold): With Gemini 3.1 Flash-Lite for positive-sample detection, Vision Banana reaches 47.5 cgF1 and 0.84 IL_MCC, close to SAM 3 + Llama 3.2 (0.86 IL_MCC) and far above OWLv2.

Referring expression segmentation: Vision Banana scores 73.8 cIoU on RefCOCOg, exceeding SAM 3 + Gemini 2.5 Pro (73.4 cIoU) and achieving 79.3 gIoU on ReasonSeg when paired with Gemini 2.5 Pro.

Depth estimation (four benchmarks): Average δ₁ = 0.929, beating Depth Anything V3 (0.918) and UniK3D (0.823).

Surface normal estimation (three indoor datasets): Mean angular error = 15.7°, better than Lotus-2 (17.3°) and comparable to specialized methods.
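For readers unfamiliar with the two geometry metrics quoted above, here is a small sketch using their standard definitions: δ₁ is the fraction of pixels whose predicted and true depths are within a ratio of 1.25 of each other, and mean angular error is the average angle between predicted and true normals. The function names are hypothetical, and the paper's exact evaluation protocol (masking, cropping, any alignment) is not described in this article.

```python
import math


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below the threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]  # skip pixels without ground truth
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)


def mean_angular_error_deg(pred, gt):
    """Average angle in degrees between corresponding predicted and true normals."""
    angles = []
    for p, g in zip(pred, gt):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        angles.append(math.degrees(math.acos(cos)))
    return sum(angles) / len(angles)


d1 = delta1([1.0, 2.0, 4.0, 10.0], [1.1, 2.0, 3.0, 13.0])
mae = mean_angular_error_deg([(0, 0, 1), (1, 0, 0)], [(0, 0, 1), (0, 1, 0)])
print(d1, mae)
```

Higher is better for δ₁ and lower is better for angular error.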

Prompt example – semantic segmentation
"Generate a visualization image of semantic segmentation, using this color mapping: {
  "cat ears": <255,165,0>,
  "exit sign": <0,0,255>,
  "background": <125,0,125>
}"
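Since the class-to-color mapping is supplied by the user, a prompt like the one above can be assembled programmatically. The helper below is a hypothetical convenience that reproduces the layout of the example; whether the model is sensitive to this exact formatting is not stated in the article.

```python
def build_semantic_prompt(color_map):
    """Format a class -> (r, g, b) mapping in the layout of the article's example prompt."""
    entries = ",\n".join(
        f'  "{name}": <{r},{g},{b}>' for name, (r, g, b) in color_map.items()
    )
    return (
        "Generate a visualization image of semantic segmentation, "
        "using this color mapping: {\n" + entries + "\n}"
    )


prompt = build_semantic_prompt({
    "cat ears": (255, 165, 0),
    "exit sign": (0, 0, 255),
    "background": (125, 0, 125),
})
print(prompt)
```

Keeping the same dictionary for both prompting and decoding means the colors the model is asked to paint are exactly the ones the decoder looks for.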

Prompt example – metric depth estimation
"Predict the metric depth of this scene as an image. Visualized in the rainbow (black‑red‑yellow‑green‑cyan‑blue‑violet‑white) color palette."

3D Reasoning

Vision Banana infers metric depth and surface normals from a single RGB image without using camera intrinsics. On outdoor benchmarks (ETH3D, DIODE, KITTI) it attains δ₁ scores of 0.935, 0.838, and 0.865 respectively, demonstrating absolute‑scale perception learned from generative pre‑training.

A qualitative outdoor test near the Golden Pavilion in Japan shows a predicted depth of 13.71 m versus 12.87 m measured with Google Maps (≈6.5 % relative error).

Generation Retention

On the GenAI-Bench text-to-image benchmark, Vision Banana wins 53.5 % of pairwise comparisons against the base NBP, and on ImgEdit it roughly ties with NBP (47.8 %). This suggests that instruction tuning does not cause catastrophic forgetting.

Limitations

Instance‑segmentation performance still lags behind fully supervised models when large annotated datasets are available.

8-bit RGB encoding limits the precision of continuous outputs such as fine-grained depth.

Inference cost of large diffusion models is substantially higher than specialist encoders, hindering real‑time deployment.

Reproducibility is harder to guarantee because generative sampling is stochastic and can yield different outputs for the same input.

Nano Banana Pro is not open‑source, making exact replication difficult.

Conclusion and Outlook

Vision Banana argues that image-generation pre-training can serve as a universal foundation for visual understanding, mirroring the role of LLM pre-training in language. By treating visual tasks as image-generation problems and using lightweight instruction tuning, a single model can reach state-of-the-art results on diverse benchmarks while retaining generative quality.

Future research directions include expanding to more tasks (optical flow, pose estimation), incorporating multi‑view or video inputs, tighter vision‑language integration, model compression for faster inference, and developing controllable multimodal output mechanisms.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
