Vision Banana: Turning Image Generation Models into Generalist Vision Learners
Vision Banana shows that large-scale image-generation models can be instruction-tuned to perform zero-shot visual-understanding tasks, including semantic segmentation, instance segmentation, depth estimation, and surface-normal estimation, matching or surpassing specialist state-of-the-art (SOTA) results while preserving their original generative capabilities.
