Can a Text‑to‑Image Model Replace Traditional Vision Tools? Nano Banana Pro Zero‑Shot Test
This article evaluates Nano Banana Pro, a text-to-image model built on Gemini 3 Pro, across fourteen low-level vision tasks and forty public datasets using prompts alone, with no fine-tuning. The results reveal strong perceptual quality but weak pixel-level metrics, highlighting both the model's generative strengths and its failure modes, such as hallucinations and color shifts.
Background
Since 2023, text-to-image models have shown strong generative ability. Nano Banana Pro, built on Gemini 3 Pro, claims broad world knowledge and high-precision generation without any fine-tuning. The study asks whether a single natural-language prompt can solve classic low-level vision tasks such as dehazing, super-resolution, deraining, shadow removal, and deblurring.
Experiment Design
Fourteen low-level vision tasks covering 40 public datasets were evaluated in a zero-shot setting. The model receives only a fixed prompt describing the desired operation; no gradients or task-specific training are used. Prompt templates:
Image restoration: "Remove the haze/rain/shadow/blur/noise while keeping other elements unchanged."
Image enhancement: "Upscale/low-light enhance/underwater enhance/HDR this image."
Image fusion: "Fuse the multi-focus/IR-visible images."
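The protocol above can be sketched as a small lookup of fixed, task-specific prompts. This is a hypothetical illustration of the zero-shot setup, not the paper's actual harness; the template wording follows the article, while the task keys and `build_request` helper are invented names standing in for whatever image-editing API serves the model.

```python
# Hypothetical sketch of the zero-shot protocol: one fixed prompt per task
# family, identical for every image, with no fine-tuning or per-image tuning.
PROMPT_TEMPLATES = {
    "dehazing":   "Remove the haze while keeping other elements unchanged.",
    "deraining":  "Remove the rain while keeping other elements unchanged.",
    "shadow":     "Remove the shadow while keeping other elements unchanged.",
    "deblurring": "Remove the blur while keeping other elements unchanged.",
    "denoising":  "Remove the noise while keeping other elements unchanged.",
    "super_res":  "Upscale this image.",
    "low_light":  "Low-light enhance this image.",
    "fusion_ir":  "Fuse the IR-visible images.",
}

def build_request(task: str, image_path: str) -> dict:
    """Pair an input image with its fixed, task-specific prompt."""
    if task not in PROMPT_TEMPLATES:
        raise KeyError(f"no template for task: {task}")
    return {"image": image_path, "prompt": PROMPT_TEMPLATES[task]}
```

The point of the fixed template is that nothing task-specific is learned: the same dictionary drives all datasets, so any per-task gap reflects the model itself rather than prompt engineering.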
Results Overview
Visually, Nano Banana Pro produces appealing outputs, but its pixel-level metrics (PSNR, SSIM) are substantially lower than those of specialized models. No-reference perceptual scores are often better (lower NIQE, higher NIMA), reflecting realistic textures and low noise.
Dehazing: clear skies, but often an artificial blue-sky bias; NIMA 5.44 (highest among compared methods).
Super-resolution: strong texture hallucination and field-of-view expansion; PSNR drops by more than 4 dB, yet NIQE reaches 3.52 (best).
Deraining: visually cleaner images, but PSNR ≈ 21 dB on Rain200H, with occasional confusion between rain and fog.
Shadow removal: hard shadows are removed successfully, yet an extra hand is occasionally hallucinated; PSNR 20.67 dB.
Motion deblur: text becomes readable, but faces may be swapped and color shifts appear; PSNR 21.41 dB on GoPro.
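The gap between "looks good" and "scores well" follows directly from how PSNR is defined: it penalizes any pixel-level deviation from the reference, including global shifts a viewer would barely notice. A minimal sketch (standard PSNR formula, synthetic data for illustration only):

```python
import numpy as np

def psnr(reference: np.ndarray, output: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((reference.astype(np.float64) - output.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

# Even a perceptually subtle global brightness/color shift costs PSNR:
ref = np.full((64, 64), 128.0)
shifted = ref + 8.0  # uniform +8 shift, invisible texture-wise
print(round(psnr(ref, shifted), 2))  # → 30.07
```

This is why a generative model that repaints colors or hallucinates plausible texture can score around 20–21 dB while still looking clean: its pixels drift from the ground truth even when the global structure is right, whereas no-reference metrics like NIQE and NIMA never compare against the reference at all.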
Detailed Case Studies
1 Dehazing – “blue‑sky illusion”
Success on heavy haze (RTTS) where distant building details are recovered.
Failure on sunny scenes where the model forces a saturated blue sky.
2 Super‑Resolution – Field‑of‑View expansion
Low NIQE (3.52) and natural denoising.
Failure modes include unintended expansion of the scene beyond ground‑truth and severe text hallucination.
3 Deraining – Rain‑fog ambiguity
Preserves bridge‑cable structures and global semantics.
Sometimes treats fog as rain, leading to pixel‑level deviations.
4 Shadow removal – Unexpected hand
Hard shadows are removed with consistent tone.
Model may hallucinate an extra hand where the shadow was removed.
5 Motion deblur – Identity swap
Low‑light text becomes clear.
Faces can be swapped and color shifts introduced.
Core Conclusion
Generative models act as a “double‑edged sword” for low‑level vision:
Perceptual quality: realistic textures and low noise yield favorable NIQE/NIMA scores, but pixel-level drift reduces PSNR/SSIM.
Semantic consistency: globally reasonable structures, yet identity and text hallucinations are common.
Physical fidelity: no guarantee; color, scale, and illumination are often altered.
Zero-shot generality: the same prompt style works across all 14 tasks, but per-task performance remains below that of dedicated models.
Nano Banana Pro is therefore better viewed as an image "repaint" engine than as a conventional restoration model. Future work needs to constrain its generative freedom to meet strict visual-fidelity requirements.
Resources
https://arxiv.org/pdf/2512.15110
https://github.com/zplusdragon/LowLevelBanana
