Jun 24, 2026 · Artificial Intelligence

From Pixels to Words: A Native Vision-Language Model Unifies Images and Video

The paper introduces NEO‑ov, a native vision‑language model that discards external visual encoders and feeds raw pixels directly into a unified transformer. It reports competitive performance on image, multi‑image, and video tasks, including fine‑grained perception and spatial reasoning, and it also outlines the model's three‑stage training pipeline and current limitations.

Qwen · benchmark · multimodal



native-model
