Machine Heart
Jun 24, 2026 · Artificial Intelligence
From Pixels to Words: A Native Vision-Language Model Unifies Images and Video
The paper introduces NEO‑ov, a native vision‑language model that discards external visual encoders and feeds raw pixels directly into a unified transformer. It demonstrates competitive performance on image, multi‑image, and video tasks, including fine‑grained perception and spatial reasoning, and outlines the model's three‑stage training pipeline and current limitations.
Qwen · benchmark · multimodal
13 min read
