From Pixels to Words: A Native Vision-Language Model Unifies Images and Video
The paper introduces NEO‑ov, a native vision‑language model that drops the external visual encoder and feeds raw pixels directly into a unified transformer. It reports competitive performance on image, multi‑image, and video tasks, including fine‑grained perception and spatial reasoning, and describes the model's three‑stage training pipeline and current limitations.
Most mainstream vision‑language models (VLMs) follow a modular pipeline: a pretrained visual encoder (e.g., CLIP, SigLIP) compresses images into features, which are then projected into a large language model. According to the paper, this "encoder + projection + LLM" design loses fine visual detail, limits flexibility across single‑image, multi‑image, and video inputs, incurs high computational cost for high‑resolution or long‑video data, and makes scaling cumbersome.
NEO‑ov challenges this design by removing the external visual encoder entirely. Raw pixels are first processed by two lightweight convolutional layers with GELU activation, followed by two down‑sampling steps. Each resulting 32×32 patch becomes a visual token. The visual tokens are wrapped with delimiter tags and concatenated with text tokens, forming a single sequence that is fed into a unified transformer.
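As a rough illustration of the arithmetic behind this tokenization, the sketch below counts the visual tokens an image produces when each 32×32 pixel patch becomes one token. The function names and the round-up handling of sizes that are not multiples of 32 are assumptions made for illustration; the source does not say how such sizes are handled.

```python
import math
from typing import Tuple

PATCH_SIZE = 32  # pixels per patch side, as stated in the text


def visual_token_grid(height: int, width: int, patch: int = PATCH_SIZE) -> Tuple[int, int]:
    """Return (rows, cols) of the patch grid for an image.

    Assumption: sizes that are not multiples of `patch` are rounded up
    (for example by padding). The source does not specify this.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    return math.ceil(height / patch), math.ceil(width / patch)


def visual_token_count(height: int, width: int, patch: int = PATCH_SIZE) -> int:
    """Number of visual tokens: one token per patch in the grid."""
    rows, cols = visual_token_grid(height, width, patch)
    return rows * cols
```

Under this sketch a 256×256 image maps to an 8×8 grid of 64 tokens, so the token count grows quadratically with the image side length.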
The transformer’s attention heads are explicitly split into three branches: a temporal (T) branch that inherits the original LLM’s sequence modeling, and two spatial branches (H and W) that encode the height and width dimensions of visual tokens. Native‑RoPE further decouples time and space by assigning only temporal indices to text tokens, while visual tokens share a common temporal index and receive additional H/W positional indices.
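The index assignment described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the segment encoding, the function name `assign_positions`, the use of `None` for the missing H/W indices of text tokens, and the rule that the temporal index advances by one after each image are all assumptions.

```python
from typing import List, Optional, Tuple, Union

# Hypothetical segment encoding: an int is a run of that many text tokens,
# a (rows, cols) tuple is one image or frame laid out as a patch grid.
Segment = Union[int, Tuple[int, int]]
Position = Tuple[int, Optional[int], Optional[int]]  # (t, h, w)


def assign_positions(segments: List[Segment]) -> List[Position]:
    """Assign (t, h, w) indices to every token in a mixed sequence."""
    positions: List[Position] = []
    t = 0
    for seg in segments:
        if isinstance(seg, int):
            # Text tokens carry only a temporal index, one step per token.
            for _ in range(seg):
                positions.append((t, None, None))
                t += 1
        else:
            rows, cols = seg
            # All tokens of one visual unit share the same temporal index
            # and are told apart by their height/width indices.
            for h in range(rows):
                for w in range(cols):
                    positions.append((t, h, w))
            # Assumption: the whole image consumes a single temporal step.
            t += 1
    return positions
```

For example, `assign_positions([2, (2, 2), 1])` gives two text positions, four visual positions that share one temporal index, and a final text position at the next temporal step.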
Single‑image, multi‑image, and video inputs are handled uniformly. Each image or video frame is treated as an independent visual unit inserted into the sequence at its appropriate position. Video frames receive timestamps and a global prefix that records video length, frame count, and sampling rate, effectively representing a video as an ordered list of images. Multi‑image inputs require no extra cross‑image modules; the same attention mechanism naturally captures inter‑image relationships.
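To make the "video as an ordered list of images" idea concrete, here is a small sketch that lays out a sequence with a global prefix and per-frame timestamps. The marker strings, the function name `build_video_sequence`, and the uniform frame sampling are hypothetical; the source only states that the prefix records video length, frame count, and sampling rate, and that frames receive timestamps.

```python
from typing import List


def build_video_sequence(duration_s: float, num_frames: int, question: str) -> List[str]:
    """Lay out a video as prefix + (timestamp, frame) pairs + text.

    Assumption: frames are sampled uniformly, so the sampling rate is
    num_frames / duration_s. All marker formats are placeholders.
    """
    if duration_s <= 0 or num_frames <= 0:
        raise ValueError("duration and frame count must be positive")
    fps = num_frames / duration_s
    sequence = [f"[video: {duration_s:.1f}s, {num_frames} frames, {fps:.2f} fps]"]
    for i in range(num_frames):
        timestamp = i / fps
        sequence.append(f"[t={timestamp:.1f}s]")
        # Placeholder for the visual tokens of frame i; in the model each
        # frame would be tokenized exactly like a standalone image.
        sequence.append(f"<frame {i}>")
    sequence.append(question)
    return sequence
```

Because each frame occupies its own slot in the sequence, the same layout also covers multi-image inputs: dropping the prefix and timestamps leaves an interleaved list of images and text.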
Training proceeds in three progressive stages. Stage 1 (pre‑training) uses ~20 M image‑text pairs—including descriptive captions and OCR data—to align the new visual modules with the language space while preserving the model’s original language capabilities. Stage 2 (mid‑training) scales up to ~60 M multimodal samples, increasing image resolution from 256² to 4096², extending video length to 128 frames, and expanding context length from 16 K to 36 K, thereby strengthening high‑resolution perception and spatio‑temporal reasoning. Stage 3 (supervised fine‑tuning) employs ~6 M high‑quality instruction examples covering visual QA, OCR, fine‑grained perception, temporal reasoning, mathematics, and complex dialogue to further boost overall performance.
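A back-of-the-envelope check helps relate the Stage 2 numbers to one another. The sketch below is an assumption-laden estimate, not something the paper reports: it assumes one token per 32×32 patch, square frames, an equal token budget per frame, and it ignores timestamps, tags, and prompt text unless `reserved_tokens` is set. The function names are hypothetical.

```python
import math

PATCH_SIZE = 32  # pixels per patch side, as stated earlier in the text


def tokens_per_image(side_px: int, patch: int = PATCH_SIZE) -> int:
    """Visual tokens for a square image, rounding the grid up."""
    per_side = math.ceil(side_px / patch)
    return per_side * per_side


def max_square_frame_side(context_tokens: int, num_frames: int,
                          reserved_tokens: int = 0,
                          patch: int = PATCH_SIZE) -> int:
    """Largest square frame side (pixels) that fits an equal per-frame budget."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    budget = (context_tokens - reserved_tokens) // num_frames
    if budget < 1:
        return 0
    # isqrt gives the largest whole number of patches per side.
    return math.isqrt(budget) * patch
```

Under these assumptions a single 4096² image yields 16,384 visual tokens, and 128 frames in a 36 K window leave room for frames of roughly 512×512 pixels, whether K is read as 1000 or 1024.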
Built on the Qwen3‑1.7B and Qwen3‑8B backbones, NEO‑ov is released in two sizes: NEO‑ov 2B and NEO‑ov 9B.
Benchmark results show that NEO‑ov sets new state‑of‑the‑art performance among native VLMs on image understanding tasks such as MMMU, HallusionBench, and InfoVQA, surpassing prior works like EVE, Mono‑InternVL, OneCAT, and SAIL. On video and multi‑image benchmarks (VideoMME, MVBench, MLVU, BLINK, MUIRBench, LongVideoBench) it competes head‑to‑head with top modular models such as InternVL3.5 and Qwen3‑VL. Notably, on spatial‑intelligence suites (ViewSpatial, 3DSR, SPAR) it outperforms specialized models including Cambrian‑S, Sensenova‑SI, and GeoThinker.
The authors attribute the spatial advantage to two factors: (1) the Pre‑Buffer mechanism preserves richer pixel‑pixel and pixel‑token interactions than compressed encoder representations, and (2) early cross‑modal interaction (at the patch level) provides more informative cues for spatial reasoning. Progressive training yields steady gains across all model sizes, with the smaller 2B model benefiting most.
Remaining shortcomings are acknowledged: on certain single‑image and video benchmarks NEO‑ov still trails models like Qwen3‑VL, likely due to limited training data scale and quality. OCR and document understanding remain weaker because the model lacks dedicated OCR pre‑training. The authors suggest that larger scales, richer data, and longer context windows could further close these gaps.
In summary, NEO‑ov demonstrates that a fully native, end‑to‑end vision‑language architecture, without handcrafted encoders, adapters, or post‑hoc fusion, can compete with the best modular systems, offering a promising direction for unified multimodal intelligence.
