From Pixels to Words: A Native Vision-Language Model Unifies Images and Video

The paper introduces NEO‑ov, a native vision‑language model that discards external visual encoders and feeds raw pixels directly into a unified transformer. It reports competitive performance on image, multi‑image, and video tasks, including fine‑grained perception and spatial reasoning, and describes a three‑stage training pipeline along with the model's current limitations.

Machine Heart

Most mainstream vision‑language models (VLMs) follow a modular pipeline: a pretrained visual encoder (e.g., CLIP, SigLIP) compresses images into features, which are then projected into the input space of a large language model. This "encoder + projection + LLM" design loses fine visual details, limits flexibility across single‑image, multi‑image, and video inputs, incurs high computational cost for high‑resolution or long‑video data, and makes scaling cumbersome.

NEO‑ov challenges this design by removing the external visual encoder entirely. Raw pixels are first processed by two lightweight convolutional layers with GELU activation, followed by two down‑sampling steps. Each resulting 32×32 patch becomes a visual token. The visual tokens are wrapped in delimiter tags and concatenated with text tokens, forming a single sequence that is fed into a unified transformer.
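To make the token layout concrete, here is a minimal sketch of how one image might be turned into a flat sequence, assuming image sides divisible by 32. The tag names `<IMG_START>` and `<IMG_END>` and the `<v r,c>` placeholders are hypothetical stand‑ins, since the article does not name the actual tags, and the convolutional embedding itself is not modeled.

```python
PATCH = 32  # pixels per patch side, as stated in the article


def visual_token_grid(height, width, patch=PATCH):
    """Return (rows, cols) of visual tokens for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("this sketch assumes sides divisible by the patch size")
    return height // patch, width // patch


def build_sequence(text_before, grid, text_after,
                   open_tag="<IMG_START>", close_tag="<IMG_END>"):
    """Concatenate text tokens and tagged visual placeholders into one sequence.

    The tag strings are hypothetical; only the overall layout follows the article.
    """
    rows, cols = grid
    visual = [f"<v{r},{c}>" for r in range(rows) for c in range(cols)]
    return list(text_before) + [open_tag] + visual + [close_tag] + list(text_after)


grid = visual_token_grid(256, 256)           # 8 x 8 patches
seq = build_sequence(["Describe"], grid, ["."])
print(grid, len(seq))
```

A 256×256 image yields an 8×8 grid, so the sequence holds 64 visual tokens plus the two tags and the surrounding text tokens.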

The transformer’s attention heads are explicitly split into three branches: a temporal (T) branch that inherits the original LLM’s sequence modeling, and two spatial branches (H and W) that encode the height and width dimensions of visual tokens. Native‑RoPE further decouples time and space by assigning only temporal indices to text tokens, while visual tokens share a common temporal index and receive additional H/W positional indices.
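The index assignment described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: text tokens are given H = W = 0 to stand for "no spatial index", and the temporal counter advances by one after each visual unit. Neither detail is specified in the article.

```python
def assign_positions(segments):
    """Assign (t, h, w) indices to every token in a mixed sequence.

    segments: list of ("text", n_tokens) or ("image", rows, cols).
    Text tokens advance the temporal index t one step per token.
    All tokens of one image share a single t and differ only in (h, w).
    """
    positions = []
    t = 0
    for seg in segments:
        kind = seg[0]
        if kind == "text":
            for _ in range(seg[1]):
                positions.append((t, 0, 0))  # assumption: 0 means "no spatial index"
                t += 1
        elif kind == "image":
            _, rows, cols = seg
            for h in range(rows):
                for w in range(cols):
                    positions.append((t, h, w))
            t += 1  # assumption: the whole image occupies one temporal step
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return positions


pos = assign_positions([("text", 2), ("image", 2, 2), ("text", 1)])
print(pos)
```

In this layout the T branch sees an ordinary one‑dimensional sequence, while the H and W branches only distinguish tokens inside a visual unit.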

Single‑image, multi‑image, and video inputs are handled uniformly. Each image or video frame is treated as an independent visual unit inserted into the sequence at its appropriate position. Video frames receive timestamps and a global prefix that records video length, frame count, and sampling rate, effectively representing a video as an ordered list of images. Multi‑image inputs require no extra cross‑image modules; the same attention mechanism naturally captures inter‑image relationships.
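As a rough illustration of "a video as an ordered list of images", the sketch below builds a global prefix followed by timestamped frame placeholders. The string formats, the function name, and the uniform sampling are assumptions made for the example; the article only says that the prefix records video length, frame count, and sampling rate.

```python
def video_to_units(duration_s, num_frames):
    """Represent a video as a prefix plus (timestamp, frame) pairs.

    Assumes frames are sampled uniformly; all string formats are hypothetical.
    """
    if duration_s <= 0 or num_frames <= 0:
        raise ValueError("duration and frame count must be positive")
    fps = num_frames / duration_s  # effective sampling rate
    units = [f"[video: {duration_s:.1f}s, {num_frames} frames, {fps:.2f} fps]"]
    for i in range(num_frames):
        units.append(f"[t={i / fps:.1f}s]")  # timestamp token for this frame
        units.append(f"<frame {i}>")         # stands in for the frame's visual tokens
    return units


units = video_to_units(8.0, 4)
for u in units:
    print(u)
```

Each `<frame i>` placeholder would expand into the same tagged patch tokens used for a single image, which is why no video‑specific module is needed in this scheme.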

Training proceeds in three progressive stages. Stage 1 (pre‑training) uses ~20 M image‑text pairs—including descriptive captions and OCR data—to align the new visual modules with the language space while preserving the model’s original language capabilities. Stage 2 (mid‑training) scales up to ~60 M multimodal samples, increasing image resolution from 256² to 4096², extending video length to 128 frames, and expanding context length from 16 K to 36 K, thereby strengthening high‑resolution perception and spatio‑temporal reasoning. Stage 3 (supervised fine‑tuning) employs ~6 M high‑quality instruction examples covering visual QA, OCR, fine‑grained perception, temporal reasoning, mathematics, and complex dialogue to further boost overall performance.
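A back‑of‑the‑envelope calculation shows how the Stage 2 numbers relate to one another. It assumes one token per 32×32 patch, square inputs, and "K" meaning 1024 tokens, and it ignores tag and text overhead; the helper names are made up for this example.

```python
import math

PATCH = 32  # pixels per patch side, as stated in the article


def tokens_for_square(side, patch=PATCH):
    """Visual tokens produced by a square image of the given side length."""
    return (side // patch) ** 2


def max_square_frame_side(context_tokens, num_frames, patch=PATCH):
    """Largest square frame side such that num_frames frames fit in the context."""
    per_frame = context_tokens // num_frames  # token budget per frame
    grid = math.isqrt(per_frame)              # largest g with g * g <= per_frame
    return grid * patch


low = tokens_for_square(256)     # 64 tokens
high = tokens_for_square(4096)   # 16384 tokens
frame_side = max_square_frame_side(36 * 1024, 128)
print(low, high, frame_side)
```

Under these assumptions a single 4096² image alone produces 16,384 tokens, which would fill a 16 K window before any text is added, and 128 frames fit in a 36 K window only at roughly 512×512 pixels per frame. This is one reading of why resolution, video length, and context length are scaled together, not a rationale stated by the authors.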

Built on the Qwen3‑1.7B and Qwen3‑8B backbones, NEO‑ov is released in two sizes: NEO‑ov 2B and NEO‑ov 9B.

Benchmark results show that NEO‑ov sets a new state of the art among native VLMs on image understanding tasks such as MMMU, HallusionBench, and InfoVQA, surpassing prior work such as EVE, Mono‑InternVL, OneCAT, and SAIL. On video and multi‑image benchmarks (VideoMME, MVBench, MLVU, BLINK, MUIRBench, LongVideoBench) it is competitive with top modular models such as InternVL3.5 and Qwen3‑VL. Notably, on spatial‑intelligence suites (ViewSpatial, 3DSR, SPAR) it outperforms specialized models including Cambrian‑S, Sensenova‑SI, and GeoThinker.

The authors attribute the spatial advantage to two factors: (1) the Pre‑Buffer mechanism preserves richer pixel‑pixel and pixel‑token interactions than compressed encoder representations, and (2) early cross‑modal interaction (at the patch level) provides more informative cues for spatial reasoning. Progressive training yields steady gains across all model sizes, with the smaller 2B model benefiting most.

Remaining shortcomings are acknowledged: on certain single‑image and video benchmarks NEO‑ov still trails models like Qwen3‑VL, likely due to limited training data scale and quality. OCR and document understanding remain weaker because the model lacks dedicated OCR pre‑training. The authors suggest that larger scales, richer data, and longer context windows could further close these gaps.

In summary, NEO‑ov demonstrates that a fully native, end‑to‑end vision‑language architecture, without handcrafted encoders, adapters, or post‑hoc fusion, can compete with the best modular systems, offering a promising direction for unified multimodal intelligence.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
