DeepSeek’s New Multimodal Paradigm Compresses Images 7,056× and Outperforms GPT‑5.4/Claude in Visual Reasoning

DeepSeek’s multimodal model, built on the V4‑Flash architecture and a visual‑primitive reasoning approach, compresses a full‑resolution image by 7,056 times, achieves comparable or superior performance to GPT‑5.4 and Claude‑Sonnet‑4.6 on counting and spatial‑reasoning benchmarks, and does so with dramatically lower compute.


DeepSeek recently added a multimodal capability to its V4 model, currently in a limited public beta (a gray release) in which users can switch the chat window to an image‑understanding mode.

The hidden language‑vision gap

Large language models excel at tasks such as OCR, poetry, and math, but they falter when required to reason about complex spatial layouts. The authors describe this as a “referential gap”: language alone is ambiguous when describing positions in an image, causing chain‑of‑thought (CoT) reasoning to collapse.

Teaching AI to point while thinking

DeepSeek’s solution, called “Thinking with Visual Primitives,” turns visual markers—points and bounding boxes—into the basic units of reasoning. By outputting coordinates directly (e.g., drawing a box around a target), the model grounds its chain of logic in concrete spatial references, much like a person uses a finger to count objects.
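The article does not show the exact output syntax, so the format below is an assumption for illustration only: a reasoning trace that interleaves text with <point> and <box> tags in pixel coordinates, plus a tiny parser that recovers the primitives from such a trace.

```python
import re

# Hypothetical output format: the article does not specify the token syntax, so this
# sketch assumes the model interleaves reasoning text with tags such as
# <point>(x, y)</point> and <box>(x1, y1), (x2, y2)</box> in pixel coordinates.
POINT_RE = re.compile(r"<point>\((\d+),\s*(\d+)\)</point>")
BOX_RE = re.compile(r"<box>\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)</box>")

def extract_primitives(response: str):
    """Pull the visual primitives out of a model reasoning trace."""
    points = [(int(x), int(y)) for x, y in POINT_RE.findall(response)]
    boxes = [tuple(map(int, b)) for b in BOX_RE.findall(response)]
    return points, boxes

trace = (
    "I count the dogs one by one: <point>(112, 340)</point>, "
    "<point>(415, 322)</point>, and the one on the sofa "
    "<box>(600, 180), (760, 300)</box>. That makes 3."
)
print(extract_primitives(trace))
# ([(112, 340), (415, 322)], [(600, 180, 760, 300)])
```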

Efficient architecture

The approach is implemented on a compact model named DeepSeek‑V4‑Flash, which employs a compressed sparse‑attention mechanism. A full‑resolution image is reduced to roughly 90 memory slots, a compression of about 7,056× over the raw image, whereas competing top‑tier models typically retain thousands of visual tokens per image.
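The summary does not describe how the compression is implemented, so the following is only a generic sketch of one common way to squeeze thousands of patch tokens into a fixed slot budget (a learned-query cross-attention resampler); the class name, dimensions, slot count, and head count are all assumptions.

```python
import torch
import torch.nn as nn

# Illustrative only: V4-Flash reportedly keeps ~90 "memory slots" per image via a
# compressed sparse-attention mechanism, but the design is not published. This sketch
# uses a learned-query cross-attention resampler as a stand-in; all sizes are guesses.
class SlotCompressor(nn.Module):
    def __init__(self, dim: int = 1024, num_slots: int = 90, num_heads: int = 8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, n_patches, dim), e.g. thousands of ViT patch embeddings
        queries = self.slots.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        slots, _ = self.attn(queries, patch_tokens, patch_tokens)
        return slots  # (batch, 90, dim): the only visual state kept around for decoding

compressor = SlotCompressor()
dummy_patches = torch.randn(1, 3249, 1024)  # e.g. an 800x800 image cut into 14x14 patches
print(compressor(dummy_patches).shape)       # torch.Size([1, 90, 1024])
```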

Data collection and cleaning

The team first harvested about 98,000 raw sources that might contain detection annotations. After a two‑stage AI‑driven quality pipeline—semantic filtering (removing nonsensical labels) and geometric checks (discarding truncated or overly large boxes)—the dataset was trimmed to 31,000 high‑quality samples and then balanced to over 40 million annotated instances.
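A minimal sketch of the geometric-check stage described above; the thresholds are placeholders, since the actual filtering rules are not published.

```python
# Placeholder geometric checks: reject degenerate, truncated, or overly large boxes.
def keep_box(box, img_w, img_h, max_area_frac=0.9, min_size=4):
    """Return True if an (x1, y1, x2, y2) box passes basic geometric checks."""
    x1, y1, x2, y2 = box
    # Degenerate or inverted boxes.
    if x2 - x1 < min_size or y2 - y1 < min_size:
        return False
    # Truncated boxes that spill outside the image.
    if x1 < 0 or y1 < 0 or x2 > img_w or y2 > img_h:
        return False
    # Overly large boxes that cover nearly the whole image.
    if (x2 - x1) * (y2 - y1) > max_area_frac * img_w * img_h:
        return False
    return True

sample = {"width": 800, "height": 600,
          "boxes": [(10, 10, 200, 180), (-5, 0, 790, 600), (300, 300, 302, 302)]}
clean = [b for b in sample["boxes"] if keep_box(b, sample["width"], sample["height"])]
print(clean)  # only (10, 10, 200, 180) survives
```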

Three‑phase training pipeline

Stage 1: Foundation – The base model learns to output accurate boxes and points. Because a box is defined by two points, point‑prediction capability emerges automatically.
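The observation that box supervision implies point supervision can be made concrete in a few lines; treating the corners and the center as point labels is our illustration, not the paper’s stated recipe.

```python
# Every box annotation already supplies point targets for free (our illustration).
def points_from_box(box):
    x1, y1, x2, y2 = box
    corners = [(x1, y1), (x2, y2)]
    center = ((x1 + x2) / 2, (y1 + y2) / 2)
    return corners + [center]

print(points_from_box((100, 50, 300, 250)))
# [(100, 50), (300, 250), (200.0, 150.0)]
```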

Stage 2: Expert specialization – Four “training camps” are designed to force visual reasoning:

Counting: coarse‑grained (count all dogs) and fine‑grained (count white dogs lying on the floor) using synthesized GQA‑based constraints.

Spatial reasoning: synthetic CLEVR scenes with explicit execution traces mapped to visual anchors.

Maze navigation: generated mazes of varying shapes (rectangular, circular, hexagonal) using depth‑first search or Prim’s algorithm (see the sketch after this list), requiring step‑by‑step box‑based reasoning.

Path tracing: models must follow a curved line, emitting a sequence of points and slowing down at dense intersections.
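For reference, here is a minimal randomized depth-first-search maze generator on a rectangular grid, one of the two algorithms named above; circular and hexagonal variants would only change the neighbor function. The cell layout and API are our own, not DeepSeek’s.

```python
import random

# Randomized DFS ("recursive backtracker") maze generation on a width x height grid.
def generate_maze(width: int, height: int, seed: int = 0):
    random.seed(seed)
    # 'walls' holds frozensets of adjacent cell pairs whose separating wall still stands.
    visited = {(0, 0)}
    walls = set()
    for x in range(width):
        for y in range(height):
            for dx, dy in ((1, 0), (0, 1)):
                nx, ny = x + dx, y + dy
                if nx < width and ny < height:
                    walls.add(frozenset({(x, y), (nx, ny)}))
    stack = [(0, 0)]
    while stack:
        x, y = stack[-1]
        neighbors = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= x + dx < width and 0 <= y + dy < height
                     and (x + dx, y + dy) not in visited]
        if neighbors:
            nxt = random.choice(neighbors)
            walls.discard(frozenset({(x, y), nxt}))  # knock down the separating wall
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()
    return walls  # the walls that remain define the maze

maze_walls = generate_maze(5, 5)
print(len(maze_walls))  # 40 internal walls minus 24 removed by the spanning tree = 16
```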

Reward models are tailored to each task: counting rewards decay exponentially with relative error; maze rewards penalize wall‑crossing and reward correct use of visual primitives at each step.
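The article gives the shape of these reward signals but not the formulas, so the constants below (the decay rate alpha, the wall penalty, the goal bonus) are placeholders that merely illustrate the described behavior.

```python
import math

# Counting reward: decays exponentially with relative error (alpha is a placeholder).
def counting_reward(predicted: int, ground_truth: int, alpha: float = 4.0) -> float:
    relative_error = abs(predicted - ground_truth) / max(ground_truth, 1)
    return math.exp(-alpha * relative_error)

# Maze reward: bonus for reaching the goal, penalty for each wall crossed on the path.
def maze_reward(path, walls, reached_goal: bool,
                wall_penalty: float = 0.5, goal_bonus: float = 1.0) -> float:
    crossings = sum(1 for a, b in zip(path, path[1:]) if frozenset({a, b}) in walls)
    return (goal_bonus if reached_goal else 0.0) - wall_penalty * crossings

print(counting_reward(9, 10))   # ~0.67: a near-miss keeps most of the reward
print(counting_reward(5, 10))   # ~0.14: a large relative error collapses it
```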

Benchmark results

In a unified API‑based evaluation with low‑budget inference settings, DeepSeek‑V4‑Flash matches or exceeds GPT‑5.4 and Claude‑Sonnet‑4.6 on core tasks. Notably, on the DS_Finegrained_Counting test the model scores 88.7 points versus 84.2 for GPT‑5.4. On maze navigation and path‑tracing it achieves 66.9 and 56.7 points respectively, far ahead of competitors that linger around 30–50 points.

The efficiency advantage is evident: the model processes an 800×800 image while keeping only about 90 KV cache entries.
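As a back-of-the-envelope consistency check (our arithmetic, not a figure quoted by DeepSeek), the two numbers line up if each memory slot summarizes roughly an 84 × 84-pixel region:

$$
84^2 = 7{,}056 \ \text{pixels per slot}, \qquad \frac{800 \times 800}{7{,}056} \approx 90.7 \ \text{slots}.
$$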

Limitations

Resolution remains a bottleneck; extremely fine‑grained scenes can suffer from imprecise boxes. The model also requires specific trigger phrases to activate multimodal reasoning, and the generalization of point‑based primitives to novel topologies remains limited.

Takeaway

By integrating visual primitives into the reasoning chain, DeepSeek demonstrates a concrete method for bridging the language‑vision referential gap, achieving high‑performance multimodal reasoning with unprecedented computational efficiency.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: multimodal AI, model compression, large language models, DeepSeek, visual reasoning, visual primitives
Written by

SuanNi

A community for AI developers that aggregates large-model development services, models, and compute power.
