Can a 3B Open‑Source Multimodal Model Beat GPT‑4V in Math? A Deep Dive into VLR1‑3B

The preview release of the 3‑billion‑parameter VLR1‑3B multimodal model demonstrates state‑of‑the‑art reasoning on math benchmarks, outperforms many commercial closed‑source models, and shows promising results on geometry, physics, and general vision tasks, while also revealing typical hallucination issues.

Tencent Technical Engineering

The authors, Xu Junzhe and Yin Yuyang, introduce the preview version of their open-source multimodal model VLR1-3B, a 3-billion-parameter system trained with reinforcement learning to strengthen its reasoning. Benchmark results on the MathVista and MathVision leaderboards show that VLR1-3B delivers the strongest reasoning performance among models of comparable size, surpassing commercial closed-source systems such as Gemini 1.5 and GPT-4V on several metrics.

Benchmark Highlights

- Average score: 35.7 (top among evaluated models)
- MathVista: 64.8 (best overall)
- MathVision: 25.0
- MathVerse: 33.2

Other strong performers include Qwen2.5‑VL‑3B (31.8 average) and Taichu‑VLR‑3B (33.6 average).

The model’s strength lies in its mathematical reasoning, which the authors demonstrate through a series of real‑world homework assistance scenarios.

AI Homework Assistant

Using VLR1‑3B, the authors built a lightweight “AI homework helper” that not only provides correct answers but also generates step‑by‑step derivations, a crucial feature for math tutoring.
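The article does not publish the helper's code, but the general shape of such a wrapper is easy to sketch. The snippet below is a minimal, hypothetical illustration assuming an OpenAI-compatible chat format for image-plus-text requests (a common convention for locally served open models); the model id `VLR1-3B` and the system prompt are placeholders, not the authors' actual implementation.

```python
# Hypothetical sketch of a "homework helper" request builder for a locally
# served vision-language model. The model id and chat schema are assumptions:
# many open VLM servers accept OpenAI-style messages in which image and text
# parts are mixed inside a single user turn.
import base64
from pathlib import Path


def build_homework_request(image_path: str, question: str) -> dict:
    """Package a photographed problem and a question into one chat request."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": "VLR1-3B",  # placeholder local model id
        "messages": [
            {
                "role": "system",
                "content": "You are a math tutor. Show every derivation step.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                    {"type": "text", "text": question},
                ],
            },
        ],
        "temperature": 0.0,  # deterministic steps are easier to check
    }
```

Setting the temperature to zero is a deliberate choice for tutoring: step-by-step derivations should be reproducible, not sampled creatively.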

Polynomial Computation

Two basic calculation problems taken from actual exam papers were fed to the system. The model accurately recognized the handwritten equations, performed the calculations, and presented clear solution steps.

Another example demonstrates correct formula application and a correct final answer.

Coordinate System Understanding

The model correctly identified the key points of a problem involving a Cartesian plane, explained why both coordinates were negative, analyzed each answer choice, and selected the correct one.

Function Evaluation

Given a function image, VLR1‑3B extracted the formula, computed f(4), and displayed the reasoning and final answer.

Plane Geometry

Two fill‑in‑the‑blank geometry questions were solved with detailed proofs, including the use of symbols such as “∵” and “∴”. The model also tackled a more complex proof problem, delivering a clear logical chain.

Physics

A physics problem was also answered correctly, with the model explaining its reasoning.

Beyond Math: General Multimodal Capabilities

The authors further tested VLR1‑3B on non‑academic visual queries. When shown a cat image, the model accurately identified the breed, pattern, and estimated age.

Drawing on one author's background in autonomous driving, the team also asked the model to recognize vehicles in street scenes; it correctly named vehicle types and colors, and even inferred driving intent in nighttime footage.

Limitations and Future Work

During testing, occasional hallucinations were observed: the model sometimes produced plausible‑looking reasoning that contained factual errors, even when the final answer was correct. The authors suggest that fine‑tuning with domain‑specific data can mitigate these issues.
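One lightweight mitigation, offered here as an illustration rather than anything from the VLR1-3B release, is to spot-check the model's algebraic steps numerically before showing them to a student: evaluate both sides of each claimed identity at random points and flag mismatches.

```python
# Spot-check a claimed algebraic identity by evaluating both sides at random
# sample points; a mismatch flags a likely hallucinated derivation step.
# Illustrative only -- not part of the VLR1-3B release.
import random


def steps_agree(lhs, rhs, trials: int = 100, tol: float = 1e-9) -> bool:
    """Return True if lhs(x) and rhs(x) agree at many random points."""
    for _ in range(trials):
        x = random.uniform(-10, 10)
        if abs(lhs(x) - rhs(x)) > tol * max(1.0, abs(lhs(x))):
            return False
    return True


# Correct expansion: (x + 2)^2 == x^2 + 4x + 4
print(steps_agree(lambda x: (x + 2) ** 2, lambda x: x * x + 4 * x + 4))  # True

# Hallucinated step: (x + 2)^2 != x^2 + 4
print(steps_agree(lambda x: (x + 2) ** 2, lambda x: x * x + 4))  # False
```

Random evaluation cannot prove an identity, but it cheaply catches the kind of plausible-looking-but-wrong intermediate step the authors describe.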

Given its modest 3B parameter size, VLR1‑3B shows promise as a locally runnable “home AI homework assistant.” The team plans to release a detailed technical report, academic paper, and larger‑scale models in the future.

Tags: multimodal AI, open source, benchmark, math reasoning, VLR1-3B
Written by

Tencent Technical Engineering

Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.
