Why Visual Perception Limits STEM Large Models and How CodePercept Breaks the Barrier
The authors demonstrate that visual perception, not reasoning, is the primary bottleneck for STEM multimodal large language models, introduce the CodePercept paradigm and the ICC-1M dataset, and show that code‑driven perception dramatically improves performance, surpassing much larger models on new benchmarks.
Multimodal large language models (MLLMs) often fail on visual STEM reasoning tasks, prompting the fundamental question: is the failure due to limited reasoning ability ("a dumb brain") or poor visual perception ("bad eyesight")?
Systematic Two‑Stage Analysis
The Shanghai Jiao Tong University team, together with the Qwen research group, decomposes the problem into two stages: visual perception (image‑to‑description) and reasoning (text‑only problem solving). By scaling each capability independently while keeping the other fixed, they find that scaling perception consistently yields larger performance gains than scaling reasoning, which indicates that perception is the current bottleneck.
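The decoupled analysis described above can be sketched as a small evaluation grid. This is a minimal illustration, not the authors' evaluation code: the `Captioner` and `Solver` callables, the exact-match scoring, and the function names are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for real models:
# a captioner maps an image path to a text description,
# a solver maps (description, question) to an answer string.
Captioner = Callable[[str], str]
Solver = Callable[[str, str], str]
Problem = Tuple[str, str, str]  # (image_path, question, gold_answer)


def two_stage_accuracy(captioner: Captioner, solver: Solver,
                       problems: List[Problem]) -> float:
    """Accuracy of a caption-then-solve pipeline (exact match, an assumption)."""
    if not problems:
        return 0.0
    correct = 0
    for image_path, question, gold in problems:
        description = captioner(image_path)      # stage 1: perception
        answer = solver(description, question)   # stage 2: text-only reasoning
        correct += int(answer.strip() == gold.strip())
    return correct / len(problems)


def scaling_grid(captioners: Dict[str, Captioner], solvers: Dict[str, Solver],
                 problems: List[Problem]) -> Dict[Tuple[str, str], float]:
    """Score every (captioner, solver) pair so one axis can vary while the other is fixed."""
    return {
        (c_name, s_name): two_stage_accuracy(c_fn, s_fn, problems)
        for c_name, c_fn in captioners.items()
        for s_name, s_fn in solvers.items()
    }
```

Reading the grid along one axis at a time shows how much accuracy moves when only the captioner or only the solver is swapped for a stronger one.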
CodePercept: A Code‑Grounded Visual Perception Paradigm
To address this bottleneck, the team proposes CodePercept, a novel paradigm that uses executable Python code as a powerful visual perception medium. Two code‑driven tasks are introduced:
Code‑Grounded Caption Generation: The model generates captions grounded in code, using concrete facts such as coordinates and counts extracted from the code to eliminate hallucinations.
STEM Image‑to‑Code Translation: The model directly produces executable reconstruction code for the image, removing the ambiguity of natural‑language descriptions.
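To make the idea of code‑grounded captions concrete, the sketch below parses a small, hypothetical reconstruction script and builds a caption only from facts found in the code (circle centers, radii, and the count). The script, the helper names, and the restriction to `Circle(...)` calls with literal arguments are assumptions for illustration; the source does not describe the authors' extraction tooling.

```python
import ast

# Hypothetical reconstruction code for a simple figure.
# It is only parsed here, never executed, so matplotlib is not required.
FIGURE_CODE = '''
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

fig, ax = plt.subplots()
ax.add_patch(Circle((0, 0), 2))
ax.add_patch(Circle((3, 1), 1))
ax.plot([0, 3], [0, 1])
'''


def extract_circle_facts(code: str) -> list:
    """Collect center and radius for each Circle(center, radius) call with literal arguments."""
    calls = [
        node for node in ast.walk(ast.parse(code))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "Circle"
        and len(node.args) >= 2
    ]
    calls.sort(key=lambda n: (n.lineno, n.col_offset))  # keep source order
    facts = []
    for call in calls:
        try:
            center = ast.literal_eval(call.args[0])
            radius = ast.literal_eval(call.args[1])
        except ValueError:
            continue  # skip non-literal arguments in this simple sketch
        facts.append({"center": center, "radius": radius})
    return facts


def grounded_caption(facts: list) -> str:
    """Build a caption that states only what the code contains."""
    parts = [f"center {f['center']}, radius {f['radius']}" for f in facts]
    return f"The figure contains {len(facts)} circle(s): " + "; ".join(parts) + "."


print(grounded_caption(extract_circle_facts(FIGURE_CODE)))
```

Because every number in the caption is read from the code, the description cannot state a count or coordinate that the figure does not have.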
Dataset Construction – ICC‑1M
To train CodePercept, the researchers build the ICC‑1M dataset, containing one million high‑quality Image‑Caption‑Code triples. The dataset is created through three pipelines that ensure image reproduction, image diversity, and solid geometry synthesis, each verified by a three‑stage quality control process (image quality, code quality, and image‑code consistency).
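One way to picture an Image‑Caption‑Code triple and the three‑stage filter is the sketch below. The `ICCTriple` fields, the function names, and the use of a bare syntax check as the "code quality" stage are assumptions; the image‑quality and image‑code consistency checks are passed in as placeholder callables because the source does not specify how they are implemented.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ICCTriple:
    """One hypothetical dataset record: an image, its caption, and the code that draws it."""
    image_path: str
    caption: str
    code: str


Check = Callable[[ICCTriple], bool]


def code_compiles(triple: ICCTriple) -> bool:
    """Simplified code-quality stage: the code must at least be valid Python syntax."""
    try:
        compile(triple.code, "<icc-triple>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def filter_triples(triples: List[ICCTriple], image_ok: Check,
                   image_matches_code: Check) -> List[ICCTriple]:
    """Keep a triple only if it passes all three stages, checked in order."""
    return [
        t for t in triples
        if image_ok(t)                 # stage 1: image quality (placeholder)
        and code_compiles(t)           # stage 2: code quality (syntax only here)
        and image_matches_code(t)      # stage 3: image-code consistency (placeholder)
    ]
```

A real pipeline would likely replace the placeholders with rendering checks and a visual comparison between the stored image and the image produced by running the code.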
Training Strategy
CodePercept is trained in two phases:
Phase 1 – Supervised Fine‑Tuning (CodePercept‑S1): Jointly optimizes the Image‑to‑Caption and Image‑to‑Code tasks, treating code as a formatted caption to strengthen perception.
Phase 2 – Reinforcement Learning (CodePercept‑R1): Applies GRPO‑based RL with rewards for syntactic correctness, execution fidelity, and image‑code similarity, yielding further gains over the supervised model.
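The Phase 2 reward signals can be illustrated with a small sketch. The weights, the gating logic, and the function names below are assumptions rather than the authors' reward design; the second function shows the group‑relative normalization that GRPO uses, where each sampled output is scored against the mean and standard deviation of its own group.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def code_reward(syntax_ok: bool, executed_ok: bool, similarity: float,
                weights: Sequence[float] = (0.2, 0.3, 0.5)) -> float:
    """Combine the three signals into one scalar (weights are illustrative assumptions)."""
    w_syntax, w_exec, w_sim = weights
    if not syntax_ok:
        return 0.0                      # unparseable code earns nothing
    reward = w_syntax
    if executed_ok:
        clipped = max(0.0, min(1.0, similarity))
        reward += w_exec + w_sim * clipped  # similarity only counts if the code ran
    return reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical samples for the same image.
group = [
    code_reward(False, False, 0.0),
    code_reward(True, False, 0.0),
    code_reward(True, True, 0.4),
    code_reward(True, True, 0.9),
]
print(group_relative_advantages(group))
```

Outputs that render a closer match than their siblings receive positive advantages, while code that fails to parse or run is pushed down relative to the group.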
Benchmarks and Results
The team releases two evaluation suites:
STEM2Code‑Eval: A manually curated benchmark of 1,000 images requiring models to generate Python code that perfectly reconstructs the original image.
STEM2Code‑Eval Benchmark: Extends the evaluation to assess pure visual perception via image reconstruction.
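An image‑reconstruction benchmark needs, at minimum, a way to run each generated script and record whether it executes. The sketch below shows one such harness under stated assumptions: the function names and the timeout are hypothetical, the benchmark's actual scoring is not described in the source, and a plain subprocess is not a security sandbox.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def run_candidate(code: str, timeout_s: float = 10.0) -> bool:
    """Run generated code in a fresh interpreter; True if it exits with status 0 in time."""
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "candidate.py")
        with open(script, "w", encoding="utf-8") as handle:
            handle.write(code)
        try:
            result = subprocess.run(
                [sys.executable, script],
                cwd=workdir,            # any files the script writes stay in the temp dir
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def execution_rate(candidates: List[str]) -> float:
    """Fraction of candidate scripts that run to completion."""
    if not candidates:
        return 0.0
    return sum(run_candidate(c) for c in candidates) / len(candidates)
```

A full evaluation would go on to compare the rendered output with the reference image; this harness only covers the executability step.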
Experiments using the Qwen3‑VL architecture show striking outcomes:
On the traditional caption‑plus‑solver pipeline, CodePercept‑8B‑S1 outperforms the much larger open‑source Qwen2.5‑VL‑72B by 6.2% and approaches the performance of leading closed‑source models such as Claude‑Opus 4.1‑Thinking and GPT‑5‑Thinking.
On the pure perception task (STEM2Code‑Eval), the RL‑enhanced CodePercept‑8B‑R1 achieves a score of 63.56, a 3.92‑point gain over the base model, and surpasses flagship models such as Seed 1.6‑Vision and Qwen3‑VL‑Plus despite having far fewer parameters.
Conclusion
CodePercept demonstrates that equipping multimodal LLMs with code‑driven visual perception can unlock complex scientific reasoning, establishing a new direction where executable code serves as the "sharp eye" needed to solve challenging STEM problems.