Can Vision Transformers Crack the ARC Puzzle? Introducing VARC
MIT researchers argue that the ARC benchmark is essentially a visual problem and present the Vision ARC (VARC) framework, which reformulates ARC as an image‑to‑image translation task using a Vision Transformer, achieving human‑level accuracy through a novel canvas representation and test‑time training.
Background
The Abstraction and Reasoning Corpus (ARC) was introduced by François Chollet in 2019 to evaluate abstract reasoning, a core component of human intelligence. Each ARC task provides only a few input-output grid examples (typically 2–4) and requires a model to infer the underlying transformation rule and apply it to held-out test inputs.
VARC Method Overview
Canvas representation: Input grids are placed on a fixed-size canvas (e.g., 64×64 pixels). Before placement, each grid undergoes random scaling and translation, which encourages translation and scale invariance and creates richer local patterns (a minimal placement sketch follows).
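As a concrete illustration, here is a minimal NumPy sketch of such a placement, assuming a 64×64 canvas and a pad value of 10 for cells outside the grid; the function name and defaults are illustrative, not taken from the VARC code.

    import numpy as np

    def place_on_canvas(grid, canvas_size=64, rng=None, pad_value=10):
        # Place an ARC grid (color indices 0-9) on a fixed-size canvas.
        # With rng=None the placement is deterministic (scale 1, top-left
        # corner); otherwise an integer scale and an offset are sampled.
        # Assumes the grid fits the canvas (ARC grids are at most 30x30).
        h, w = grid.shape
        if rng is None:
            s, top, left = 1, 0, 0
        else:
            s = int(rng.integers(1, max(1, canvas_size // max(h, w)) + 1))
            top = int(rng.integers(0, canvas_size - h * s + 1))
            left = int(rng.integers(0, canvas_size - w * s + 1))
        scaled = np.kron(grid, np.ones((s, s), dtype=grid.dtype))  # nearest-neighbour upscale
        canvas = np.full((canvas_size, canvas_size), pad_value, dtype=grid.dtype)
        canvas[top:top + h * s, left:left + w * s] = scaled
        return canvas

Presumably the input and output grids of one demonstration pair share a single sampled transform, so the mapping between them is preserved on the canvas.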
Vision Transformer backbone: The canvas is processed by a standard Vision Transformer (ViT). A learnable task token conditions the model on the specific ARC task, and fixed 2-D sinusoidal positional embeddings preserve the canvas's spatial structure (both are sketched below).
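The PyTorch sketch below shows the two conditioning ingredients in isolation: fixed 2-D sine-cosine positional embeddings (a standard construction) and a per-task learnable token. All dimensions and names here are illustrative assumptions.

    import torch

    def sincos_2d(embed_dim, grid_size):
        # Fixed 2-D sine-cosine positional embeddings of shape
        # (grid_size**2, embed_dim): half the channels encode the row
        # index, the other half the column index.
        assert embed_dim % 4 == 0
        quarter = embed_dim // 4
        omega = 10000.0 ** (-torch.arange(quarter).float() / quarter)
        pos = torch.arange(grid_size).float()
        angles = torch.einsum("p,d->pd", pos, omega)             # (grid_size, dim/4)
        emb_1d = torch.cat([angles.sin(), angles.cos()], dim=1)  # (grid_size, dim/2)
        rows = emb_1d[:, None, :].expand(-1, grid_size, -1)      # varies along rows
        cols = emb_1d[None, :, :].expand(grid_size, -1, -1)      # varies along columns
        return torch.cat([rows, cols], dim=2).reshape(grid_size * grid_size, embed_dim)

    # Task conditioning: one learnable embedding per training task,
    # prepended to the patch tokens before the transformer blocks
    # (400 tasks and 256 dimensions are illustrative sizes).
    task_tokens = torch.nn.Embedding(num_embeddings=400, embedding_dim=256)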
Two-stage training:
Offline training on all 400 ARC training tasks learns a general ViT model from scratch (no external pre-training).
Test-time training (TTT) fine-tunes the model on the few demonstration examples of a new, unseen task, allowing rapid adaptation at inference time (a minimal TTT loop is sketched after this list).
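A minimal sketch of the second stage, assuming the model maps a canvas of color indices to per-cell color logits; the step count, learning rate, and optimizer are illustrative choices, not the paper's hyper-parameters.

    import copy
    import torch
    import torch.nn.functional as F

    def test_time_train(model, demos, steps=100, lr=1e-4):
        # Adapt an offline-trained model to one unseen task using only
        # its demonstration pairs. `demos` holds (input, target) canvases
        # as long tensors of color indices. A deep copy is tuned so the
        # shared offline weights stay intact across tasks.
        model = copy.deepcopy(model)
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(steps):
            for x, y in demos:
                logits = model(x.unsqueeze(0))  # (1, H*W, num_colors), assumed
                loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
                opt.zero_grad()
                loss.backward()
                opt.step()
        return model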
Experimental Results
On the ARC‑1 benchmark the VARC ensemble achieves 60.4% accuracy, matching the reported human average (60.2%) and surpassing other from‑scratch methods such as HRM and TRM. VARC also remains competitive with much larger LLM‑based approaches while using only ARC data.
Ablation Study
Starting from a naïve baseline, incremental addition of visual priors—2‑D positional encoding, patchification of the canvas, and scale/translation augmentations—yields a cumulative improvement of 27.7 percentage points, demonstrating the importance of treating ARC as a visual problem.
Visualization and Analysis
Attention‑map visualizations show that the model learns meaningful visual patterns. t‑SNE of the learned task embeddings clusters semantically similar tasks (e.g., coloring or logical operations), indicating that VARC captures abstract relationships rather than memorizing examples.
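For reference, a clustering picture of this kind can be produced with off-the-shelf t-SNE; the embedding matrix below is a random placeholder standing in for the learned task tokens.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Placeholder for the learned task-token matrix, (num_tasks, embed_dim).
    task_embeddings = np.random.randn(400, 256).astype(np.float32)

    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(task_embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], s=8)
    plt.title("t-SNE of task embeddings")
    plt.show()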
Conclusion and Outlook
VARC provides a clean, vision-centric paradigm for solving ARC, showing that abstract reasoning can emerge directly from pixel data without linguistic mediation. Future work may explore stronger vision architectures, richer visual priors, or large-scale image pre-training to push performance further.
Paper Details
Title: ARC Is a Vision Problem!
Authors: Keya Hu, Ali Cy, Linlu Qiu, Xiaoman Delores Ding, Runqian Wang, Yeyin Eva Zhu, Jacob Andreas, Kaiming He
Institution: Massachusetts Institute of Technology (MIT)
arXiv URL: https://arxiv.org/abs/2511.14761
Project repository: https://github.com/lillian039/VARC
Code example
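The official implementation lives in the repository above. Purely as an illustration, the hypothetical glue code below ties together the sketches from the method section (place_on_canvas and test_time_train); an ARC task is assumed to be the usual JSON-style dict with "train" and "test" lists of input/output pairs.

    import numpy as np
    import torch

    def solve_task(model, task, canvas_size=64):
        # Hypothetical pipeline: deterministic canvas placement (for
        # simplicity), test-time training on the demonstrations, then
        # per-cell color prediction for the first test input.
        # place_on_canvas / test_time_train are the sketches given above.
        demos = []
        for pair in task["train"]:
            x = torch.from_numpy(place_on_canvas(np.array(pair["input"]), canvas_size)).long()
            y = torch.from_numpy(place_on_canvas(np.array(pair["output"]), canvas_size)).long()
            demos.append((x, y))
        adapted = test_time_train(model, demos)
        adapted.eval()
        test_input = np.array(task["test"][0]["input"])
        x = torch.from_numpy(place_on_canvas(test_input, canvas_size)).long()
        with torch.no_grad():
            logits = adapted(x.unsqueeze(0))  # (1, H*W, num_colors), assumed
        pred = logits.argmax(-1).view(canvas_size, canvas_size)
        # The prediction is a full canvas; cropping it back to the answer
        # grid depends on the placement and is omitted here.
        return pred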