Can Vision Transformers Crack the ARC Puzzle? Introducing VARC
MIT researchers argue that the ARC benchmark is essentially a visual problem and present the Vision ARC (VARC) framework, which reformulates ARC as an image‑to‑image translation task using a Vision Transformer, achieving human‑level accuracy through a novel canvas representation and test‑time training.
Background
The Abstraction and Reasoning Corpus (ARC) was introduced by François Chollet in 2019 to evaluate abstract reasoning, a core component of human intelligence. Each ARC task provides only a few input-output grid examples (typically 2–4) and requires a model to infer the underlying transformation rule and apply it to held-out test inputs.
VARC Method Overview
Canvas representation: Input grids are placed on a fixed-size canvas (e.g., 64×64 pixels). Before placement, each grid undergoes random scaling and translation, which encourages translation and scale invariance and creates richer local patterns (a minimal placement sketch follows).
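As a concrete illustration, here is a minimal NumPy sketch of such a placement, assuming a 64×64 canvas and a pad value of 10 for cells outside the grid; the function name and defaults are illustrative, not taken from the VARC code.

    import numpy as np

    def place_on_canvas(grid, canvas_size=64, rng=None, pad_value=10):
        # Place an ARC grid (color indices 0-9) on a fixed-size canvas.
        # With rng=None the placement is deterministic (scale 1, top-left
        # corner); otherwise an integer scale and an offset are sampled.
        # Assumes the grid fits the canvas (ARC grids are at most 30x30).
        h, w = grid.shape
        if rng is None:
            s, top, left = 1, 0, 0
        else:
            s = int(rng.integers(1, max(1, canvas_size // max(h, w)) + 1))
            top = int(rng.integers(0, canvas_size - h * s + 1))
            left = int(rng.integers(0, canvas_size - w * s + 1))
        scaled = np.kron(grid, np.ones((s, s), dtype=grid.dtype))  # nearest-neighbour upscale
        canvas = np.full((canvas_size, canvas_size), pad_value, dtype=grid.dtype)
        canvas[top:top + h * s, left:left + w * s] = scaled
        return canvas

Presumably the input and output grids of one demonstration pair share a single sampled transform, so the mapping between them is preserved on the canvas.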
Vision Transformer backbone: The canvas is processed by a standard Vision Transformer (ViT). A learnable task token conditions the model on the specific ARC task, and fixed 2-D sinusoidal positional embeddings preserve the canvas's spatial structure (both are sketched below).
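The PyTorch sketch below shows the two conditioning ingredients in isolation: fixed 2-D sine-cosine positional embeddings (a standard construction) and a per-task learnable token. All dimensions and names here are illustrative assumptions.

    import torch

    def sincos_2d(embed_dim, grid_size):
        # Fixed 2-D sine-cosine positional embeddings of shape
        # (grid_size**2, embed_dim): half the channels encode the row
        # index, the other half the column index.
        assert embed_dim % 4 == 0
        quarter = embed_dim // 4
        omega = 10000.0 ** (-torch.arange(quarter).float() / quarter)
        pos = torch.arange(grid_size).float()
        angles = torch.einsum("p,d->pd", pos, omega)             # (grid_size, dim/4)
        emb_1d = torch.cat([angles.sin(), angles.cos()], dim=1)  # (grid_size, dim/2)
        rows = emb_1d[:, None, :].expand(-1, grid_size, -1)      # varies along rows
        cols = emb_1d[None, :, :].expand(grid_size, -1, -1)      # varies along columns
        return torch.cat([rows, cols], dim=2).reshape(grid_size * grid_size, embed_dim)

    # Task conditioning: one learnable embedding per training task,
    # prepended to the patch tokens before the transformer blocks
    # (400 tasks and 256 dimensions are illustrative sizes).
    task_tokens = torch.nn.Embedding(num_embeddings=400, embedding_dim=256)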
Two-stage training:
Offline training on all 400 ARC training tasks learns a general ViT model from scratch (no external pre-training).
Test-time training (TTT) fine-tunes the model on the few demonstration examples of a new, unseen task, allowing rapid adaptation at inference time (a minimal TTT loop is sketched after this list).
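A minimal sketch of the second stage, assuming the model maps a canvas of color indices to per-cell color logits; the step count, learning rate, and optimizer are illustrative choices, not the paper's hyper-parameters.

    import copy
    import torch
    import torch.nn.functional as F

    def test_time_train(model, demos, steps=100, lr=1e-4):
        # Adapt an offline-trained model to one unseen task using only
        # its demonstration pairs. `demos` holds (input, target) canvases
        # as long tensors of color indices. A deep copy is tuned so the
        # shared offline weights stay intact across tasks.
        model = copy.deepcopy(model)
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(steps):
            for x, y in demos:
                logits = model(x.unsqueeze(0))  # (1, H*W, num_colors), assumed
                loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
                opt.zero_grad()
                loss.backward()
                opt.step()
        return model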
Experimental Results
On the ARC‑1 benchmark the VARC ensemble achieves 60.4% accuracy, matching the reported human average (60.2%) and surpassing other from‑scratch methods such as HRM and TRM. VARC also remains competitive with much larger LLM‑based approaches while using only ARC data.
Ablation Study
Starting from a naïve baseline, incremental addition of visual priors—2‑D positional encoding, patchification of the canvas, and scale/translation augmentations—yields a cumulative improvement of 27.7 percentage points, demonstrating the importance of treating ARC as a visual problem.
Visualization and Analysis
Attention‑map visualizations show that the model learns meaningful visual patterns. t‑SNE of the learned task embeddings clusters semantically similar tasks (e.g., coloring or logical operations), indicating that VARC captures abstract relationships rather than memorizing examples.
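For reference, a clustering picture of this kind can be produced with off-the-shelf t-SNE; the embedding matrix below is a random placeholder standing in for the learned task tokens.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Placeholder for the learned task-token matrix, (num_tasks, embed_dim).
    task_embeddings = np.random.randn(400, 256).astype(np.float32)

    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(task_embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], s=8)
    plt.title("t-SNE of task embeddings")
    plt.show()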
Conclusion and Outlook
VARC provides a clean, vision-centric paradigm for solving ARC, showing that abstract reasoning can emerge directly from pixel data without linguistic mediation. Future work may explore stronger vision architectures, richer visual priors, or large-scale image pre-training to push performance further.
Paper Details
Title: ARC Is a Vision Problem!
Authors: Keya Hu, Ali Cy, Linlu Qiu, Xiaoman Delores Ding, Runqian Wang, Yeyin Eva Zhu, Jacob Andreas, Kaiming He
Institution: Massachusetts Institute of Technology (MIT)
arXiv URL: https://arxiv.org/abs/2511.14761
Project repository: https://github.com/lillian039/VARC
Code example
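The official implementation lives in the repository above. Purely as an illustration, the hypothetical glue code below ties together the sketches from the method section (place_on_canvas and test_time_train); an ARC task is assumed to be the usual JSON-style dict with "train" and "test" lists of input/output pairs.

    import numpy as np
    import torch

    def solve_task(model, task, canvas_size=64):
        # Hypothetical pipeline: deterministic canvas placement (for
        # simplicity), test-time training on the demonstrations, then
        # per-cell color prediction for the first test input.
        # place_on_canvas / test_time_train are the sketches given above.
        demos = []
        for pair in task["train"]:
            x = torch.from_numpy(place_on_canvas(np.array(pair["input"]), canvas_size)).long()
            y = torch.from_numpy(place_on_canvas(np.array(pair["output"]), canvas_size)).long()
            demos.append((x, y))
        adapted = test_time_train(model, demos)
        adapted.eval()
        test_input = np.array(task["test"][0]["input"])
        x = torch.from_numpy(place_on_canvas(test_input, canvas_size)).long()
        with torch.no_grad():
            logits = adapted(x.unsqueeze(0))  # (1, H*W, num_colors), assumed
        pred = logits.argmax(-1).view(canvas_size, canvas_size)
        # The prediction is a full canvas; cropping it back to the answer
        # grid depends on the placement and is omitted here.
        return pred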