How Small Can a Transformer Get? Inside the 121‑Parameter AdderBoard Challenge

This article chronicles the AdderBoard competition, detailing how researchers compressed a Transformer for 10‑digit addition down to just 121 parameters, the experimental rules, the contrasting hand‑coded and data‑driven approaches, and the insights gained about model minimalism and discoverability.

Data Party THU

Challenge Definition

The AdderBoard challenge asks: what is the smallest autoregressive Transformer that can reliably perform 10‑digit integer addition with at least 99% exact‑match accuracy? The task originated from a Microsoft Research experiment that asked two AI agents (Claude Code and Codex) to minimise the parameter count of a Transformer while meeting the accuracy requirement.

Task Specification

# Objective
Train a vanilla autoregressive Transformer from scratch that achieves ≥99% exact‑match accuracy on 10‑digit addition (A + B) using cross‑entropy loss.

# Data pipeline
- preprocess(A, B) → model_input: deterministic tokenisation of the two operands and a separator token.
- postprocess(model_output) → C: deterministic detokenisation that yields the sum C.
Both functions must be pure code (no learned components).

# Evaluation
- Held‑out test set of ≥10 000 random 10‑digit pairs.
- No answer encoding in the input, no calculator or symbolic solver at inference time.
- Accuracy measured as full‑sequence exact match.

# Compute constraints
- Single‑machine, limited GPU memory and wall‑clock time.
- Manual monitoring and hyper‑parameter adjustment.

# Deliverables
1. Architecture description (layers, hidden dimension, heads, feed‑forward dimension, total parameters, context length, vocab size).
2. Data‑pipeline code and rationale.
3. Training configuration (optimizer, learning‑rate schedule, batch size, total steps, any curriculum).
4. Training curves (loss vs. step, validation accuracy vs. step).
5. Final evaluation on the 10 000‑example test set.
6. Full experimental log of attempts and reasoning.
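
To make the pure‑code pipeline requirement concrete, here is a minimal sketch of one possible digit‑level encoding. The token IDs, zero‑padding, and separator choice are illustrative assumptions, not the challenge's canonical scheme.

VOCAB = {str(d): d for d in range(10)}
VOCAB.update({"+": 10, "=": 11})

def preprocess(a: int, b: int) -> list[int]:
    # Zero-pad both operands to 10 digits and insert separator tokens.
    return [VOCAB[ch] for ch in f"{a:010d}+{b:010d}="]

def postprocess(output_tokens: list[int]) -> int:
    # Map predicted digit tokens back to the integer sum C.
    return int("".join(str(t) for t in output_tokens if t < 10))

Exact‑match evaluation then amounts to checking postprocess(generate(preprocess(A, B))) == A + B for every held‑out pair, where generate stands for the model's autoregressive decoding.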

Key Results and Contrasting Architectures

Two AI agents produced markedly different solutions:

Claude Code kept the model generic and arrived at a 6 080‑parameter architecture.

Codex pursued extreme compression, encoding both numbers into a single token and achieving a 1 644‑parameter model.
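
For orientation, a rough back‑of‑the‑envelope count for a vanilla decoder‑only Transformer is sketched below. It assumes RMSNorm, no bias terms, learned positional embeddings, and a tied output head, so it will not match every entry's exact configuration.

def transformer_param_count(d, n_layers, d_ff, vocab, ctx):
    # Per layer: Q, K, V, O projections (4*d*d), feed-forward up/down (2*d*d_ff),
    # and two RMSNorm gains (2*d). The head count does not change the total
    # as long as head_dim = d / n_heads.
    per_layer = 4 * d * d + 2 * d * d_ff + 2 * d
    # Shared: token embeddings (tied with the output head), learned positions,
    # and a final norm.
    return n_layers * per_layer + vocab * d + ctx * d + d

The hidden dimension d dominates this count, which is why the smallest entries below shrink it to single digits (d=3 in the 121‑parameter solution).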

The open‑source repository for the challenge is:

https://github.com/anadim/AdderBoard

Hand‑Coded (White‑Box) Stream

Researchers explored manually crafted weight settings to probe theoretical limits.

130‑parameter solution (by @cosminscn): uses an 11‑period triangular function for positional alignment and carefully set biases so that a ReLU activation implements the carry‑over logic.
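
One way biased ReLUs can realise a hard 0/1 carry is sketched below; it illustrates the principle rather than reproducing the 130‑parameter entry's exact weights.

def carry(s: int) -> int:
    # s = a_i + b_i + incoming carry, an integer in [0, 19].
    # Two shifted ReLUs yield exactly 1 whenever s >= 10, and 0 otherwise.
    relu = lambda x: max(x, 0)
    return relu(s - 9) - relu(s - 10)

assert all(carry(s) == int(s >= 10) for s in range(20))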

121‑parameter solution (by @Wonderfall): starts from the vanilla Qwen‑3 architecture, reduces the layer count to one, shrinks the hidden dimension to d=3, and compresses the feed‑forward layer to dimension 2. The remaining carry signal is routed through the RMSNorm layer, exploiting its slope as a carrier for the carry state.

Data‑Driven (Trained) Stream

In this stream the optimizer must discover the addition rules on its own in ultra‑low‑dimensional spaces, a regime where convergence is fragile and highly sensitive to initialisation and hyper‑parameters.

491‑parameter model: replaces LayerNorm with RMSNorm, dropping the bias (shift) terms and saving 21 parameters.

456‑parameter model: ties the Key and Value projection matrices because their functional roles overlap in the alignment task.
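
A sketch of what tying the key and value projections can look like is shown below; the single‑head formulation (with causal masking omitted for brevity) is an assumption, not the competitor's implementation.

import torch
import torch.nn as nn

class TiedKVAttention(nn.Module):
    # Keys and values share one projection, saving d*d parameters
    # relative to separate K and V matrices.
    def __init__(self, d: int):
        super().__init__()
        self.q = nn.Linear(d, d, bias=False)
        self.kv = nn.Linear(d, d, bias=False)  # used as both K and V
        self.out = nn.Linear(d, d, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, kv = self.q(x), self.kv(x)
        scores = q @ kv.transpose(-2, -1) / (kv.shape[-1] ** 0.5)
        return self.out(torch.softmax(scores, dim=-1) @ kv)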

311‑parameter model (maintained by Reza Bayat): trained for 162 000 steps before plateauing at roughly 2 780 errors. Three rounds of learning‑rate‑scaled fine‑tuning reduced the error count to 1, i.e. 99.99% accuracy on the 10 000‑example test set.

The test set behind these numbers combines hand‑picked edge cases with uniformly random 10‑digit pairs, as in the snippet below; the import and the undefined names (seed, num_tests, edge_cases) are filled in here as assumptions.

import random

seed, num_tests = 1234, 10_000  # assumed values, not specified in the excerpt
edge_cases = [(0, 0), (9_999_999_999, 9_999_999_999), (9_999_999_999, 1)]  # assumed examples

rng = random.Random(seed)
random_cases = [
    (rng.randint(0, 9_999_999_999), rng.randint(0, 9_999_999_999))
    for _ in range(num_tests)
]
all_cases = edge_cases + random_cases

Microscopic Structural Analysis

The 121‑parameter winner keeps the original Qwen‑3 architecture unchanged except for dimension reductions, demonstrating that extreme compression does not require a novel backbone.

Hand‑coded solutions bypass the discoverability barrier of stochastic gradient descent, explaining why they can reach far lower parameter counts than trained models.

Rank‑3 matrix decomposition combined with prolonged “grokking” phases enabled the 311‑parameter model to cross the dimensional cliff from d=7 to d=4.
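
As a rough illustration of what a low‑rank factorisation buys: replacing a d_out × d_in weight with two rank‑r factors cuts its parameter count from d_in·d_out to r·(d_in + d_out). The module below is a generic sketch, not the 311‑parameter model's code.

import torch.nn as nn

class LowRankLinear(nn.Module):
    # W ≈ B @ A with A: (r x d_in) and B: (d_out x r), i.e. rank at most r.
    def __init__(self, d_in: int, d_out: int, r: int = 3):
        super().__init__()
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)

    def forward(self, x):
        return self.B(self.A(x))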

A sharp performance cliff appears around 800 parameters; dropping below this threshold makes convergence dramatically harder.

Single‑layer Transformers outperform two‑layer variants at equal parameter budgets, highlighting width over depth for ultra‑compact models.

Learnable positional encodings are crucial: replacing them with fixed sinusoidal encodings in the 491‑parameter model caused zero successful runs across 56 random seeds.
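
For reference, the fixed sinusoidal table used in that ablation has no trainable parameters, whereas a learned table contributes ctx·d parameters the optimizer can shape freely; the construction below is the standard one, with placeholder dimensions.

import math
import torch
import torch.nn as nn

def sinusoidal_table(ctx: int, d: int) -> torch.Tensor:
    # Fixed sin/cos positional encodings: zero trainable parameters.
    pos = torch.arange(ctx, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))
    table = torch.zeros(ctx, d)
    table[:, 0::2] = torch.sin(pos * div)
    table[:, 1::2] = torch.cos(pos * div)[:, : d // 2]
    return table

learned_table = nn.Embedding(23, 8)  # ctx=23, d=8 are placeholder values; ctx*d trainable parameters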

Conclusions

The AdderBoard project shows that a vanilla Transformer can perform exact 10‑digit addition with as few as 121 parameters when dimensions are aggressively reduced and when the carry signal is encoded via RMSNorm. The challenge also illustrates a new research paradigm—“Vibe Researching”—where humans define extreme goals and constraints while AI agents handle low‑level weight synthesis, massive seed searches, and fine‑tuning.

The open‑source verification script in the repository allows anyone to attempt further reductions.
