Training a Transformer on a 1970s PDP‑11 Takes Only 5.5 Minutes
A developer recreated a 1970s PDP‑11 environment, wrote a single‑layer, single‑head Transformer in assembly, and trained it on a sequence‑reversal task, achieving 100% accuracy after about 350 steps and a total training time of roughly 5.5 minutes.
Imagine bringing today’s large‑model techniques back to the 1970s, when there were no GPUs, no CUDA, no floating‑point units, and only a PDP‑11 minicomputer programmed in assembly.
Project ATTN‑11
The author reproduced the era’s hardware and implemented a single‑layer, single‑head Transformer entirely in PDP‑11 assembly, calling the project ATTN‑11. The model is trained on a simple “sequence reversal” task: given a list of digits, the network must output the digits in reverse order (e.g., input 4 7 4 9 6 3 6 5 → output 5 6 3 6 9 4 7 4). This task forces the network to learn positional mappings, which is the core capability of the self‑attention mechanism.
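The article does not show the data pipeline, but the task is simple to picture. A minimal C sketch of how such training pairs could be generated (the fixed sequence length and all names here are illustrative assumptions, not taken from ATTN‑11):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SEQ_LEN 8   /* assumption: fixed-length sequences of 8 digits */

    /* Fill `in` with random digits 0-9 and `out` with the same digits reversed. */
    static void make_pair(int in[SEQ_LEN], int out[SEQ_LEN]) {
        for (int i = 0; i < SEQ_LEN; i++)
            in[i] = rand() % 10;
        for (int i = 0; i < SEQ_LEN; i++)
            out[i] = in[SEQ_LEN - 1 - i];
    }

    int main(void) {
        int in[SEQ_LEN], out[SEQ_LEN];
        srand((unsigned)time(NULL));
        make_pair(in, out);
        for (int i = 0; i < SEQ_LEN; i++) printf("%d ", in[i]);
        printf("-> ");
        for (int i = 0; i < SEQ_LEN; i++) printf("%d ", out[i]);
        printf("\n");
        return 0;
    }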
Results show that the 1,216‑parameter model reaches 100% accuracy after roughly 350 training steps, and the whole training run finishes in about 5.5 minutes. The author humorously titles the project “Paper Tape is All You Need”.
Community Reaction
Readers compared the achievement to modern scaling‑law trends, noting that even with extremely limited resources the Transformer can close the functional loop, and many wondered whether such capabilities have always been possible. Some pointed out that later machines like the 1984 Cray X‑MP or the 1990s Cray T3E could train far larger models, emphasizing that the real bottleneck is often ideas rather than hardware.
Architecture
The implementation adds three components to the basic neural‑network stack:
Self‑attention: dot‑product scores between queries and keys.
Positional encoding: learned position embeddings added to the input.
Softmax: conversion of scores to a probability distribution.
The resulting model is a minimal Transformer consisting of an embedding layer, a residual self‑attention layer, and an output mapping. It lacks layer‑norm, feed‑forward networks, and a decoder.
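The article does not reproduce the model code, but the shape of the forward pass follows from that description. Below is a floating‑point C sketch of such a layer (all dimensions and names are illustrative, and the real ATTN‑11 works in fixed point): embedding plus learned positions, a residual single‑head self‑attention, and an output mapping to digit logits.

    #include <math.h>

    #define T 8    /* sequence length (illustrative) */
    #define D 8    /* embedding width (illustrative) */
    #define V 10   /* vocabulary: the digits 0-9     */

    /* Illustrative parameters; the real ATTN-11 stores these in fixed point. */
    float emb[V][D], pos[T][D];          /* token + learned position embeddings */
    float Wq[D][D], Wk[D][D], Wv[D][D];  /* single attention head               */
    float Wo[D][V];                      /* output mapping to digit logits      */

    static void matvec(float m[D][D], const float *x, float *y) {
        for (int i = 0; i < D; i++) {
            y[i] = 0.0f;
            for (int j = 0; j < D; j++) y[i] += m[i][j] * x[j];
        }
    }

    /* Forward pass: digit tokens in, per-position digit logits out. */
    void forward(const int tok[T], float logits[T][V]) {
        float x[T][D], q[T][D], k[T][D], v[T][D], ctx[T][D];

        /* 1. embedding + learned positional encoding */
        for (int t = 0; t < T; t++)
            for (int d = 0; d < D; d++)
                x[t][d] = emb[tok[t]][d] + pos[t][d];

        /* 2. project to queries, keys, values */
        for (int t = 0; t < T; t++) {
            matvec(Wq, x[t], q[t]);
            matvec(Wk, x[t], k[t]);
            matvec(Wv, x[t], v[t]);
        }

        /* 3. residual self-attention: softmax over dot-product scores */
        for (int t = 0; t < T; t++) {
            float score[T], mx = -1e30f, sum = 0.0f;
            for (int s = 0; s < T; s++) {
                score[s] = 0.0f;
                for (int d = 0; d < D; d++) score[s] += q[t][d] * k[s][d];
                if (score[s] > mx) mx = score[s];
            }
            for (int s = 0; s < T; s++) { score[s] = expf(score[s] - mx); sum += score[s]; }
            for (int d = 0; d < D; d++) {
                float a = 0.0f;
                for (int s = 0; s < T; s++) a += (score[s] / sum) * v[s][d];
                ctx[t][d] = x[t][d] + a;          /* residual connection */
            }
        }

        /* 4. output mapping to logits over the 10 digits */
        for (int t = 0; t < T; t++)
            for (int c = 0; c < V; c++) {
                logits[t][c] = 0.0f;
                for (int d = 0; d < D; d++) logits[t][c] += ctx[t][d] * Wo[d][c];
            }
    }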
Optimizations for 1970s Hardware
The first version used a uniform learning rate of 0.01: 100 steps took roughly 25 minutes, and about 1,500 steps were needed to reach full accuracy, which works out to roughly 6.5 hours on the PDP‑11 and potentially a week on an IBM 1130. To make training feasible, the author switched to hierarchical learning rates: a higher rate for the attention weights and a lower rate for the output mapping. This cut the required steps to about 600 (≈2.5 hours).
Only basic stochastic gradient descent (SGD) is used; more advanced optimizers like Adam would triple memory usage and add costly square‑root/division operations. The hierarchical learning‑rate scheme achieves comparable results without extra memory, allowing the model to fit into 32 KB of core memory instead of 64 KB.
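The article does not give the concrete rates, so the numbers below are hypothetical; the scheme itself is just plain SGD with a per‑group step size.

    /* Plain SGD with per-group ("hierarchical") learning rates.
       The rate values are hypothetical; the article only says that the
       attention weights use a higher rate than the output mapping. */
    #define LR_ATTN 0.05f   /* assumption: higher rate for attention weights */
    #define LR_OUT  0.005f  /* assumption: lower rate for the output mapping */

    void sgd_step(float *attn_w, const float *attn_g, int n_attn,
                  float *out_w,  const float *out_g,  int n_out) {
        for (int i = 0; i < n_attn; i++) attn_w[i] -= LR_ATTN * attn_g[i];
        for (int i = 0; i < n_out;  i++) out_w[i]  -= LR_OUT  * out_g[i];
    }

Unlike Adam, this keeps no per‑parameter optimizer state, which is why it needs no extra memory.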
NN11 Compute Stack
ATTN‑11 relies on NN11, a minimal fixed‑point neural‑network stack designed for the PDP‑11. NN11 mirrors BLAS with three layers: scalar operations (FXMATH), vector operations (VECOP), and matrix‑vector operations (MATOP). Two additional modules—ACTFN (activation lookup tables) and LAYER (composition of operations)—provide the necessary functionality for attention and mapping.
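The article names the modules but not their routines or calling conventions, so everything in the following C rendering of the layering is an invented illustration of the idea rather than NN11's real interface.

    #include <stdint.h>

    typedef int16_t q8_t;   /* Q8 fixed point: stored value = real value * 256 */

    /* FXMATH level: scalar fixed-point primitives (signature hypothetical). */
    static q8_t fxmul(q8_t a, q8_t b) {
        return (q8_t)(((int32_t)a * b) >> 8);      /* Q8 * Q8 = Q16 -> Q8 */
    }

    /* VECOP level: vector operations such as a dot product (hypothetical). */
    static q8_t vdot(const q8_t *a, const q8_t *b, int n) {
        int32_t acc = 0;                           /* widen to avoid overflow */
        for (int i = 0; i < n; i++) acc += (int32_t)a[i] * b[i];
        return (q8_t)(acc >> 8);
    }

    /* MATOP level: matrix-vector products built on VECOP (hypothetical). */
    static void matvec_q8(const q8_t *m, const q8_t *x, q8_t *y, int rows, int cols) {
        for (int r = 0; r < rows; r++)
            y[r] = vdot(&m[(long)r * cols], x, cols);
    }

The split mirrors the scalar/vector/matrix layering that the article compares to BLAS.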
Forward computation uses Q8 fixed‑point numbers, while back‑propagation uses Q15. Multiplying a Q8 by a Q15 yields a Q23 result that can be scaled back to Q15 with a single ASHC #-8 instruction, keeping the backward‑propagation cost comparable to forward computation and giving gradient precision 128 × the activation precision.
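In C terms, the trick looks roughly like this; on the PDP‑11 the single ASHC #-8 instruction plays the role of the right shift.

    #include <stdint.h>

    /* Backward-pass multiply: a Q8 activation times a Q15 gradient gives a
       Q23 product in 32 bits; shifting right by 8 rescales it to Q15. */
    static int16_t mul_q8_q15(int16_t act_q8, int16_t grad_q15) {
        int32_t prod_q23 = (int32_t)act_q8 * (int32_t)grad_q15;  /* Q8 * Q15 = Q23 */
        return (int16_t)(prod_q23 >> 8);                         /* Q23 -> Q15     */
    }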
Training Outcome
After the optimizations, the model converges in 350 steps, reducing total training time on the developer’s PDP‑11/34A to about 5.5 minutes. The code is loaded directly into memory via the console rather than using a physical paper‑tape reader.
Prototype and Verification
Before writing assembly, the author prototyped the arithmetic in Sheaf, a functional ML framework with built‑in observability. Sheaf reduced code size by roughly one‑third, provided stronger correctness guarantees, and allowed per‑tensor tracking of shape, range, and execution time. An example shows a range guard catching a missing >>8 shift.
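Sheaf's actual API is not shown in the article, so the sketch below only captures the idea of such a guard in C: an assertion over a tensor's expected Q8 range, which a product left un‑shifted (the missing >>8) would violate.

    #include <assert.h>
    #include <stdint.h>

    /* Conceptual range guard (not Sheaf's real interface): every element of a
       Q8 tensor must stay inside the declared range. A Q8*Q8 product that was
       never shifted back down by 8 lands far outside it and trips the guard. */
    static void check_q8_range(const int16_t *t, int n, int16_t lo, int16_t hi) {
        for (int i = 0; i < n; i++)
            assert(t[i] >= lo && t[i] <= hi);
    }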
Implementation Details
Because the PDP‑11 lacks a floating‑point unit, expensive functions like exp and log are replaced by pre‑computed lookup tables. The softmax uses a 256‑entry Q8 table (EXPTBL) mapping index i to exp(−i/32), and proceeds in three steps: (1) subtract the maximum score for numerical stability, (2) divide the difference by 8 to obtain a table index (clamped to [0, 255]), and (3) look up the exponential values, sum them, and divide each by the sum using FXDIV.

The cross‑entropy loss is computed every 50 steps using a 257‑entry Q12 table (LOGTBL) that maps x to −ln(x/256) × 4096. Accumulated loss values are kept in a 32‑bit register and scaled back with ASHC #-3, providing four‑decimal‑place precision.
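The softmax procedure described above can be sketched in C as follows; the table contents match the exp(−i/32) definition, while the initialization code and names are only illustrative.

    #include <math.h>
    #include <stdint.h>

    #define N_EXP 256

    static int16_t EXPTBL[N_EXP];   /* Q8 values of exp(-i/32), i = 0..255 */

    /* The real table is precomputed; here it is filled at startup for clarity. */
    static void init_exptbl(void) {
        for (int i = 0; i < N_EXP; i++)
            EXPTBL[i] = (int16_t)(exp(-i / 32.0) * 256.0 + 0.5);
    }

    /* Softmax over Q8 scores using the lookup table. Dividing a Q8 difference
       by 8 gives the table index, since (diff/256) = (index/32) when
       index = diff >> 3. */
    static void softmax_q8(const int16_t *score, int16_t *prob, int n) {
        int16_t mx = score[0];
        for (int i = 1; i < n; i++)
            if (score[i] > mx) mx = score[i];

        int32_t sum = 0;
        for (int i = 0; i < n; i++) {
            int32_t idx = (mx - score[i]) >> 3;   /* Q8 difference -> table index */
            if (idx > 255) idx = 255;             /* clamp to the table           */
            prob[i] = EXPTBL[idx];
            sum += prob[i];
        }
        for (int i = 0; i < n; i++)               /* normalize; FXDIV's role here */
            prob[i] = (int16_t)(((int32_t)prob[i] << 8) / sum);
    }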
Memory Layout
The entire binary occupies 19.2 KB. The 1,216 parameters are stored three times over (as a Q16 accumulator, a Q8 forward copy, and a Q15 gradient), consuming 9.6 KB. The remaining memory holds code, lookup tables, and runtime data.
Build and Run
Building requires the MACRO‑11 assembler and the obj2bin tool to produce a loadable binary. The program can run on a real PDP‑11 with EIS support and at least 32 KB of core memory, on the author’s ll‑34 cycle‑accurate PDP‑11/34 simulator, via a WebAssembly port, or on SIMH (though SIMH’s timing is not cycle‑accurate).