How Linear Attention Learns “Write‑Before‑Think”: Parallel Multi‑Step Memory Writes with PRISM
PRISM shows that linear‑attention models can adopt a “write‑before‑think” paradigm. It reconstructs the multi‑step iteration of Test‑Time Training, in which each memory update is the product of a step size, a residual, and a direction, and it computes those writes in parallel. The result is Transformer‑level quality with up to 174× higher throughput, obtained through parallel scan and fused kernels.
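
To make the step size × residual × direction update concrete, here is a minimal NumPy sketch of a generic delta‑rule write into a linear memory matrix. It is an illustration under stated assumptions, not PRISM's actual formulation: the function names, the memory shape, the unit‑norm key, and the fixed step sizes are all hypothetical choices made for this example. It also shows that, in this toy single‑key setting, several sequential writes collapse into one closed‑form write, which hints at why multi‑step updates need not be computed one after another.

```python
import numpy as np


def multi_step_write(S, k, v, step_sizes):
    """Sequentially write the pair (k, v) into memory S, one step per step size.

    Hypothetical helper for illustration; not taken from PRISM.
    """
    for eta in step_sizes:
        residual = v - S @ k                   # what the memory still gets wrong for key k
        S = S + eta * np.outer(residual, k)    # step size x residual x direction
    return S


def closed_form_write(S, k, v, step_sizes):
    """Same result as multi_step_write in a single update.

    Assumes k has unit norm, so each step scales the residual by (1 - eta).
    """
    shrink = np.prod([1.0 - eta for eta in step_sizes])
    residual0 = v - S @ k
    return S + (1.0 - shrink) * np.outer(residual0, k)


rng = np.random.default_rng(0)
d_k, d_v = 4, 3
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)            # unit-norm key (assumption of this sketch)
v = rng.normal(size=d_v)
S0 = np.zeros((d_v, d_k))         # empty memory

S1 = multi_step_write(S0, k, v, [0.5])
S3 = multi_step_write(S0, k, v, [0.5, 0.5, 0.5])
S3_closed = closed_form_write(S0, k, v, [0.5, 0.5, 0.5])

err1 = np.linalg.norm(v - S1 @ k)   # read-out error after one write
err3 = np.linalg.norm(v - S3 @ k)   # read-out error after three writes
```

With a unit‑norm key, each write shrinks the read‑out error by a factor of (1 − η), so three writes at η = 0.5 leave one eighth of the original error, and `S3_closed` matches `S3` without running the loop.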
