Bypassing BPTT: MIT’s SMT Puts RNNs on the Parallel Training Path

This article reviews MIT’s Supervised Memory Training (SMT) and its DAgger extension (DMT). Both replace traditional back-propagation through time (BPTT) with a Transformer-based teacher that supervises the RNN's memory one step at a time. The result is parallel-friendly training and, according to the article, superior long-sequence performance on synthetic benchmarks, TinyStories, and pixel-wise image generation.

BPTT · DMT · RNN

