Bypassing BPTT: MIT’s SMT Puts RNNs on the Parallel Training Path

This article reviews MIT’s Supervised Memory Training (SMT) and its DAgger-based extension, DMT. Instead of back-propagation through time, SMT uses a Transformer-based teacher to supervise an RNN’s memory one step at a time. The result is parallel-friendly training and stronger long-sequence performance on synthetic benchmarks, TinyStories, and pixel-wise image generation.

Machine Learning Algorithms & Natural Language Processing

RNNs have been eclipsed by Transformers because the latter support parallel training and stable long-range credit assignment. RNNs instead rely on back-propagation through time (BPTT), which unrolls the computation graph across the entire sequence. That unrolling causes vanishing or exploding gradients and forces training to proceed step by step, which limits parallelism.
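The gradient problem can be illustrated with a toy scalar recurrence. The sketch below is not from the paper: it assumes a linear RNN h_t = w * h_{t-1} + x_t, for which the gradient of the final state with respect to the initial state is simply w multiplied by itself once per time step.

```python
# Toy scalar RNN h_t = w * h_{t-1} + x_t (an illustrative assumption,
# not the model used in the paper).

def bptt_gradient(w: float, steps: int) -> float:
    """Gradient of h_T w.r.t. h_0: the per-step Jacobian w, multiplied across all steps."""
    grad = 1.0
    for _ in range(steps):
        grad *= w
    return grad

for w in (0.9, 1.1):
    # A weight slightly below 1 shrinks the gradient toward zero,
    # a weight slightly above 1 blows it up, and both effects compound with length.
    print(w, [bptt_gradient(w, steps) for steps in (10, 100, 1000)])
```

Real RNNs multiply Jacobian matrices rather than scalars, but the same compounding over the unrolled sequence applies.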

MIT researchers propose Supervised Memory Training (SMT), which decouples learning what to remember from learning how to update the memory. First, a Transformer encoder-decoder teacher generates a memory state for each time step. The RNN is then trained to predict the next memory state from the current one in a single supervised step, eliminating the need for full BPTT during pre-training.
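A minimal sketch of the one-step supervision described above, under stated assumptions: the cell `rnn_cell`, the squared-error loss, and the toy numbers are all hypothetical stand-ins, not the paper's architecture or data. The point it shows is that every step starts from a teacher memory, so no step depends on the RNN's own earlier output.

```python
import math

def rnn_cell(memory, x, w_mem, w_in):
    """Hypothetical non-gated cell: tanh(w_mem @ memory + w_in * x)."""
    return [
        math.tanh(sum(w * m for w, m in zip(row, memory)) + w_x * x)
        for row, w_x in zip(w_mem, w_in)
    ]

def one_step_losses(teacher_memories, inputs, w_mem, w_in):
    """Squared error between the predicted and the teacher's next memory, per step.

    teacher_memories[t] is the teacher's memory before reading inputs[t],
    so the list has one more entry than inputs.
    """
    losses = []
    for t, x in enumerate(inputs):
        predicted = rnn_cell(teacher_memories[t], x, w_mem, w_in)
        target = teacher_memories[t + 1]
        losses.append(sum((p - q) ** 2 for p, q in zip(predicted, target)))
    return losses

# Toy values standing in for the teacher's output (made up for illustration).
teacher_memories = [[0.0, 0.0], [0.5, -0.1], [0.2, 0.3], [-0.4, 0.1]]
inputs = [1.0, -1.0, 0.5]
w_mem = [[0.1, 0.0], [0.0, 0.1]]
w_in = [0.5, -0.2]

losses = one_step_losses(teacher_memories, inputs, w_mem, w_in)
print(losses)
```

Because the loop body reads only teacher values, the iterations could run in any order or as one batched operation, which is what makes this stage parallel-friendly.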

SMT treats the optimal memory as a permutation-invariant function over time-stamped events, which is the kind of function a Transformer can estimate. The training objective combines three parts: (1) the decoder predicts future tokens from the memory, (2) the RNN predicts the next memory, and (3) a uniformity loss prevents the memories from collapsing to a single point. Because each memory update is supervised directly, the credit-assignment chain is short and less sensitive to sequence length.
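How the three terms might be combined can be sketched as follows. The summary does not give the exact formulas, so everything here is an assumption: cross-entropy for the decoder term, mean squared error for the memory term, a pairwise Gaussian-potential uniformity term in the style of Wang and Isola, and hypothetical weights `w_memory` and `w_uniform`.

```python
import math

def cross_entropy(probs, target_index):
    """Decoder term: negative log-likelihood of the correct future token."""
    return -math.log(probs[target_index])

def memory_mse(predicted, target):
    """RNN term: mean squared error against the teacher's next memory."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

def uniformity(memories, t=2.0):
    """Assumed stand-in for the uniformity term: log of the mean Gaussian
    potential over all pairs of memories. It equals 0 when all memories are
    identical (collapse) and grows more negative as they spread apart."""
    potentials = [
        math.exp(-t * sum((a - b) ** 2 for a, b in zip(u, v)))
        for i, u in enumerate(memories)
        for v in memories[i + 1:]
    ]
    return math.log(sum(potentials) / len(potentials))

def smt_loss(probs, target_index, predicted, target, memories,
             w_memory=1.0, w_uniform=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (cross_entropy(probs, target_index)
            + w_memory * memory_mse(predicted, target)
            + w_uniform * uniformity(memories))

collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(uniformity(collapsed), uniformity(spread))
```

Minimizing the uniformity term pushes memories apart, which is one plausible way to discourage the collapse the article mentions.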

After SMT, the authors introduce DAgger Memory Training (DMT), which lets the RNN continue training on the memory distribution it generates itself while staying aligned to the teacher’s trajectory. Because the RNN must roll out its own memories, DMT is not fully parallel; it serves as a lightweight fine-tuning stage.
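The DAgger idea can be sketched as below: the states come from the student's own rollout, and the labels come from the teacher. This is a loss computation only, with a hypothetical cell and made-up weights; it omits the gradient step and any dataset aggregation, and it is not the paper's implementation.

```python
import math

def rnn_cell(memory, x, w_mem, w_in):
    """Hypothetical non-gated cell: tanh(w_mem @ memory + w_in * x)."""
    return [
        math.tanh(sum(w * m for w, m in zip(row, memory)) + w_x * x)
        for row, w_x in zip(w_mem, w_in)
    ]

def rollout(initial_memory, inputs, w_mem, w_in):
    """Run the cell sequentially, feeding each output back in as the next memory."""
    memories = [initial_memory]
    for x in inputs:
        memories.append(rnn_cell(memories[-1], x, w_mem, w_in))
    return memories

def dmt_losses(teacher_memories, inputs, w_mem, w_in):
    """Score the student's self-generated memories against the teacher's, step by step."""
    student = rollout(teacher_memories[0], inputs, w_mem, w_in)
    return [
        sum((s - q) ** 2 for s, q in zip(student[t + 1], teacher_memories[t + 1]))
        for t in range(len(inputs))
    ]

inputs = [1.0, -1.0, 0.5]
# Toy "teacher": the same cell with different, made-up weights.
teacher_w_mem, teacher_w_in = [[0.3, 0.0], [0.0, 0.3]], [0.6, -0.1]
teacher_memories = rollout([0.0, 0.0], inputs, teacher_w_mem, teacher_w_in)

student_w_mem, student_w_in = [[0.1, 0.0], [0.0, 0.1]], [0.5, -0.2]
losses = dmt_losses(teacher_memories, inputs, student_w_mem, student_w_in)
print(losses)
```

The rollout inside `dmt_losses` is the sequential part: step t cannot start until the student has produced its memory for step t - 1.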

The paper analyses the trade-off between training cost and inference cost. Transformers train in parallel but incur inference cost that grows with sequence length, whereas BPTT-trained RNNs train sequentially but keep a constant-size memory and low inference cost. SMT recombines these trade-offs: training inherits the Transformer’s advantages in sequential operations and credit assignment, while inference keeps the RNN’s fixed-size memory and single-step computation.
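The inference-side difference can be made concrete with a rough count of the state each model carries. This sketch assumes standard key/value caching for the Transformer; the function names and the sizes passed in are hypothetical.

```python
def transformer_cache_size(tokens_seen: int, d_model: int, layers: int) -> int:
    """Floats in a key/value cache: one key and one value vector per token per layer."""
    return 2 * layers * tokens_seen * d_model

def rnn_memory_size(tokens_seen: int, d_memory: int) -> int:
    """Floats in a recurrent memory: the same no matter how many tokens were read."""
    return d_memory

for tokens in (1_000, 100_000):
    print(tokens,
          transformer_cache_size(tokens, d_model=64, layers=4),
          rnn_memory_size(tokens, d_memory=256))
```

The cache grows linearly with the number of tokens read, while the recurrent memory stays at `d_memory` floats throughout.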

Experiments on five synthetic tasks (gradient stability, memory capacity, state tracking, associative recall, in-context learning) show that SMT→DMT consistently outperforms BPTT, with the advantage most pronounced at longer sequence lengths. On pixel-wise generation tasks (MNIST, Sketchy), non-gated RNNs trained with SMT→DMT preserve digit structure better than BPTT-trained GRUs. On TinyStories and MNIST language modeling, SMT-based models require less sequential compute than BPTT while matching or surpassing its data efficiency. GRU backbones, however, suffer memory-space collapse under SMT.

The authors acknowledge that SMT relies on a Transformer teacher, whose expressive power is limited, and that DMT is not fully parallel. Even so, they argue that the approach brings nonlinear RNNs back into the scaling discussion: fixed-size memory updates may be crucial for future systems that cannot store their entire history.

Two limitations remain: there is no evidence yet that SMT scales to large-scale LLM pre-training, and surpassing the teacher’s performance may require additional post-training or BPTT fine-tuning.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Machine Learning Algorithms & Natural Language Processing
