Tagged articles
1 article
Machine Learning Algorithms & Natural Language Processing
May 12, 2026 · Artificial Intelligence

Breaking Off‑Policy Shift: Bengio’s TBA Decouples Sampling and Learning for 50× Faster LLM RL

Trajectory Balance with Asynchrony (TBA) separates sample generation (Searcher) from model updates (Trainer), uses a trajectory‑balance objective to incorporate off‑policy data, and achieves up to 50× speedup in large‑model RL post‑training while preserving or improving performance on math reasoning, preference fine‑tuning, and red‑team tasks.

LLM · asynchronous training · large-language-models
10 min read