Breaking Off‑Policy Shift: Bengio’s TBA Decouples Sampling and Learning for 50× Faster LLM RL
Trajectory Balance with Asynchrony (TBA) separates sample generation (Searcher) from model updates (Trainer), uses a trajectory‑balance objective to incorporate off‑policy data, and achieves up to 50× speedup in large‑model RL post‑training while preserving or improving performance on math reasoning, preference fine‑tuning, and red‑team tasks.
Problem Background
LLM post‑training with on‑policy reinforcement learning (PPO, RLOO, GRPO) is limited by token‑by‑token rollout generation; the trainer must wait for rollouts, and after each policy update the collected trajectories become off‑policy, reducing data reuse.
Architecture (TBA)
Trajectory Balance with Asynchrony (TBA) decouples exploration and learning into two pipelines.
Searcher: keeps a slightly stale copy of the model, samples prompts from a replay buffer, generates full‑length responses, and stores them locally.
Trainer: continuously draws batches from a global replay buffer and updates the policy without waiting for rollouts.
Every k optimization steps the Trainer broadcasts its latest weights to all Searchers and merges the Searchers’ local buffers into the global buffer. This periodic synchronization limits policy drift and improves cluster utilization.
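A minimal single‑process sketch of this control flow is shown below. The helper names (generate_response, train_step) and the buffer layout are hypothetical stand‑ins, not the released implementation; in the real system the Searchers and the Trainer run as separate asynchronous workers on different GPUs.

```python
import random

# Hypothetical stand-ins, not the paper's actual code.
def generate_response(policy_version, prompt):
    """Searcher rollout with a possibly stale policy copy."""
    return {"prompt": prompt, "response": f"ans-{random.random():.3f}",
            "reward": random.random(), "policy_version": policy_version}

def train_step(weights, batch):
    """Trainer update from (possibly off-policy) buffered data -- placeholder."""
    return weights + 1

prompts = ["q1", "q2", "q3"]
global_buffer, local_buffer = [], []
trainer_weights = searcher_weights = 0
SYNC_EVERY = 4  # the "k" in the text

for step in range(1, 21):
    # Searcher: keep generating with the stale copy, store results locally.
    local_buffer += [generate_response(searcher_weights, random.choice(prompts))
                     for _ in range(8)]

    # Trainer: draw a batch from the global buffer; never wait for fresh rollouts.
    if global_buffer:
        batch = random.sample(global_buffer, min(16, len(global_buffer)))
        trainer_weights = train_step(trainer_weights, batch)

    # Periodic synchronization: broadcast weights, merge local buffers.
    if step % SYNC_EVERY == 0:
        searcher_weights = trainer_weights
        global_buffer += local_buffer
        local_buffer = []
```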
Trajectory‑Balance Objective
TBA adopts the Trajectory Balance (TB) objective from GFlowNets, which is off‑policy by construction: any trajectory can be used for training provided the sampling distribution has full support. The implementation uses the VarGrad variant of TB, which estimates the log‑partition function from multiple responses to the same prompt, removing the need for a separately learned normalizing constant. When the data are on‑policy, the loss reduces to a REINFORCE‑like form with a mean baseline; in the asynchronous off‑policy regime it remains stable, unlike the importance‑sampling corrections used by traditional on‑policy methods.
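As a rough illustration of the objective, here is a minimal sketch of a VarGrad‑style TB loss for K responses to one prompt. The function name and tensor layout are illustrative, and it assumes the log‑reward term already folds in any reference‑policy / KL component; it is not the paper's code.

```python
import torch

def vargrad_tb_loss(logprob_policy: torch.Tensor, log_reward: torch.Tensor) -> torch.Tensor:
    """VarGrad-style trajectory-balance loss for K responses to a single prompt.

    logprob_policy: (K,) summed token log-probs of each response under the current policy.
    log_reward:     (K,) log-reward of each response (assumed to already include any
                    reference-policy / KL-regularization term).

    Each zeta_i is an estimate of the prompt's log-partition function log Z, so the
    empirical variance of the zetas replaces a separately learned normalizing constant.
    """
    zeta = log_reward - logprob_policy          # (K,) per-response log Z estimates
    return ((zeta - zeta.mean()) ** 2).mean()   # variance; its gradient matches
                                                # REINFORCE with a mean baseline
                                                # when the data are on-policy

# Toy usage: K = 4 responses to one prompt.
loss = vargrad_tb_loss(torch.randn(4), torch.randn(4))
```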
Dynamic Replay‑Buffer Sampling (MOP)
To avoid both wasteful uniform sampling and the mode collapse caused by purely reward‑prioritized sampling, TBA introduces a mixed sampling scheme called Most‑On‑Policy Probability (MOP). For each batch (a sampling sketch follows this list):
With probability m, select trajectories that entered the buffer during the most recent synchronization (those closest to the current policy).
With probability 1 − m, sample from the entire buffer using a softmax over reward scores combined with uniform sampling, preserving diversity.
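Below is a minimal sketch of this mixed sampling. The buffer fields (reward, sync_id) are hypothetical, and the 50/50 blend between the reward softmax and uniform sampling, like the softmax temperature, is an assumed implementation detail rather than the paper's exact choice.

```python
import random
import numpy as np

def sample_batch(buffer, batch_size, m=0.5, latest_sync_id=None, temperature=1.0):
    """Mixed replay-buffer sampling in the spirit of MOP (illustrative, not the paper's code).

    Each buffer entry is assumed to be a dict with 'reward' and 'sync_id' keys.
    """
    # Branch 1 (probability m): most on-policy data, i.e. trajectories added at the
    # most recent synchronization.
    if latest_sync_id is not None and random.random() < m:
        pool = [t for t in buffer if t["sync_id"] == latest_sync_id] or buffer
        return random.sample(pool, min(batch_size, len(pool)))

    # Branch 2 (probability 1 - m): whole buffer, reward-softmax blended with uniform.
    rewards = np.array([t["reward"] for t in buffer], dtype=np.float64) / temperature
    softmax = np.exp(rewards - rewards.max())
    softmax /= softmax.sum()
    probs = 0.5 * softmax + 0.5 / len(buffer)   # assumed blend; keeps low-reward
                                                # trajectories reachable for diversity
    idx = np.random.choice(len(buffer), size=min(batch_size, len(buffer)),
                           replace=False, p=probs)
    return [buffer[i] for i in idx]
```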
Experimental Evaluation
Three post‑training tasks were evaluated on a 4 × A100 cluster.
Mathematical reasoning (GSM8K): wall‑clock time reduced by ≈50× relative to VinePPO; Pass@1 accuracy improved by 1.2–1.8%.
Preference fine‑tuning (TL;DR summarisation): achieved a superior Pareto front between KL‑regularised perplexity and win rate, yielding higher‑quality summaries at lower compute cost.
Automatic red‑team attacks (sparse reward): wall‑clock time shortened by up to 7× compared with a non‑distributed synchronous GFlowNet baseline; increasing the number of Searchers consistently raised attack success rate and prompt diversity.
An ablation on GSM8K showed that increasing the number of responses per query (e.g., K = 20 → K = 40) lowers gradient variance and improves training stability. A simplified variant, TBA′ (built on PRIME‑RL), was also tested with Qwen 2.5 7B on the MATH benchmark; it remained stable under a 10‑step off‑policy setting in which Dr. GRPO exhibited noticeable oscillations.
Discussion
Using a trajectory‑level objective introduces higher gradient variance, which the authors mitigate by aggregating more responses per query. Consequently, TBA imposes stricter requirements on batch construction and sampling strategies.
References
Paper: Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post‑Training, arXiv:2503.18929.
Code: https://github.com/bbartoldson/TBA
