Is RL Dead in LLM Post-Training? MIT’s RandOpt Challenges Traditional Methods
The MIT CSAIL paper introduces RandOpt, a single-step, gradient-free, fully parallel post-training algorithm that adds Gaussian noise to pretrained LLM weights and ensembles the results, matching or surpassing PPO/GRPO performance by exploiting the dense "neural thickets" that emerge as model scale grows.
In current large-language-model (LLM) development, the post-training stage is usually seen as essential for endowing models with specific abilities. It typically relies on reinforcement-learning methods such as PPO, GRPO, and RLHF, or on evolutionary strategies (ES), all of which adjust the weights through many iterative optimization steps.
MIT CSAIL researchers Yulu Gan and Phillip Isola challenge this view with a new method called RandOpt. RandOpt replaces iterative optimization with a single-step random perturbation: it adds Gaussian noise to a pretrained model's weights to create many noisy copies, evaluates each on a small validation set, selects the top-K models, and aggregates their predictions by majority vote.
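The four steps above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the "model" is a linear classifier over a weight vector, and `score`, `randopt`, and `ensemble_predict` are names chosen here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(weights, X, y):
    """Validation accuracy of a toy linear classifier (stand-in for an LLM eval)."""
    preds = (X @ weights > 0).astype(int)
    return float((preds == y).mean())

def randopt(base_weights, X_val, y_val, n_copies=200, top_k=10, sigma=0.05):
    # 1. Draw many Gaussian-perturbed copies of the pretrained weights.
    copies = [base_weights + sigma * rng.standard_normal(base_weights.shape)
              for _ in range(n_copies)]
    # 2. Score each copy on a small validation set (embarrassingly parallel).
    scores = [score(w, X_val, y_val) for w in copies]
    # 3. Keep the top-K copies by validation score.
    top = sorted(range(n_copies), key=scores.__getitem__, reverse=True)[:top_k]
    return [copies[i] for i in top]

def ensemble_predict(models, X):
    # 4. Aggregate the K models' predictions by majority vote.
    votes = np.stack([(X @ w > 0).astype(int) for w in models])
    return (votes.mean(axis=0) > 0.5).astype(int)
```

Because no gradients flow, the expensive part (step 2) is N independent evaluations, which is what makes the method fully parallel.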
The authors argue that pretrained weight spaces contain a dense collection of task‑specific experts—a phenomenon they term "Neural Thickets." Small models exhibit a "needle‑in‑a‑haystack" regime where good solutions are extremely sparse, requiring gradient‑based search. In contrast, large, well‑pretrained models form dense thickets of experts, allowing random sampling to quickly discover high‑performing solutions.
Two core metrics quantify this effect:
Solution density: the probability that a random perturbation improves performance. It follows a clear scaling law: larger models exhibit a higher density of good solutions.
Solution diversity: perturbed models excel on different tasks, indicating that improvements are task-specific rather than generic. The paper introduces a "spectral inconsistency" measure to capture this diversity and shows that it rises monotonically with model size.
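The density metric lends itself to a simple Monte Carlo estimate: sample perturbations and count how often the score improves. The sketch below assumes a generic `score_fn` callable; the quadratic toy objective in the usage is an illustration, not the paper's evaluation.

```python
import numpy as np

rng = np.random.default_rng(1)

def solution_density(weights, score_fn, n_samples=500, sigma=0.05):
    """Estimate the fraction of Gaussian perturbations that beat the base score."""
    base = score_fn(weights)
    hits = sum(
        score_fn(weights + sigma * rng.standard_normal(weights.shape)) > base
        for _ in range(n_samples)
    )
    return hits / n_samples
```

In the paper's framing, this fraction grows with model scale; in a needle-in-a-haystack regime it is near zero, so random search is hopeless without a dense thicket.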
To illustrate the mechanisms, the authors conduct a 1‑D signal prediction experiment with multilayer perceptrons (MLPs). They identify three phases:
No pre-training (needle-in-a-haystack): random noise has negligible effect.
Single-task pre-training (plateau): performance caps on the trained task, with no diversity.
Mixed-task pre-training (thicket emergence): only when the model sees diverse signals does the weight space develop a thicket of specialized experts.
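The thicket-emergence phase can be caricatured in two dimensions: if mixed-task pretraining leaves the weights between two task optima, random perturbations split into task-specific specialists. The quadratic task losses below are an assumption for illustration only, not the paper's MLP setup.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two toy task optima; "mixed-task pretrained" weights sit between them.
w_a, w_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
base = 0.5 * (w_a + w_b)

def task_loss(w, target):
    """Assumed quadratic loss: distance from a task's optimum."""
    return float(np.sum((w - target) ** 2))

# Perturbed copies drift toward one task or the other, never both:
copies = [base + 0.3 * rng.standard_normal(2) for _ in range(200)]
prefers_a = sum(task_loss(w, w_a) < task_loss(w, w_b) for w in copies)
```

Roughly half the copies end up better at each task, giving a diverse pool of specialists rather than one generic improvement direction, which is what makes top-K selection plus voting effective.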
RandOpt’s simplicity translates into strong empirical results. Across models ranging from 0.5 B to 8 B parameters (Qwen, Llama, OLMo‑3) and tasks covering mathematical reasoning (Countdown, GSM8K), code generation (MBPP), creative writing (ROCStories), and chemistry (USPTO), RandOpt matches or exceeds PPO, GRPO, and ES when using comparable FLOPs. It also reduces wall‑clock time dramatically: on a 200‑GPU GH200 cluster, training OLMo‑3‑7B‑Instruct with N=2000 and K=50 completes in 3.2 minutes, achieving 70 % accuracy on Countdown.
The method extends to vision‑language models. Perturbing only the language component of a 3 B‑parameter Qwen2.5‑VL‑Instruct model improves GQA visual‑reasoning accuracy by 12.4 %.
RandOpt does require K forward passes at inference, which can be costly for deployment. To mitigate this, the authors propose a distillation step: they generate thousands of reasoning trajectories from the top‑50 RandOpt models, select difficult samples where the base model errs, and fine‑tune the base model for two supervised rounds. On GSM8K, the distilled single model reaches 84.3 % accuracy, close to the 87.1 % of the full ensemble, while costing only ~2 % of RandOpt’s training budget.
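The sample-selection logic behind this distillation step might look like the following sketch. The predictor callables and function name are placeholders introduced here, not the paper's API; in practice the kept pairs would feed a standard supervised fine-tuning run.

```python
def select_hard_samples(questions, golds, base_predict, ensemble_predict):
    """Keep (question, answer) pairs the ensemble solves but the base model misses.

    These are the 'difficult samples where the base model errs', the most
    informative targets for distilling the ensemble into a single model.
    """
    pairs = []
    for q, gold in zip(questions, golds):
        if base_predict(q) != gold and ensemble_predict(q) == gold:
            pairs.append((q, gold))
    return pairs
```

Filtering to ensemble-correct, base-wrong cases concentrates the fine-tuning budget on the skills the thicket added, which is why the distilled model recovers most of the ensemble's accuracy so cheaply.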
Error‑attribution analysis on GSM8K shows that 19.0 % of RandOpt’s gain comes from fixing output‑format mismatches ("Format Thicket") and 12.3 % from genuine reasoning improvements ("Reasoning Thicket"), confirming the presence of distinct expert skills within the thicket.
The authors also observe analogous "Color Thickets" in text‑to‑image diffusion models (e.g., Stable Diffusion XL), where local regions of weight space preferentially generate images with specific palettes or styles, demonstrating the broader relevance of thicket phenomena.
Overall, RandOpt reveals that post‑training can be reframed as a selection and ensemble problem over a rich landscape of pretrained experts, rather than a gradient‑driven search for new capabilities.