MIT Study Shows Adding Noise to Large Models Can Replace GRPO/PPO Tuning

A new MIT paper reveals that pretrained large models already contain many hidden expert submodels, and that a simple one‑step Gaussian perturbation (RandOpt) can locate and ensemble these experts to achieve performance comparable to or better than traditional GRPO/PPO tuning, especially as model size grows.

Machine Learning Algorithms & Natural Language Processing

MIT researchers propose that extensive multi‑task pre‑training already embeds many task‑specific expert submodels in the vicinity of a large language model’s weights.

Traditional fine‑tuning methods such as gradient descent or reinforcement‑learning‑based algorithms (GRPO, PPO) treat adaptation as a slow, iterative optimization. The new paper shows that a simple one‑step Gaussian perturbation can “discover” these experts without any gradient updates.

They introduce the RandOpt procedure:

Apply N random Gaussian perturbations (drawn at several noise scales σ) to the pretrained weights, producing N candidate models.

Evaluate each candidate on a small validation set and select the top K performers.

During inference, let the K selected models answer a query and aggregate the results by majority voting.
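The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the weights are a plain NumPy vector rather than an LLM checkpoint, and `evaluate` is a hypothetical stand-in for scoring a candidate on the small validation set.

```python
import numpy as np

def randopt_select(base_weights, evaluate, n_perturb=100,
                   sigmas=(0.01, 0.05, 0.1), top_k=5, seed=0):
    """Score n_perturb Gaussian perturbations of the base weights with
    `evaluate` and return the top_k candidate weight vectors."""
    rng = np.random.default_rng(seed)
    scored = []
    for _ in range(n_perturb):
        sigma = rng.choice(sigmas)  # sample one of several noise scales
        candidate = base_weights + rng.normal(0.0, sigma,
                                              size=base_weights.shape)
        scored.append((evaluate(candidate), candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # higher is better
    return [w for _, w in scored[:top_k]]

def majority_vote(answers):
    """Aggregate the selected experts' answers by plurality vote."""
    values, counts = np.unique(np.asarray(answers), return_counts=True)
    return values[np.argmax(counts)]
```

In a full-scale setting, `base_weights` would be a flattened model checkpoint and `evaluate` a validation-set accuracy; at inference time each of the `top_k` weight vectors answers the query and `majority_vote` aggregates the results.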

Experiments on the Qwen2.5 family (0.5B–32B parameters) use 1,000 random perturbations per model and a 2‑D projection of the resulting performance landscape. The visualizations show that larger models have denser high‑performance regions (red zones) where many perturbations improve accuracy, whereas perturbations of smaller models mostly degrade it.

[Figure: 2‑D projection of the performance landscape across model sizes]

Key empirical findings:

RandOpt achieves accuracy comparable to or higher than GRPO/PPO on mathematical reasoning, code generation, story writing, and chemistry tasks for pure‑language models.

For vision‑language models, accuracy rises from 56.6% to 69.0%.

Similar “neural thicket” effects appear in image‑diffusion models, where certain weight regions bias generation toward specific styles.

Performance gains increase with model size and with the number of random perturbations.

Limitations noted by the authors include:

Reliance on high‑quality pre‑training; RandOpt cannot teach new skills beyond the pre‑training data.

Ensembling K models raises inference cost; distillation can mitigate but is not universally applicable.

The method works best for tasks with clear correct answers; open‑ended generation (e.g., story writing, molecular design) may suffer from “specialist” bias.

The paper’s authors, Yulu Gan (MIT CSAIL PhD student) and Phillip Isola (MIT EECS associate professor), provide the full manuscript (arXiv 2603.12228), code (https://github.com/sunrainyg/RandOpt), and project page (https://thickets.mit.edu/).
