Why Evolution Strategies Beat Reinforcement Learning for Large‑Model Fine‑Tuning
This article reviews the paper “Evolution Strategies at Scale: LLM Fine‑Tuning Beyond Reinforcement Learning”, explaining how parameter‑space exploration via ES provides more stable, sample‑efficient, and reproducible fine‑tuning for billion‑parameter LLMs such as Qwen‑2.5 and LLaMA‑3, and detailing the algorithmic and engineering innovations that make full‑parameter ES practical.
Research Background
Recent work on LLM post‑training has largely treated it as a reinforcement‑learning (RL) problem, inserting PPO or GRPO into pipelines and exploring in the action space, that is, over the tokens the model samples. The paper “Evolution Strategies at Scale: LLM Fine‑Tuning Beyond Reinforcement Learning” revisits this assumption by moving exploration to the parameter space. It extends Evolution Strategies (ES) to full‑parameter fine‑tuning of billion‑scale models (Qwen‑2.5 and LLaMA‑3 families) and reports that ES is more stable, uses fewer samples, and requires almost no hyper‑parameter grid search.
From Basic ES to Scalable Full‑Parameter Implementation
Basic ES (Algorithm 1)
ES samples N perturbations ε_i ∼ 𝒩(0, σ²I), i = 1, …, N, of the entire parameter vector θ, evaluates the reward R(θ + ε_i) of each perturbed model, and moves the parameters along a reward‑weighted sum of the perturbations: θ ← θ + α·σ·∑_{i=1}^N w_i·ε_i, where α is the learning rate and the weights w_i are the normalized rewards (e.g., z‑scores). This procedure requires no backpropagated gradients, actor‑critic architecture, or advantage estimation.
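The update rule above can be tried on a toy problem. The following is a minimal NumPy sketch, not the paper's implementation: the function name `es_step`, the quadratic `toy_reward` that stands in for an LLM reward, and the values chosen for N, σ, and α are all illustrative assumptions.

```python
import numpy as np

def es_step(theta, reward_fn, rng, n_pop=50, sigma=0.1, alpha=0.2):
    """One ES update in the form written above (toy scale, NumPy only)."""
    # Sample N perturbations eps_i ~ N(0, sigma^2 I) over the full parameter vector.
    eps = sigma * rng.standard_normal((n_pop, theta.size))
    # Evaluate the reward of each perturbed parameter vector.
    rewards = np.array([reward_fn(theta + e) for e in eps])
    # z-score the rewards to obtain the weights w_i (small constant avoids 0/0).
    w = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # theta <- theta + alpha * sigma * sum_i w_i * eps_i
    return theta + alpha * sigma * (w @ eps)

# Hypothetical stand-in for an LLM reward: negative squared distance to a target.
target = np.array([1.0, -2.0, 0.5, 3.0, -1.0])

def toy_reward(params):
    return -float(np.sum((params - target) ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(5)
for _ in range(300):
    theta = es_step(theta, toy_reward, rng)
```

Because the rewards are z‑scored, only their ranking and relative spread matter, so constant factors in the update (such as σ) can equally be folded into α.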
Scalable Engineering (Algorithm 2)
Applying naïve ES to a model with billions of parameters would be prohibitive in memory and communication. The authors introduce seven engineering tricks that make ES practical at this scale:
Random‑seed replay: Store only the random seed for each perturbation; the RNG can be reset to reconstruct the exact noise, saving memory.
Process‑level parallel evaluation: Distribute perturbations across processes or devices, providing natural parallelism.
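Layer‑wise "perturb‑evaluate‑restore": Add the noise to the weights in place, one layer at a time, evaluate the perturbed model, then subtract the same noise to restore the weights, so the extra memory for noise scales with a single layer rather than the whole model.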
Reward z‑score normalization: Standardize rewards to remove scale differences across tasks and training stages.
Greedy decoding for evaluation: Use deterministic greedy decoding to eliminate variance from stochastic decoding.
Decomposed parameter update: Accumulate updates layer‑by‑layer, further reducing memory peaks.
Learning‑rate absorption: Fold the constant noise‑scale factor of the update into the learning rate, which simplifies the update rule and leaves one fewer quantity to tune.
Together these tricks trade extra computation for lower memory: noise is regenerated from seeds instead of being stored and is applied one layer at a time, while the perturbed models are evaluated in parallel. This enables stable full‑parameter fine‑tuning of models with up to 8 B parameters.
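Seed replay, in‑place perturbation with restoration, and the layer‑by‑layer update can be illustrated together at toy scale. This is a hedged NumPy sketch under stated assumptions, not the authors' code: the helper names `add_noise_in_place` and `es_iteration`, the list‑of‑arrays "model", the toy reward, and the 1/N scaling of the update are all hypothetical choices made for illustration.

```python
import numpy as np

def add_noise_in_place(layers, seed, scale):
    """Add scale * N(0, I) noise to every layer, regenerated from `seed`.

    Noise is drawn one layer at a time, so only one layer-sized noise
    array exists at any moment and only the integer seed must be kept.
    """
    rng = np.random.default_rng(seed)
    for layer in layers:  # fixed layer order => identical noise on replay
        layer += scale * rng.standard_normal(layer.shape)

def es_iteration(layers, reward_fn, seeds, sigma=0.05, alpha=0.05):
    rewards = []
    for seed in seeds:  # in a real system each seed could go to its own process
        add_noise_in_place(layers, seed, +sigma)   # perturb
        rewards.append(reward_fn(layers))          # evaluate
        add_noise_in_place(layers, seed, -sigma)   # restore (up to float rounding)
    rewards = np.asarray(rewards)
    weights = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Decomposed update: replay each seed and accumulate its weighted noise
    # layer by layer, never building a full-model update tensor.
    for seed, w_i in zip(seeds, weights):
        add_noise_in_place(layers, seed, alpha * w_i / len(seeds))
    return rewards

# Toy "model": two layers that should move towards all-ones weights.
layers = [np.zeros((4, 4)), np.zeros(4)]
goal = [np.ones((4, 4)), np.ones(4)]

def toy_reward(ls):
    return -sum(float(np.sum((a - b) ** 2)) for a, b in zip(ls, goal))

before = toy_reward(layers)
seed_rng = np.random.default_rng(0)
for _ in range(200):
    seeds = seed_rng.integers(0, 2**31 - 1, size=30).tolist()
    es_iteration(layers, toy_reward, seeds)
after = toy_reward(layers)
```

The cost of this design is that every perturbation's noise is generated three times (perturb, restore, update), which is the extra computation paid for not storing it.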
Behavior Metrics and KL Approximation
For tasks where the objective is behavioral (e.g., style or alignment) rather than pure accuracy, the paper evaluates two metrics:
Average reward (task success).
KL divergence to the base model (preservation of original capabilities). The KL is approximated with the estimator from Schulman (2020), which needs only the log‑probabilities of generated responses rather than an exact sum over all possible outputs.
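For reference, a low‑variance estimator described by Schulman (2020), often called k3, averages (r − 1) − log r over samples drawn from the fine‑tuned model, where r is the ratio of base‑model to fine‑tuned‑model probability. The sketch below assumes this is the estimator meant; the function name `kl_k3` and the two small categorical distributions are hypothetical and serve only to check the estimate against the exact KL.

```python
import numpy as np

def kl_k3(logp_new, logp_base):
    """Estimate KL(new || base) from samples drawn from the new model.

    logp_new / logp_base: log-probabilities that the fine-tuned and the
    base model assign to the same sampled items.
    """
    log_r = np.asarray(logp_base) - np.asarray(logp_new)  # log r, r = p_base / p_new
    # Mean of (r - 1) - log r; every term is non-negative.
    return float(np.mean(np.expm1(log_r) - log_r))

# Toy check on two categorical distributions (made-up numbers).
p_new = np.array([0.5, 0.3, 0.2])
p_base = np.array([0.4, 0.4, 0.2])
rng = np.random.default_rng(0)
x = rng.choice(3, size=100_000, p=p_new)
estimate = kl_k3(np.log(p_new[x]), np.log(p_base[x]))
exact = float(np.sum(p_new * np.log(p_new / p_base)))
```

With an LLM, the two inputs would be the per‑token log‑probabilities of the same generated response under the fine‑tuned and base models.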
Verifiable "Conciseness" Reward
In the "conciseness" benchmark each question comes with a shortest correct answer. The reward is higher the closer the length of the model’s output is to the length of this shortest answer, which encourages concise yet correct responses and guards against reward‑hacking in which shorter but wrong answers would be favored.
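One plausible form of such a reward is the negative absolute difference in length. The sketch below is an assumption for illustration only: the function name `length_reward` and the choice to measure length in characters are not taken from the paper.

```python
def length_reward(response: str, shortest_answer: str) -> float:
    """Hypothetical length-based reward: 0.0 is best, more negative is worse."""
    # Penalise any deviation from the reference length, in either direction.
    return -float(abs(len(response) - len(shortest_answer)))

exact_match = length_reward("4", "4")
verbose = length_reward("The answer is 4.", "4")
```

Because the penalty is symmetric, an output that is shorter than the reference is penalized just like one that is longer, so the reward cannot be increased by truncating answers indefinitely.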
Experiments and Results
Symbolic reasoning: ES dramatically improves accuracy on small models (e.g., Qwen‑2.5‑0.5 B accuracy rises from ~0.3 % to 14.4 %). As model size grows, ES maintains its advantage, achieving higher accuracy with a single unified hyper‑parameter set, whereas RL methods require per‑model grid searches.
Sample efficiency: When the training curves are compared at equal numbers of total evaluation samples, ES reaches the same performance as RL with roughly 20 % of the samples, thanks to variance reduction from population averaging.
Behavior alignment: In the Reward‑KL plane ES dominates GRPO, attaining higher reward at lower KL without an explicit KL penalty, and it does not exhibit reward‑hacking.
Stability and reproducibility: Across 0.5 B–8 B models ES shows consistent convergence, low variance across runs, and robust performance without per‑run hyper‑parameter tuning.
Conclusion
The study demonstrates that exploring the parameter space with Evolution Strategies offers a viable, often superior alternative to reinforcement‑learning‑based post‑training. ES provides stable, sample‑efficient, and reproducible fine‑tuning for long‑horizon, reward‑sparse tasks, and its engineering advances make it practical for billion‑parameter LLMs.
Key take‑aways:
Parameter‑space exploration can be more scalable and robust than action‑space RL.
Seven engineering tricks enable full‑parameter ES on billion‑scale models.
ES achieves higher accuracy, better sample efficiency, and avoids reward‑hacking without KL constraints.
Paper: https://arxiv.org/abs/2509.24372
Code: https://github.com/VsonicV/es-fine-tuning-paper
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
