LLMs Finally Derive Formulas: FunctionEvolve Boosts LLM‑SRBench Accuracy 3.6× and Scores Perfect on AI‑Feynman

FunctionEvolve represents formulas as abstract syntax trees (ASTs) so that a large language model can guide symbolic regression. It reaches 55.8% SA@1 on LLM‑SRBench, reported as a 3.6‑fold improvement, and a perfect 120/120 on AI‑Feynman. Component ablations support the value of structure‑aware generation, selection, mutation, and optimization.

Machine Heart

Problem

Symbolic regression searches an astronomically large space of candidate formulas. Naïve error‑minimization often yields over‑fitted, non‑interpretable expressions, and balancing numerical accuracy, simplicity, interpretability and extrapolation ability is difficult.

FunctionEvolve Framework

FunctionEvolve represents each candidate formula as an abstract syntax tree (AST) and couples four modules:

Generator: an LLM reads the task description and produces a set of seed formulas aligned with domain knowledge.

Selector: candidates are clustered by structural similarity, and the search budget is spread across structurally diverse directions so that the same subtree is not explored repeatedly.

Mutator: semantic suggestions from the LLM (e.g., “replace this term with a square‑inverse”) are applied as localized AST edits, preventing wholesale rewrites.

Optimizer: once a structure is fixed, linear coefficients are solved analytically and non‑linear parameters are searched within constrained ranges (e.g., the phase of a trigonometric term). This reduces false negatives, where a correct structure is rejected only because its coefficients were fitted poorly.
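The Selector and Mutator steps can be illustrated with a small sketch. It is not taken from the FunctionEvolve code base: the nested-tuple tree encoding and the helper names `to_str`, `replace_at`, `skeleton` and `group_by_skeleton` are assumptions made for illustration, and grouping by identical skeleton is a much cruder notion of structural similarity than the clustering the article describes.

```python
from collections import defaultdict

# Assumed encoding: (op, child, ...) for internal nodes,
# a str for a variable, an int/float for a numeric constant.
INFIX = {"add": "+", "sub": "-", "mul": "*", "div": "/", "pow": "**"}


def to_str(tree):
    """Render a tree as a fully parenthesized infix string."""
    if not isinstance(tree, tuple):
        return str(tree)
    op, *children = tree
    if op in INFIX and len(children) == 2:
        return f"({to_str(children[0])} {INFIX[op]} {to_str(children[1])})"
    return f"{op}({', '.join(to_str(c) for c in children)})"


def replace_at(tree, path, new_subtree):
    """Return a new tree in which only the node at `path` is replaced.

    `path` is a tuple of 0-based child indices. Subtrees that are not on
    the path are reused as-is, so the edit stays local.
    """
    if not path:
        return new_subtree
    op, *children = tree
    children[path[0]] = replace_at(children[path[0]], path[1:], new_subtree)
    return (op, *children)


def skeleton(tree):
    """Abstract every numeric constant to "C" so only the structure remains."""
    if isinstance(tree, (int, float)):
        return "C"
    if isinstance(tree, str):
        return tree
    op, *children = tree
    return (op, *(skeleton(c) for c in children))


def group_by_skeleton(trees):
    """Bucket candidates that share the same structure (a stand-in for clustering)."""
    groups = defaultdict(list)
    for t in trees:
        groups[skeleton(t)].append(t)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical seed with the wrong distance dependence.
    seed = ("mul", ("mul", "G", ("mul", "m1", "m2")), "r")
    # "Replace this term with a square-inverse": swap only the `r` leaf.
    edited = replace_at(seed, (1,), ("pow", "r", -2))
    print(to_str(seed))
    print(to_str(edited))
```

Because `replace_at` rebuilds only the nodes along the edit path, the `G * (m1 * m2)` subtree is carried over unchanged. That is the property that separates a localized edit from regenerating the whole formula.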

Benchmarks

LLM‑SRBench (129 synthetic scientific equations): FunctionEvolve achieves SA@1 = 72/129 (55.8%) and SA@50 = 107/129 (82.9%), reported as a 3.6× improvement over the previous best (24/129). With alternative back‑ends (GPT‑5.2 medium, DeepSeek‑V4‑Pro, Qwen3.6‑27B, Llama‑3.1‑8B), SA@50 is 103, 99, 86 and 62 respectively, which indicates that the gain stems from the framework rather than from any specific closed‑source model.

AI‑Feynman (120 real physics equations): FunctionEvolve attains a perfect SA@1 = 120/120, surpassing the prior state‑of‑the‑art QDSR (107/120). An analysis of the round in which the first correct formula appears shows a contrast between the two benchmarks. On AI‑Feynman most solutions appear in the initial round, which suggests the LLM is recalling memorized equations. On LLM‑SRBench correct formulas emerge in later rounds, which is consistent with genuine reasoning during the search rather than recall.

Post‑search Candidate Filtering

Three filters are applied to the top‑5 candidates:

Pareto: balances normalized mean squared error (NMSE) and expression complexity via non‑dominated sorting.

Occam: prefers simpler formulas when training errors are comparable.

MDL: combines error and description length into a single minimum‑description‑length cost.

Pareto and Occam each retain the correct formula on more than 100 tasks, whereas ranking solely by NMSE retains it on only 89. This indicates that many correct formulas are displaced by more complex approximations with slightly lower error.
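The three filters can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Candidate` record, the node-count complexity measure, the absolute tolerance used by `occam_pick`, and the BIC-style form of `mdl_cost` are all chosen here for the example. The Pareto step shown keeps only the first non-dominated front.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    expr: str
    nmse: float       # normalized mean squared error on training data
    complexity: int   # assumed measure, e.g. number of AST nodes


def nmse(y_true, y_pred):
    """Squared error divided by the squared deviation of y_true from its mean."""
    mean_y = sum(y_true) / len(y_true)
    num = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    den = sum((t - mean_y) ** 2 for t in y_true)
    return num / den if den > 0 else float("inf")


def pareto_front(cands):
    """Keep candidates that no other candidate beats on both error and complexity."""
    front = []
    for c in cands:
        dominated = any(
            o.nmse <= c.nmse and o.complexity <= c.complexity
            and (o.nmse < c.nmse or o.complexity < c.complexity)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return front


def occam_pick(cands, abs_tol=1e-6):
    """Among candidates within `abs_tol` of the best NMSE, pick the simplest."""
    best = min(c.nmse for c in cands)
    near = [c for c in cands if c.nmse <= best + abs_tol]
    return min(near, key=lambda c: (c.complexity, c.nmse))


def mdl_cost(c, n_samples):
    """Assumed BIC-style two-part cost: data term plus model term."""
    data_term = 0.5 * n_samples * math.log(max(c.nmse, 1e-12))
    model_term = c.complexity * math.log(n_samples)
    return data_term + model_term


if __name__ == "__main__":
    # Hypothetical candidates; the second is a slightly better fit but far larger.
    cands = [
        Candidate("a/r**2", 1e-6, 5),
        Candidate("a/r**2 + b*sin(c*r)/r**3", 9e-7, 15),
        Candidate("a*r + b", 0.3, 3),
        Candidate("a*exp(b*r) + c*r", 0.5, 9),
    ]
    print("NMSE only:", min(cands, key=lambda c: c.nmse).expr)
    print("Occam:", occam_pick(cands).expr)
    print("MDL:", min(cands, key=lambda c: mdl_cost(c, 100)).expr)
    print("Pareto front:", [c.expr for c in pareto_front(cands)])
```

In this toy set, ranking by NMSE alone would choose the 15-node expression because its error is marginally lower, while the Occam and MDL rules both fall back to the compact `a/r**2`. That is the displacement effect the task counts above describe.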

Ablation Studies

Removing each component in turn (using Claude Opus 4.6 as the LLM backend) yields:

Without the LLM‑driven Mutator, SA@50 drops from 107 to 46, highlighting the importance of semantic guidance.

Without the structure‑aware Optimizer, SA@50 falls to 53, showing that coefficient‑fitting failures are a common error mode.

When the LLM cannot see the AST, SA@50 drops to 60, compared with 84 when only the rule‑based AST mutation is removed. The AST interface is therefore crucial both for generating meaningful edits and for informing the LLM about formula complexity.

Insights

The AST serves as the interface between LLM semantic cues and symbolic search, enabling:

Targeted local modifications rather than wholesale rewrites.

Retention of useful sub‑structures across iterations.

Constrained coefficient search based on structural context (e.g., limiting phase search to a single period).
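The third point, structure-aware coefficient fitting, can be sketched for a single assumed structure, y ≈ a·sin(x + φ) + b. The sketch is illustrative only: the function names, the fixed structure, and the uniform grid over the phase are assumptions, not details from the paper. For each trial phase the linear coefficients a and b are obtained in closed form, and the phase itself is searched over one period only.

```python
import math


def fit_scale_and_offset(features, ys):
    """Closed-form least squares for y ≈ a * f + b (one feature plus intercept)."""
    n = len(ys)
    mean_f = sum(features) / n
    mean_y = sum(ys) / n
    var_f = sum((f - mean_f) ** 2 for f in features)
    if var_f == 0.0:
        # Constant feature: the best fit is just the mean of y.
        return 0.0, mean_y
    cov = sum((f - mean_f) * (y - mean_y) for f, y in zip(features, ys))
    a = cov / var_f
    return a, mean_y - a * mean_f


def fit_sine_structure(xs, ys, steps=360):
    """Fit y ≈ a * sin(x + phi) + b and return (mse, a, b, phi).

    phi is scanned on a uniform grid over [0, 2*pi), i.e. a single period;
    a and b are solved analytically for every trial phi.
    """
    best = None
    for k in range(steps):
        phi = 2.0 * math.pi * k / steps
        feats = [math.sin(x + phi) for x in xs]
        a, b = fit_scale_and_offset(feats, ys)
        err = sum((a * f + b - y) ** 2 for f, y in zip(feats, ys)) / len(ys)
        if best is None or err < best[0]:
            best = (err, a, b, phi)
    return best


if __name__ == "__main__":
    # Synthetic data from a hypothetical target with phase 1.0.
    xs = [0.1 * i for i in range(100)]
    ys = [2.5 * math.sin(x + 1.0) + 0.5 for x in xs]
    err, a, b, phi = fit_sine_structure(xs, ys)
    print(f"mse={err:.2e} a={a:.3f} b={b:.3f} phi={phi:.3f}")
```

Note that (a, φ) and (−a, φ + π) describe the same curve, so the recovered sign of a depends on which of the two equivalent phases the grid reaches first. Splitting the problem this way leaves only one parameter to search, which is why a correct structure is less likely to be discarded for a bad fit.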

Experimental results demonstrate that coupling LLM semantic insight with explicit structural representations dramatically narrows the gap between numerical accuracy and symbolic equivalence in symbolic regression.

Resources

GitHub repository: https://github.com/Phoinikas03/FunctionEvolve
Paper (arXiv): https://arxiv.org/abs/2606.07704