Paper Review: AlphaBench – Benchmarking LLMs for Formalized Alpha‑Factor Mining

This article reviews AlphaBench, the first benchmark suite for assessing large language models in formalized alpha‑factor mining (FAFM). It covers the benchmark's three core tasks (factor generation, factor evaluation, and factor search) and experiments on a range of commercial and open‑source LLMs, which reveal strong potential alongside persistent challenges in robustness, efficiency, and practical usability.


Background

In quantitative investing, alpha factors are mathematical expressions that extract predictive signals from financial data. Traditionally, factor design relied on expert intuition and iterative backtesting, limiting adaptability. Recent attempts use machine learning methods such as reinforcement learning, genetic programming, and symbolic regression, but these require substantial engineering and compute resources.

The emergence of large language models (LLMs) offers a new paradigm for formalized alpha‑factor mining (FAFM) because of their strong symbolic reasoning, code generation, and formula synthesis capabilities. However, the performance of LLMs across different FAFM tasks and configurations remains unclear, and no standardized benchmark exists.

Problem Definition

Absence of a standardized benchmark for evaluating LLMs in FAFM.

Uncertainty about how different LLM settings (model type, prompting paradigm, inference strategy) affect FAFM performance.

Lack of insight into the strengths and limitations of LLMs across FAFM tasks (factor generation, factor evaluation, factor search).

Method

AlphaBench introduces three core tasks that reflect the typical workflow of quant researchers:

Factor Generation: The model receives a high‑level description (e.g., “momentum” or “mean reversion”) and produces a complete formula using allowed variables and operators (Text2Alpha). It can also perform directed mining to generate a set of factors based on a specific theme such as volatility‑driven signals.

Factor Evaluation: The model ranks candidate factors to select the top‑k and assigns an absolute score (e.g., estimated information coefficient, Sharpe ratio, or a qualitative rating).

Factor Search: The model iteratively explores the factor combination space using three representative search paradigms (a minimal sketch of the evolutionary variant appears after this list):

Chain‑of‑Experience (CoE): iteratively improves candidates by leveraging the current best solution and the historical exploration trajectory.

Tree‑of‑Thought (ToT): starts from seed factors, expands multiple candidates, prunes according to rules, and recursively grows the tree.

Evolutionary Algorithm (EA): performs steady‑state evolution on a fixed‑size pool, generating new candidates via mutation and crossover.
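To make the EA paradigm concrete, here is a minimal sketch of a steady‑state evolutionary loop over factor formulas. The helpers `llm_mutate`, `llm_crossover`, and `backtest_ic` are hypothetical stand‑ins for the LLM calls and factor backtesting the paper actually uses; this is an illustration of the search pattern, not the benchmark's implementation.

```python
import random

def evolutionary_search(seed_factors, llm_mutate, llm_crossover, backtest_ic,
                        pool_size=20, rounds=10):
    """Steady-state EA sketch: keep a fixed-size pool of factor formulas,
    generate new candidates via LLM-driven mutation/crossover, keep the best."""
    # Score the initial pool (e.g., by information coefficient on a validation window).
    pool = [(f, backtest_ic(f)) for f in seed_factors[:pool_size]]
    for _ in range(rounds):
        parents = random.sample(pool, 2)
        # Ask the LLM to recombine two parent formulas, then perturb the result.
        child = llm_mutate(llm_crossover(parents[0][0], parents[1][0]))
        score = backtest_ic(child)
        # Steady-state replacement: the child displaces the worst member if it is better.
        worst = min(pool, key=lambda x: x[1])
        if score > worst[1]:
            pool.remove(worst)
            pool.append((child, score))
    return max(pool, key=lambda x: x[1])
```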

LLM Settings and Cost

The study evaluates two groups of models:

Commercial models: Gemini family and GPT family.

Open‑source models: DeepSeek series, LLaMA 3.1 series, and Qwen series.

Two prompting methods are compared: a standard prompt and a chain‑of‑thought (CoT) prompt that forces the model to articulate intermediate reasoning steps. For the search task, different temperature values and search‑step counts are examined to assess their impact on efficiency and quality.
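To make the two prompting regimes concrete, below is a rough sketch of what a standard versus a chain‑of‑thought prompt for the Text2Alpha generation task might look like. The wording, variable names, and operator set are assumptions for illustration, not the paper's actual templates.

```python
# Hypothetical prompt templates, illustrative only (not AlphaBench's actual wording).
ALLOWED = "variables: $close, $open, $high, $low, $volume; operators: Ref, Mean, Std, Rank, Corr"

STANDARD_PROMPT = f"""You are a quantitative researcher.
Write one alpha-factor formula that expresses a momentum signal.
Allowed building blocks: {ALLOWED}.
Return only the formula."""

COT_PROMPT = f"""You are a quantitative researcher.
Task: write one alpha-factor formula that expresses a momentum signal.
Allowed building blocks: {ALLOWED}.
First reason step by step about which variables and lookback windows capture momentum,
then output the final formula on the last line, prefixed with 'FORMULA:'."""
```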

Cost estimation (using DeepSeek‑V3 as an example) shows a total token consumption of roughly 5.5 million tokens across all tasks, including about 4.2 million tokens from distinct prompts. Models that support prompt caching achieve an 85 % cache‑hit rate, substantially reducing repeated costs.
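As a rough back‑of‑the‑envelope illustration of how caching changes the bill, the sketch below applies an assumed cache discount to the reported token counts. The per‑token prices, the discount, and the assumption that the cache‑hit rate applies to the repeated (non‑distinct) tokens are placeholders, not figures from the paper.

```python
# Back-of-the-envelope cost sketch with placeholder prices (not from the paper).
total_tokens   = 5_500_000   # reported total token consumption across all tasks
unique_tokens  = 4_200_000   # reported tokens from distinct prompts
repeat_tokens  = total_tokens - unique_tokens
cache_hit_rate = 0.85        # reported cache-hit rate; assumed to apply to repeated tokens

price_per_tok        = 0.27 / 1_000_000   # hypothetical full price, USD per token
cached_price_per_tok = 0.07 / 1_000_000   # hypothetical discounted price for cache hits

billed_full   = unique_tokens + repeat_tokens * (1 - cache_hit_rate)
billed_cached = repeat_tokens * cache_hit_rate
cost = billed_full * price_per_tok + billed_cached * cached_price_per_tok
print(f"Estimated prompt cost: ${cost:.2f}")
```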

Experimental Setup

Dataset: The Alpha158 factor pool from Qlib serves as the initial candidate set, and daily price data of CSI‑300 constituents from 2020 to 2025 provides real‑world market conditions covering bull, bear, and sideways phases.
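For readers who want to reproduce the data side of this setup, a minimal sketch of loading the Alpha158 factor pool over CSI‑300 constituents with Qlib might look like the following, assuming Qlib's Chinese daily data bundle has already been downloaded to the default location.

```python
import qlib
from qlib.constant import REG_CN
from qlib.contrib.data.handler import Alpha158

# Assumes the cn_data bundle has already been fetched with Qlib's data scripts.
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data", region=REG_CN)

# Alpha158 computes the 158 predefined factors for CSI-300 constituents
# over the benchmark's 2020-2025 window.
handler = Alpha158(
    instruments="csi300",
    start_time="2020-01-01",
    end_time="2025-12-31",
    fit_start_time="2020-01-01",
    fit_end_time="2025-12-31",
)
features = handler.fetch(col_set="feature")  # DataFrame indexed by (datetime, instrument)
print(features.shape)
```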

Evaluation Metrics:

Factor Generation: reliability (producing executable code), stability (consistent outputs), and accuracy (alignment with user intent).

Factor Evaluation: ranking ability (selecting the best factors) and scoring quality (assigning meaningful quality grades).

Factor Search: cost (tokens and steps required to discover a factor) and quality (performance improvement of the discovered factor over the original).
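The quality side of these metrics usually comes down to an information coefficient (IC): the cross‑sectional rank correlation between a factor's values and subsequent returns. A minimal sketch of that computation, assuming a pandas DataFrame indexed by (datetime, instrument) with factor values and next‑period returns, is:

```python
import pandas as pd

def daily_ic(df: pd.DataFrame, factor_col: str, ret_col: str) -> pd.Series:
    """Cross-sectional rank IC per day: Spearman correlation between the factor
    and next-period returns across all instruments on that date."""
    return df.groupby(level="datetime").apply(
        lambda day: day[factor_col].corr(day[ret_col], method="spearman")
    )

# Example usage (columns 'factor' and 'next_ret' assumed):
# ic_series = daily_ic(df, "factor", "next_ret")
# print("mean IC:", ic_series.mean(), "ICIR:", ic_series.mean() / ic_series.std())
```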

Results

Factor Generation

Most models achieve high reliability, but accuracy drops sharply as instruction difficulty increases. Larger commercial models generally outperform smaller open‑source models, which obtain lower overall scores. CoT prompting yields modest improvements for smaller models but can reduce stability for larger models.

Factor Evaluation

Nearly all LLMs perform poorly on the evaluation task, often failing to serve as reliable factor evaluators. CoT prompting does not consistently improve performance and can even degrade some metrics. Open‑source models lag behind, while Gemini‑2.5‑Flash shows relatively balanced signal accuracy and metric stability.

Factor Search

A clear trade‑off emerges between search quality and cost efficiency. Gemini‑2.5‑Pro delivers strong raw performance but incurs high token costs; GPT‑5 achieves a better balance of effectiveness and efficiency; medium‑sized models occupy a middle ground; smaller or weaker open‑source models fall behind.

Supplementary Findings

In the generation task, simple settings yield high reliability, yet accuracy declines with difficulty; CoT impact varies by model and task.

For evaluation, ranking results are close to random baselines, and the influence of CoT is highly variable. In scoring, signal‑direction accuracy remains near random, and CoT provides limited calibration benefit.

In atomic evaluation (signal classification vs. pairwise selection), all models struggle to identify noisy factors, and CoT has little effect. In pairwise selection, GPT‑5 excels, and supervised fine‑tuning (SFT) markedly improves pairwise evaluation while harming noise classification.

Search experiments reveal Gemini‑2.5‑Flash excels across multiple metrics. Lower temperature values typically yield more stable performance, whereas higher temperatures increase diversity at the expense of efficiency. The EA‑20 configuration (generating 20 candidates per round) attains the best balance among cost, diversity, and improvement.

Conclusion

AlphaBench demonstrates that LLMs possess strong potential for automating alpha‑factor mining, yet challenges remain in robustness, search efficiency, and real‑world applicability. Larger commercial models tend to perform better, but careful prompt design and inference configuration are essential, especially for smaller models.

Tags: LLM, large language models, benchmark, quantitative finance, AlphaBench, factor mining, FAFM
Written by Bighead's Algorithm Notes

Focused on AI applications in the fintech sector