Alpha‑R1: Reinforcement‑Learning‑Driven Large‑Model Alpha Factor Selection

Alpha‑R1 integrates reinforcement learning with an 8‑billion‑parameter LLM to jointly process price and news data, producing context‑aware factor selections that outperform traditional quantitative and generic LLM baselines on CSI 300 and CSI 1000 portfolios, while demonstrating robust alpha‑decay resistance and zero‑shot generalisation.


Background

Factor investing is a cornerstone of modern asset management, but the proliferation of factors (the “factor zoo”) and the emergence of large language models (LLMs) for extracting sentiment from news and reports have created a need for a unified framework that captures semantic interactions between numerical and textual signals in a non‑stationary market. Existing pipelines treat numeric indicators and text signals as independent modalities and lack a mechanism to jointly reason about their dynamic influence on investment decisions.

Problem Definition

Integrate traditional numeric indicators and textual signals within a unified framework to capture their semantic interactions for better handling of uncertainty in non‑stationary markets.

Leverage the reasoning ability of large models for effective alpha‑factor selection while maintaining interpretability and stability.

Address the poor adaptability of existing methods to market regime shifts and the misalignment of generic LLMs with financial principles.

Method

Alpha‑R1 is an 8‑billion‑parameter inference model trained via reinforcement learning. The method consists of several components.

3.1 Data, Memory and Factor Baseline

Raw data abstraction. Heterogeneous raw data are converted into structured text atoms. For each time step t, two complementary market descriptors are built: a price market descriptor S_{t}^{price} and a news market descriptor S_{t}^{news}.
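
A minimal sketch of the text‑atom construction, with illustrative field names and formats (the paper's exact schema is not reproduced here):

    def build_price_descriptor(date, ohlcv: dict) -> str:
        """Render one day's price/volume record as a compact structured text atom S_t^price."""
        ret = (ohlcv["close"] - ohlcv["prev_close"]) / ohlcv["prev_close"]
        return (
            f"[PRICE {date}] open={ohlcv['open']:.2f} high={ohlcv['high']:.2f} "
            f"low={ohlcv['low']:.2f} close={ohlcv['close']:.2f} "
            f"volume={ohlcv['volume']:.0f} daily_return={ret:+.2%}"
        )

    def build_news_descriptor(date, headlines: list[str], max_items: int = 5) -> str:
        """Concatenate the day's headlines into a single text atom S_t^news."""
        items = "; ".join(headlines[:max_items]) if headlines else "no notable news"
        return f"[NEWS {date}] {items}"

    # Example usage
    s_price = build_price_descriptor(
        "2024-07-01",
        {"open": 10.2, "high": 10.6, "low": 10.1, "close": 10.5,
         "prev_close": 10.3, "volume": 1_250_000},
    )
    s_news = build_news_descriptor("2024-07-01", ["Company A announces buyback"])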

Iterative memory construction. An iterative‑memory pipeline aggregates weekly market summaries M_{w}: a large model recursively folds each week's summary into a running memory, producing a global market description M_{global} after the entire back‑test period.
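
A minimal sketch of the recursive update, assuming a generic llm(prompt) -> str callable; the prompt wording is illustrative, not the paper's:

    def build_global_memory(weekly_atoms: list[list[str]], llm) -> str:
        """Recursively fold weekly market descriptors into one global market description."""
        memory = ""  # M_global starts empty and is refined week by week
        for week_atoms in weekly_atoms:
            week_text = "\n".join(week_atoms)
            prompt = (
                "Previous market memory:\n" + memory +
                "\n\nThis week's market descriptors:\n" + week_text +
                "\n\nUpdate the memory into a concise global market description."
            )
            memory = llm(prompt)  # the weekly summary M_w replaces the running memory
        return memory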

Factor back‑testing. Each factor i in the pool is back‑tested to obtain a performance vector P_{i} containing return, volatility and decay characteristics.
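
A rough sketch of how such a performance vector might be assembled from a factor's daily long‑short returns and IC series; the specific fields and the decay proxy are assumptions for illustration:

    import numpy as np

    def backtest_factor(daily_returns: np.ndarray, daily_ic: np.ndarray) -> dict:
        """Summarise a factor's back-test into a small performance vector P_i."""
        ann_return = daily_returns.mean() * 252
        ann_vol = daily_returns.std(ddof=1) * np.sqrt(252)
        # crude alpha-decay proxy: slope of the IC series over time (negative = decaying)
        t = np.arange(len(daily_ic))
        decay_slope = np.polyfit(t, daily_ic, 1)[0]
        return {"return": ann_return, "volatility": ann_vol, "ic_decay": decay_slope}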

3.2 Factor and State Description

Factor semantic description. Quantitative signals are mapped to structured semantic profiles. Using the global market description M_{global} and the factor‑specific back‑test results P_{i}, a large model generates a semantic profile for each factor.

Asset‑pool state description. For each decision day t, the instantaneous market state is synthesized from the two atomic units S_{t}^{price} and S_{t}^{news}.

3.3 Alpha‑R1 Inference Model

At decision time t, the factor semantic description and the market state are concatenated into a high‑dimensional semantic context. Alpha‑R1 performs inference over this context and outputs the selected factor list A_{t}. The model can be interpreted as a context‑conditioned sparse linear model, where the LLM core activates or deactivates factors based on alignment between factor configurations and current market semantics.
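
The sparse‑linear reading of this step can be made concrete with a small sketch; the binary gating and equal factor weights below are illustrative assumptions, not the paper's exact formulation:

    import numpy as np

    def score_assets(factor_matrix: np.ndarray, factor_names: list[str],
                     selected: list[str]) -> np.ndarray:
        """factor_matrix: (n_assets, n_factors) standardized exposures.
        The LLM's selected factor list A_t acts as a 0/1 gate over the pool."""
        gate = np.array([1.0 if name in selected else 0.0 for name in factor_names])
        return factor_matrix @ gate  # only activated factors contribute to the score

    # Example: 3 assets, 4 factors, the model activates two factors for today's regime
    X = np.random.randn(3, 4)
    names = ["alpha001", "alpha012", "alpha033", "alpha054"]
    scores = score_assets(X, names, selected=["alpha012", "alpha054"])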

3.4 Market‑Feedback Reinforcement Learning (GRPO)

Backbone model and stability. Qwen3‑8B is used as the backbone to accelerate RL convergence and improve output consistency.

Multi‑component reward function. The final reward is R_{final} = R_{adjusted} - P_{structural}, where R_{adjusted} combines market‑performance feedback with an inference‑quality assessment, and P_{structural} penalises overly complex factor selections.
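
As a rough illustration of the reward composition (the relative weighting of the two terms inside R_{adjusted} and the exact form of the complexity penalty are assumptions, not taken from the paper):

    def final_reward(market_perf: float, inference_quality: float,
                     n_selected: int, max_factors: int = 10,
                     quality_weight: float = 0.3, penalty_scale: float = 0.05) -> float:
        """R_final = R_adjusted - P_structural with illustrative weights."""
        r_adjusted = market_perf + quality_weight * inference_quality
        p_structural = penalty_scale * max(0, n_selected - max_factors)  # penalise bloated selections
        return r_adjusted - p_structural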

GRPO optimization. Group Relative Policy Optimization normalises advantage estimates within each group of sampled outputs and uses probability ratios to update the policy.
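
A minimal sketch of the GRPO update signal, assuming the standard group‑normalised advantages and PPO‑style clipped probability ratios; the clipping threshold is an illustrative choice:

    import numpy as np

    def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
        """Normalise rewards within one group of G sampled factor selections."""
        return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

    def grpo_objective(logp_new: np.ndarray, logp_old: np.ndarray,
                       advantages: np.ndarray, clip_eps: float = 0.2) -> float:
        """Clipped-ratio surrogate objective averaged over the group."""
        ratio = np.exp(logp_new - logp_old)
        clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
        return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))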

3.5 Portfolio Construction and Execution

Slot‑rotation mechanism. The total capital C is divided into H independent sub‑portfolios (slots). On each trading day only the slot with index k = t mod H is rebalanced; the remaining H-1 slots stay passive.
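
A small sketch of the rotation schedule; select_stocks is a stand‑in for the model's daily top‑N picks:

    def run_slot_rotation(total_capital: float, H: int, n_days: int, select_stocks):
        """Split capital into H slots and rebalance only slot k = t mod H on day t."""
        slot_capital = total_capital / H
        slots = [[] for _ in range(H)]          # current holdings per slot
        for t in range(n_days):
            k = t % H                           # only this slot trades today
            slots[k] = select_stocks(t)         # rebalance into today's top-N picks
            # the other H-1 slots keep their positions untouched
        return slots, slot_capital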

VWAP execution with constraints. Execution prices are calculated with a volume‑weighted average price (VWAP) model, subject to additional constraints such as moving‑window limits, IPO exclusion and transaction‑cost modelling (0.1% bid‑ask spread).
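
A minimal sketch of VWAP fills with the flat 0.1% cost; the moving‑window and IPO constraints are omitted here:

    import numpy as np

    def vwap(prices: np.ndarray, volumes: np.ndarray) -> float:
        """Volume-weighted average price over an execution window (e.g. 30 minutes)."""
        return float(np.sum(prices * volumes) / np.sum(volumes))

    def execution_price(window_prices, window_volumes, side: str, cost: float = 0.001):
        """Apply the bilateral transaction cost on top of the window VWAP."""
        p = vwap(np.asarray(window_prices), np.asarray(window_volumes))
        return p * (1 + cost) if side == "buy" else p * (1 - cost)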

Experiments

4.1 Experimental Setup

The factor pool is derived from Alpha101, retaining 82 computable factors. The timeline is split into pre‑training (2020‑01‑01 ~ 2023‑12‑31), training (2024‑07‑01 ~ 2024‑12‑31) and testing (2025‑01‑01 ~ 2025‑06‑30). Testing is performed on two asset pools: CSI 300 and CSI 1000.

Baselines include traditional quantitative strategies (PCA, XGBoost, LightGBM, A2C, PPO) and inference‑capable LLMs (Gemini 2.5 Pro Thinking, Claude 3.7 Sonnet Thinking, DeepSeek‑R1, Qwen3‑8B). A slot‑rotation horizon H=5 days and a top‑N selection of 10 stocks per slot are used; trades are executed with 30‑minute VWAP prices and a bilateral cost of 0.1%.

Performance metrics are cumulative return (CR), annualised return (AR), Sharpe ratio (SR) and maximum drawdown (MDD), averaged over five independent runs.
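
For reference, these four metrics can be computed from a daily strategy return series as follows (252 trading days per year and a zero risk‑free rate are assumed):

    import numpy as np

    def performance_metrics(daily_returns: np.ndarray) -> dict:
        """Cumulative return, annualised return, Sharpe ratio and maximum drawdown."""
        equity = np.cumprod(1 + daily_returns)
        cr = equity[-1] - 1
        ar = (1 + cr) ** (252 / len(daily_returns)) - 1
        sr = daily_returns.mean() / (daily_returns.std(ddof=1) + 1e-12) * np.sqrt(252)
        running_max = np.maximum.accumulate(equity)
        mdd = np.max(1 - equity / running_max)
        return {"CR": cr, "AR": ar, "SR": sr, "MDD": mdd}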

4.2 Performance Evaluation

In‑sample (CSI 300). Alpha‑R1 achieves the best results. Tree‑based models suffer in non‑stationary markets, numeric RL agents are sensitive to distribution shift, and generic LLMs lack domain‑specific financial knowledge. Alpha‑R1’s context‑conditioned sparse linear formulation yields stable, risk‑aware decisions.

Zero‑shot generalisation (CSI 1000). Alpha‑R1 maintains strong performance, whereas traditional RL agents show limited generalisation. The semantic understanding and market‑feedback loop enable dynamic factor adjustment under high volatility.

4.3 Ablation Study

Removing each key component (news, price, semantic description, RL optimisation) demonstrates that the combination of reinforcement‑learning alignment, semantic reasoning and multimodal signal integration is essential for Alpha‑R1’s superior risk‑adjusted returns.

4.4 Semantic vs Heuristic Gating

Semantic gating is compared with heuristic gates such as Lasso and IC momentum. Across the CSI 300 test set, semantic gating consistently outperforms heuristics, showing greater robustness to market regime changes.
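
For concreteness, one plausible form of the IC‑momentum heuristic gate is sketched below; the lookback window and top‑k are illustrative choices, not the paper's settings:

    import numpy as np

    def ic_momentum_gate(ic_history: np.ndarray, factor_names: list[str],
                         lookback: int = 20, top_k: int = 10) -> list[str]:
        """Keep the factors with the highest mean IC over a recent lookback window.
        ic_history: (n_days, n_factors) daily information coefficients per factor."""
        recent_ic = ic_history[-lookback:].mean(axis=0)
        top_idx = np.argsort(recent_ic)[::-1][:top_k]
        return [factor_names[i] for i in top_idx]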

4.5 Robustness and Generalisation Analysis

Parameter‑sensitivity experiments reveal stable performance across a range of settings. Transfer from CSI 300 to CSI 1000 demonstrates strong zero‑shot generalisation, delivering stable and profitable outcomes in different asset pools.

Conclusion

Alpha‑R1 demonstrates that a large‑model inference engine trained with reinforcement learning can effectively fuse price and news information, produce context‑aware factor selections, and achieve robust, superior performance compared with both traditional quantitative methods and generic LLM baselines.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: large language model, reinforcement learning, financial AI, market prediction, alpha factor selection
Written by Bighead's Algorithm Notes, focused on AI applications in the fintech sector.