FinRL‑DeepSeek: How Integrating DeepSeek with RL Improves Portfolio Returns (Code Open‑Source)

This article reviews FinRL‑DeepSeek, a risk‑sensitive trading agent that combines reinforcement learning with large language models (LLMs). The LLMs extract stock recommendations and news‑based risk scores from financial news, and these signals feed an extended CVaR‑PPO algorithm. The article walks through experiments on the FNSPID dataset and discusses the resulting performance gains and future work.


Background

Automated trading agents using reinforcement learning (RL) are increasingly common, but they often ignore alternative data sources such as financial news and lack explicit risk management. This paper proposes a risk‑sensitive RL agent that incorporates large language models (LLMs) to extract stock‑specific recommendation and risk scores from news.

Problem Definition

The goal is to address the shortcomings of RL agents by integrating financial‑news‑driven recommendation and risk assessment signals, thereby improving trading performance and handling market risk.

Method

Data and LLM Prompt

The FNSPID dataset (1999‑2023, 15.7 million time‑aligned news records) is down‑sampled to 2 million entries by randomly selecting one representative article per stock each day. Three LLMs—DeepSeek V3, Qwen 2.5 72B, and Llama 3.3 70B—are prompted to generate a recommendation score S_f and a risk score R_f for each stock‑day.
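The paper's actual prompt is not reproduced here, but the scoring step amounts to one LLM call per stock‑day article. Below is a minimal, hypothetical sketch assuming an OpenAI‑compatible client; the prompt wording, the `deepseek-chat` model name, and the JSON schema are illustrative assumptions, not the paper's specification.

```python
import json
from openai import OpenAI  # assumed OpenAI-compatible client; DeepSeek/Qwen/Llama endpoints vary

client = OpenAI()

# Hypothetical prompt: the paper's actual wording is not published in this summary.
PROMPT = (
    "You are a financial analyst. Given the news article below about {ticker}, "
    "return a JSON object with two integer fields on a 1-5 scale: "
    "'recommendation' (1 = strong sell, 5 = strong buy) and "
    "'risk' (1 = very low risk, 5 = very high risk).\n\nArticle:\n{article}"
)

def score_article(ticker: str, article: str, model: str = "deepseek-chat") -> dict:
    """Return {'recommendation': int, 'risk': int} for one stock-day article."""
    response = client.chat.completions.create(
        model=model,  # model name is an assumption
        messages=[{"role": "user", "content": PROMPT.format(ticker=ticker, article=article)}],
        temperature=0.0,
    )
    # Assumes the model returns valid JSON; production code would validate and retry.
    return json.loads(response.choices[0].message.content)
```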

PPO Baseline

Standard Proximal Policy Optimization (PPO) maximizes the clipped surrogate objective

L_{PPO}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,A_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},

where A_t is the advantage estimate and \epsilon bounds how far the probability ratio r_t(\theta) can move in a single policy update.
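For concreteness, here is a minimal PyTorch sketch of the clipped surrogate above, negated so a standard optimizer can minimize it:

```python
import torch

def ppo_clip_loss(log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  epsilon: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate loss (negated for gradient descent)."""
    ratio = torch.exp(log_probs - old_log_probs)              # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```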

CVaR‑PPO

Conditional Value‑at‑Risk PPO (CVaR‑PPO) extends PPO with a risk constraint that penalizes trajectories with large losses. Its objective adds the term

\lambda \cdot \max\left(0,\, L_{CVaR}(\theta) - \eta\right),

where L_{CVaR}(\theta) is the CVaR loss, \eta is the CVaR threshold, \lambda is a Lagrange multiplier, \alpha is the confidence level (e.g., 0.05 for the worst 5% of outcomes), and \beta is an auxiliary penalty parameter.
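The paper's exact CVaR estimator is not reproduced in this summary; the sketch below shows one standard way to compute the hinge penalty from a batch of trajectory returns, with a tail average standing in for L_{CVaR}(\theta):

```python
import torch

def cvar_hinge_penalty(trajectory_returns: torch.Tensor,
                       eta: float,
                       lam: float,
                       alpha: float = 0.05) -> torch.Tensor:
    """Penalty lam * max(0, CVaR - eta) on the worst-alpha tail of trajectory losses."""
    losses = -trajectory_returns                      # losses are negated returns
    var = torch.quantile(losses, 1 - alpha)           # Value-at-Risk at level alpha
    tail = losses[losses >= var]                      # worst alpha fraction of trajectories
    cvar = tail.mean() if tail.numel() > 0 else var   # expected loss beyond VaR
    return lam * torch.clamp(cvar - eta, min=0.0)
```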

LLM‑Injected PPO (LLM‑PPO)

For each stock i on day t, the LLM‑derived recommendation score S_f^i modulates the PPO action: S_f > 1 amplifies the action, S_f < 1 attenuates it, and S_f = 1 leaves it unchanged. Parameter settings keep S_f close to 1, so the LLM signal nudges the RL policy rather than overriding it.
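A minimal sketch of this modulation step, assuming the action vector and the per‑stock scores S_f^i are aligned NumPy arrays:

```python
import numpy as np

def inject_recommendation(actions: np.ndarray, s_f: np.ndarray) -> np.ndarray:
    """Element-wise modulation of PPO actions by per-stock scores S_f^i.
    S_f > 1 amplifies, S_f < 1 attenuates, S_f = 1 leaves the action unchanged."""
    return actions * s_f

# Example: scores barely above/below 1 nudge, but do not flip, the RL actions.
actions = np.array([0.4, -0.2, 0.1])
s_f = np.array([1.05, 0.95, 1.0])
print(inject_recommendation(actions, s_f))  # [ 0.42 -0.19  0.1 ]
```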

LLM‑Injected CVaR‑PPO (CPPO)

The LLM‑derived risk score R_f^i is used to compute a total risk score that adjusts the trajectory return in CVaR‑PPO. The adjusted return incorporates the portfolio weight w_i of each stock, ensuring the sum of weights equals 1.
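The exact adjustment formula is not spelled out in this summary. Purely for illustration, the sketch below assumes the total risk score is the weight‑averaged per‑stock risk (with scores near 1) and that it divides the trajectory return; the paper's own aggregation may differ.

```python
import numpy as np

def risk_adjusted_return(trajectory_return: float,
                         weights: np.ndarray,
                         r_f: np.ndarray) -> float:
    """Adjust a trajectory return by an aggregate risk score.
    weights are the portfolio weights w_i (must sum to 1);
    r_f holds the per-stock LLM risk scores R_f^i, assumed close to 1."""
    assert np.isclose(weights.sum(), 1.0), "portfolio weights must sum to 1"
    total_risk = float(weights @ r_f)        # weight-averaged portfolio risk score
    return trajectory_return / total_risk    # assumed form: higher risk shrinks the return
```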

Experiments

Early Stopping (400‑500 k steps)

Two training regimes were tested. Setting 1 used data from 2019‑2022 with 500 k steps (25 epochs) and a 2023 test period; with Qwen 2.5, LLM‑based recommendations improved cumulative PPO returns but did not surpass the Nasdaq‑100 benchmark.

Setting 2 used data from 2013‑2018 with 400 k steps (20 epochs) and a 2019‑2023 test period; longer training yielded significant gains for both PPO and CPPO, yet PPO remained volatile. DeepSeek V3 performed slightly better than Llama 3.3, and in this configuration LLM integration degraded performance.

Performance After 2 M Training Steps

After 2 million steps, four variants (PPO, CPPO, PPO‑DeepSeek, CPPO‑DeepSeek) were evaluated over 100 epochs on metrics such as the information ratio, CVaR, and the Rachev ratio.
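For reference, these metrics can be computed from a series of daily returns as follows. These are standard textbook definitions; the paper may use slightly different estimators.

```python
import numpy as np

def information_ratio(returns: np.ndarray, benchmark: np.ndarray) -> float:
    """Mean active return over its standard deviation (tracking error)."""
    active = returns - benchmark
    return active.mean() / active.std(ddof=1)

def cvar(returns: np.ndarray, alpha: float = 0.05) -> float:
    """Mean of the worst alpha fraction of returns."""
    r = np.sort(returns)
    k = max(1, int(np.ceil(alpha * len(r))))
    return r[:k].mean()

def rachev_ratio(returns: np.ndarray, alpha: float = 0.05) -> float:
    """Mean gain of the best alpha tail over the mean loss of the worst alpha tail."""
    r = np.sort(returns)
    k = max(1, int(np.ceil(alpha * len(r))))
    return r[-k:].mean() / -r[:k].mean()
```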

Both PPO‑DeepSeek and CPPO‑DeepSeek outperformed their RL‑only baselines and the Nasdaq‑100 benchmark. PPO excelled in bull markets, while CPPO‑DeepSeek performed better in bear markets, with a turning point around the end of 2021.

Impact of LLM Injection Strength

Varying the injection intensity from 10% down to 0.1% showed that stronger LLM injection generally reduced PPO‑DeepSeek performance, even with minimal perturbations, whereas CPPO‑DeepSeek benefited from stronger injection.
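One simple way to parameterize injection intensity is to interpolate the multiplier toward the neutral value 1. The centering at a score of 3 and the linear form below are illustrative assumptions, not the paper's formula.

```python
def injection_multiplier(score: int, intensity: float) -> float:
    """Map a 1-5 LLM score to a multiplier near 1.
    intensity is the injection strength (e.g., 0.10 for 10%, 0.001 for 0.1%).
    Centering at 3 and the linear mapping are illustrative assumptions."""
    return 1.0 + intensity * (score - 3) / 2.0

# Example: at 10% intensity a strong-buy score of 5 gives 1.10; at 0.1% it gives 1.001.
```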

Conclusion

The study introduces an LLM‑augmented RL agent for algorithmic trading that combines stock‑level recommendations and news‑derived risk assessments. Future work will focus on reducing RAM consumption for longer training, shortening decision horizons to react faster to market events, and improving the quality of news signals.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: LLM, DeepSeek, PPO, RL, Algorithmic Trading, CVaR, FinRL
Written by Bighead's Algorithm Notes, focused on AI applications in the fintech sector.