Fin-PRM: Alibaba’s Dianjin Team Introduces a Domain-Specific Process Reward Model for Financial Reasoning

Fin‑PRM, a domain‑specific process reward model for financial reasoning introduced by Alibaba’s Dianjin team, employs dual‑level step and trajectory rewards to provide fine‑grained supervision, achieving up to 12.9% accuracy gains in supervised fine‑tuning and 5.1% improvements in Best‑of‑N inference on benchmarks such as CFLUE and FinQA.


Background

Large language models (LLMs) demonstrate strong reasoning abilities, but financial applications such as report analysis, investment strategy formulation, and compliance assessment demand higher precision, factual correctness, and logical consistency. Existing process reward models (PRMs) are generic or STEM‑focused and suffer from three main limitations in finance: lack of domain specificity, unreliable reward signals from LLM‑as‑a‑Judge annotation, and insufficient multi‑dimensional supervision.

Problem Definition

The paper identifies three core issues with current PRMs for financial reasoning: (1) they cannot capture the structured, symbolic nature of financial logic; (2) reward annotations rely on opaque LLM judges, making factual verification difficult; (3) they supervise only a single dimension (step or trajectory), failing to jointly capture local step correctness and global logical coherence.

Method

Fin‑PRM addresses these problems by introducing a dual‑level reward supervision framework—step‑level and trajectory‑level—to evaluate financial reasoning with fine granularity.

Step‑level Reward Modeling

Step‑level reward R_{step} evaluates each reasoning step s_t through three components:

Importance score r_{importance}: estimated by Monte Carlo rollouts with Qwen2.5‑Math‑7B, which generates N=8 continuations from s_t; the score is the proportion of continuations that reach the correct answer.

Quality score r_{qual}: Qwen3‑235B‑A22B assesses semantic coherence, logical soundness, and answer orientation, outputting a scalar in [0,1].

Accuracy score r_{acc}: combines step correctness r_{step\_correctness} and knowledge accuracy r_{knowledge\_acc} with weight w_k=1.0.

The three scores are combined with dynamic Softmax weighting, and the aggregate is binarized at a threshold of 0.5 to produce step‑level supervision labels.
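
The sketch below illustrates one plausible reading of this aggregation, assuming the "dynamic Softmax weighting" is a softmax over the three raw scores used as mixing weights; the helper names (step_reward, importance_score, rollout_is_correct) are illustrative, not from the paper.

```python
import math

def step_reward(r_importance: float, r_qual: float, r_acc: float,
                threshold: float = 0.5) -> tuple[float, int]:
    """Combine the three step-level scores into a scalar reward and a binary label.

    Assumption: the dynamic weighting is a softmax over the three raw scores;
    the paper may define the weighting differently.
    """
    scores = [r_importance, r_qual, r_acc]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]                  # softmax mixing weights
    r_step = sum(w * s for w, s in zip(weights, scores))
    label = int(r_step >= threshold)                     # binarize at 0.5
    return r_step, label

def importance_score(step_prefix: str, rollout_is_correct, n: int = 8) -> float:
    """Monte-Carlo importance estimate: the fraction of n continuations sampled
    from the step prefix that reach the correct final answer. `rollout_is_correct`
    stands in for sampling one continuation with the math model and checking it."""
    return sum(rollout_is_correct(step_prefix) for _ in range(n)) / n
```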

Trajectory‑level Reward Modeling

Trajectory‑level reward R_{traj} evaluates the entire reasoning trace s via two components:

Result correctness r_{out}: strict 0/1 match between the final answer a and the ground‑truth answer y.

Knowledge coverage r_{cover}: proportion of knowledge base elements covered by the trace, computed with a knowledge extraction function \phi_{ext}.

The final trajectory reward is binarized at a threshold of 1.25.
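
A minimal sketch of the trajectory‑level reward follows. The plain sum of the two components (giving a reward in [0, 2]) is an assumption consistent with the stated 1.25 threshold, and the substring check is a stand‑in for the knowledge extraction function \phi_{ext}.

```python
def trajectory_reward(pred_answer: str, gold_answer: str,
                      trace: str, knowledge_base: set[str],
                      threshold: float = 1.25) -> tuple[float, int]:
    """Trajectory-level reward: exact-match correctness plus knowledge coverage."""
    r_out = float(pred_answer.strip() == gold_answer.strip())   # strict 0/1 match

    covered = {k for k in knowledge_base if k in trace}         # stand-in for phi_ext
    r_cover = len(covered) / max(len(knowledge_base), 1)

    r_traj = r_out + r_cover           # assumed combination: simple sum in [0, 2]
    label = int(r_traj >= threshold)   # binarize at 1.25
    return r_traj, label
```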

Training Objective

Fin‑PRM jointly optimizes step‑level loss L_{step} (mean binary cross‑entropy over all steps) and trajectory‑level loss L_{traj} (binary cross‑entropy over the whole trace).
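
A hedged PyTorch‑style sketch of this joint objective: mean binary cross‑entropy over per‑step predictions plus binary cross‑entropy over a single trajectory prediction. The balancing coefficient `lam` is an assumption; the paper's exact weighting between the two terms is not reproduced here.

```python
import torch
import torch.nn.functional as F

def fin_prm_loss(step_logits: torch.Tensor, step_labels: torch.Tensor,
                 traj_logit: torch.Tensor, traj_label: torch.Tensor,
                 lam: float = 1.0) -> torch.Tensor:
    """Joint Fin-PRM objective (sketch).

    step_logits: (T,) one logit per reasoning step
    step_labels: (T,) binary step supervision labels
    traj_logit:  scalar logit for the whole trace
    traj_label:  scalar binary trajectory label
    `lam` is an assumed balancing coefficient, not from the paper.
    """
    l_step = F.binary_cross_entropy_with_logits(step_logits, step_labels.float())
    l_traj = F.binary_cross_entropy_with_logits(traj_logit, traj_label.float())
    return l_step + lam * l_traj
```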

Data Construction

Using the OpenThoughts synthesis framework, 3,000 financial‑reasoning samples were generated from CFLUE (Chinese Financial Language Understanding Evaluation) questions, with reasoning traces produced by DeepSeek‑R1. Each sample contains the question x, a reasoning trace s=(s_1,…,s_T), an answer a, a knowledge base K, the true answer y, and an expert analysis y_{analysis}. Financial terms and definitions were extracted with Qwen3‑235B‑A22B to build K, and reward labels were produced by combining LLM‑as‑a‑Judge assessment with knowledge verification.
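
For concreteness, a single synthesized training example might be laid out roughly as follows; the field names simply mirror the notation above (x, s, a, K, y, y_analysis) and are illustrative, not the paper's actual schema.

```python
sample = {
    "question": "...",                                  # x: CFLUE-derived financial question
    "reasoning_trace": ["step 1 ...", "step 2 ..."],    # s = (s_1, ..., s_T)
    "answer": "...",                                    # a: final answer of the trace
    "knowledge_base": ["term: definition", "..."],      # K, extracted by the judge model
    "gold_answer": "...",                               # y: ground-truth answer
    "expert_analysis": "...",                           # y_analysis
    "step_labels": [1, 0, 1],                           # binarized step-level supervision
    "trajectory_label": 1,                              # binarized trajectory-level supervision
}
```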

Experiments

Experimental Setup

Datasets: CFLUE, FinQA, and Math500 (cross‑domain math reasoning). Baselines: generic PRMs (Qwen2.5‑Math‑PRM‑7B/72B), SFT on randomly selected data, majority‑voting Best‑of‑N, and GRPO reinforcement learning with an outcome‑based (rule‑based) reward.

Results

Supervised Fine‑Tuning (SFT): Using Fin‑PRM to select 1,000 high‑quality samples for fine‑tuning Qwen2.5‑7B‑Instruct yields 58.2% accuracy on CFLUE, a +12.9% improvement over random selection (43.8%) and +1.1% over the generic PRM baseline (57.1%).
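
A sketch of this PRM‑based data selection, assuming each candidate sample is ranked by its Fin‑PRM score and the top 1,000 are kept for SFT; `prm_score` is a placeholder for the trained reward model.

```python
def select_sft_data(samples, prm_score, k=1000):
    """Rank candidate training samples by Fin-PRM score and keep the top k."""
    return sorted(samples, key=prm_score, reverse=True)[:k]
```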

Best‑of‑N Inference: With N=16, Fin‑PRM‑guided Best‑of‑N improves CFLUE accuracy by 5.1% over majority voting and matches a specialized math PRM on Math500, demonstrating retained logical‑evaluation capability.
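
A sketch of PRM‑guided Best‑of‑N: sample N=16 candidate traces, score each with Fin‑PRM, and keep the highest‑scoring one rather than taking a majority vote over final answers. `generate` and `prm_score` are placeholders.

```python
def best_of_n(question, generate, prm_score, n=16):
    """PRM-guided Best-of-N: return the candidate trace with the highest Fin-PRM score."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=prm_score)
```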

Reinforcement Learning (GRPO): Incorporating Fin‑PRM rewards raises performance to 70.5% on CFLUE (+3.3% over the rule‑based reward) and 62.8% on FinQA (+1.4% over the generic PRM), confirming the guiding effect of process rewards for policy optimization.
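
One hedged way such a process reward could enter GRPO is sketched below: the scalar reward for each sampled trace blends the rule‑based outcome reward with the Fin‑PRM score before group‑relative advantages are computed. The linear blend and the coefficient `beta` are assumptions, not the paper's exact recipe.

```python
def grpo_rewards(traces, outcome_reward, prm_score, beta=0.5):
    """Blend rule-based outcome reward with Fin-PRM process reward for a group of
    sampled traces, then normalize within the group (group-relative advantages)."""
    raw = [outcome_reward(t) + beta * prm_score(t) for t in traces]
    mean = sum(raw) / len(raw)
    std = (sum((r - mean) ** 2 for r in raw) / len(raw)) ** 0.5 or 1.0
    return [(r - mean) / std for r in raw]
```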

Ablation Study: Adjusting the weight \zeta between step‑level and trajectory‑level rewards shows that the balanced setting \zeta=1.0 yields the highest Best‑of‑N accuracy (65.8% at N=16), validating the necessity of dual‑level supervision.
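
If the two levels are combined linearly, the reranking reward would take the form R = R_{step} + \zeta \cdot R_{traj} (the exact combination rule is an assumption here), so \zeta=1.0 corresponds to weighting step‑level and trajectory‑level signals equally.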
