FinRpt: A Multi‑Agent Framework for Automatic Generation and Evaluation of Stock Research Reports
FinRpt introduces a novel multi‑agent pipeline that builds a high‑quality equity research report (ERR) dataset from six financial data sources, defines a comprehensive 11‑metric evaluation suite, and demonstrates that supervised fine‑tuning and reinforcement learning enable LLM agents to significantly outperform single‑LLM baselines in both accuracy and efficiency.
Background
Large language models (LLMs) have achieved strong results in many financial NLP tasks such as sentiment analysis and question answering, yet fully automated generation of equity research reports (ERRs) remains under‑explored due to data scarcity and the lack of suitable evaluation metrics.
Problem Definition
The paper defines the ERR generation task, which requires constructing a high‑quality dataset, designing a comprehensive evaluation system, and developing a dedicated multi‑agent framework to handle the complexity of report creation.
Method
ERR Generation Task Definition
Given a stock ticker and a research date t, the system collects recent information from six sources S = [O, F, A, N, P, M] (company info, financial indicators, announcements, news, historical prices, market index) and generates an ERR R that must contain six sections: financial analysis, news analysis, management analysis, risk analysis, investment potential, and recommendation.
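The task definition above can be sketched as a small data structure. The field and section names below follow the summary's wording; the exact schema (and the check for section coverage) is an illustrative assumption, not the paper's actual code.

```python
from dataclasses import dataclass

@dataclass
class ERRInstance:
    """Hypothetical container for one ERR generation instance."""
    ticker: str     # stock ticker s
    date: str       # research date t
    sources: dict   # S = {O, F, A, N, P, M}: company info, financial
                    # indicators, announcements, news, prices, market index

# The six sections a generated report R must contain.
REQUIRED_SECTIONS = [
    "financial analysis", "news analysis", "management analysis",
    "risk analysis", "investment potential", "recommendation",
]

def is_complete(report: dict) -> bool:
    """True iff the generated report covers all six required sections."""
    return all(section in report for section in REQUIRED_SECTIONS)
```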
Dataset Construction
Data collection module : retrieves company info (O) from the Yahoo Finance API and financial indicators, historical prices, and market index data (F, P, M) from the AKShare API, and crawls Eastmoney for announcements (A) and Sina Finance for news (N). GPT‑4o‑mini summarizes announcements and news, then filters and deduplicates them.
Pipeline : Selects 800 CSI800 stocks, covering 2024‑09‑03 to 2024‑11‑05 at weekly intervals. After collection, records lacking financial indicators, with fewer than two news articles, or with summaries shorter than 300 Chinese characters are discarded. The FinRpt‑Gen framework (using GPT‑4o agents) then generates ERRs, producing (s, t, S, R) tuples. A dataset‑enhancement module further refines reports via a rating calibrator, expert‑written corrections, and LLM polishing.
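The three discard rules can be expressed as a simple predicate. The record field names (`financial_indicators`, `news`, `summary`) are assumptions for illustration; only the thresholds come from the paper.

```python
def keep_record(record: dict) -> bool:
    """Apply the paper's three stated filters to one collected record."""
    if not record.get("financial_indicators"):
        return False  # missing financial indicators
    if len(record.get("news", [])) < 2:
        return False  # fewer than two news articles
    if len(record.get("summary", "")) < 300:
        return False  # summary shorter than 300 characters
    return True

# Toy records: only the first survives all three filters.
records = [
    {"financial_indicators": {"pe": 12.3}, "news": ["a", "b"], "summary": "x" * 300},
    {"financial_indicators": {}, "news": ["a", "b"], "summary": "x" * 300},
    {"financial_indicators": {"pe": 9.8}, "news": ["a"], "summary": "x" * 300},
]
kept = [r for r in records if keep_record(r)]
```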
FinRpt‑Gen Multi‑Agent Framework
The framework consists of four layers:
Information extraction agents : news ranking (top‑10 impact news), revenue extraction, balance‑sheet extraction, cash‑flow extraction.
Information analysis agents : financial analysis (summarizes health, profitability, cash flow), news analysis (highlights impact), status analysis (management and development), risk analysis (integrates previous analyses).
Prediction agent : consumes analyses plus historical price P and market index M to predict investment potential and recommendation.
Training methods : Supervised fine‑tuning (SFT) on demonstration data D_{demo} with initial parameters θ_0, followed by reinforcement learning (RL) using the DAPO algorithm (α=0.6, β=0.2, γ=0.2) and a reward function that balances financial, news, company‑market‑industry (CMI), investment, risk, and writing quality.
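A reward that "balances" the six named quality aspects is most simply a weighted sum. The paper's exact combination rule and weights are not given in this summary, so the uniform default below is purely illustrative.

```python
def err_reward(scores, weights=None):
    """Weighted-sum reward over the six quality aspects named in the paper
    (financial, news, CMI, investment, risk, writing). The uniform default
    weighting is an illustrative assumption, not the paper's choice."""
    aspects = ["financial", "news", "cmi", "investment", "risk", "writing"]
    if weights is None:
        weights = {a: 1.0 / len(aspects) for a in aspects}
    return sum(weights[a] * scores[a] for a in aspects)
```

During RL, each generated report would be scored on the six aspects (e.g. by a judge model) and the scalar `err_reward` fed to the policy-gradient update.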
Evaluation System
Two groups of metrics are used:
Basic metrics : completion rate, accuracy, ROUGE‑L, BERTScore, numeric rate.
LLM‑specific metrics : financial numbers (FN), news, CMI, investment, risk, writing quality. A Judge Agent (GPT‑4o) performs pairwise comparisons of model outputs to compute adjusted win rates.
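One common convention for turning pairwise judge verdicts into an "adjusted" win rate is to count ties as half a win. Whether FinRpt's Judge Agent uses exactly this adjustment is an assumption of this sketch.

```python
def adjusted_win_rate(wins: int, ties: int, losses: int) -> float:
    """Pairwise-comparison win rate with ties counted as half a win.
    This specific adjustment is a common convention, assumed here."""
    total = wins + ties + losses
    if total == 0:
        return 0.0
    return (wins + 0.5 * ties) / total
```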
Experiments
Setup
Open‑source models are accessed via the Ollama Python library; closed‑source models via their official APIs. SFT runs on eight NVIDIA RTX 3090 GPUs, RL on eight NVIDIA A100 GPUs. Evaluation uses 100 randomly sampled examples from the FinRpt test set.
Main Results
Basic metric results : The multi‑agent FinRpt‑Gen outperforms single LLMs. Before SFT/RL, Gemini‑2.5Pro and GPT‑4o lead among closed‑source models. After SFT, FinRpt‑Gen surpasses both closed‑source models on almost all metrics, and RL further improves performance to the best level.
LLM evaluation results : Trained models achieve professional‑grade scores comparable to GPT‑4o and exceed all other strong baselines, especially on CMI, news, and FN metrics where they even beat the FinRpt‑Gen (GPT‑4o) variant.
Resource analysis : End‑to‑end ERR generation (from crawling to report) takes roughly 3–4 minutes per report.
Ablation study : Comparing FinRpt‑Gen (Qwen2.5‑7B‑Instruct‑SFT‑RL) with four variants shows the full framework markedly outperforms each ablated version, confirming the necessity of the financial extraction, news extraction, and three analysis agents.
Dataset Quality Study
Human Evaluation
30 randomly selected FinRpt ERRs and 30 expert‑written ERRs were rated by three senior financial analysts on four dimensions. Scores of FinRpt reports were very close to expert reports, indicating high data quality.
Case Study
A representative ERR from the FinRpt dataset demonstrates detailed quantitative financial indicators, forward‑looking strategic analysis, clear investment arguments, and well‑structured formatting, illustrating the dataset’s overall quality.
Conclusion
FinRpt provides the first open‑source benchmark for automatic ERR generation, a comprehensive 11‑metric evaluation suite, and a powerful multi‑agent framework that, after SFT and RL, achieves state‑of‑the‑art performance while remaining efficient enough for practical use.