FinEval‑KR: Diagnosing Knowledge vs. Reasoning Gaps in Financial Large Language Models

FinEval‑KR, a new evaluation framework accepted at EMNLP 2025 and co‑authored by Shanghai University of Finance and Economics and Ant Group, separates knowledge coverage from logical reasoning to explain why financial LLMs often hallucinate on calculation tasks. It introduces the KS, RS, and CS metrics and ranks 18 state‑of‑the‑art models on a rigorously curated finance dataset.


FinEval‑KR is a financial‑domain evaluation framework for large language models (LLMs) introduced in a paper accepted at EMNLP 2025, a top‑tier conference in natural language processing. The authors observed that while modern LLMs such as ChatGPT and DeepSeek can recall financial concepts, they frequently produce nonsensical answers when asked to perform concrete calculations or multi‑step reasoning.

Core Problem

The central issue is that existing evaluations only check whether the final answer is correct, without revealing whether errors stem from missing knowledge ("didn’t learn") or from faulty reasoning ("can’t think").

FinEval‑KR Framework

FinEval‑KR introduces a three‑stage diagnostic mechanism that leverages an AI model’s self‑reflection ability to decouple knowledge and reasoning:

Stage 1 (Question Answering): The target model answers the question without any hints.

Stage 2 (Error Diagnosis): A stronger “judge” model checks the answer; if it is wrong, the judge identifies the knowledge points the target model was missing.

Stage 3 (Knowledge‑augmented Answering): The identified knowledge is fed back to the target model, which answers again with the gaps filled; errors that remain at this stage are attributed to reasoning.
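
The decoupling logic can be expressed as a small driver loop. The sketch below is illustrative only: ask() is a placeholder for whatever chat‑completion call is used, and the prompt wording is not the paper's actual prompts.

def diagnose(question, reference_answer, target_model, judge_model, ask):
    # Stage 1: the target model answers with no hints.
    first_answer = ask(target_model, f"Answer the finance question:\n{question}")

    # Stage 2: a stronger judge model grades the answer and, if it is wrong,
    # lists the knowledge points (formulas, definitions) the target missed.
    missing_knowledge = ask(
        judge_model,
        "Compare the candidate answer with the reference solution. "
        "If the candidate is wrong, list the missing knowledge points.\n"
        f"Question: {question}\nCandidate: {first_answer}\nReference: {reference_answer}")

    # Stage 3: feed the identified knowledge back so the target model can retry
    # with the gaps filled; errors that survive this stage count against reasoning.
    second_answer = ask(
        target_model,
        f"Using these knowledge points:\n{missing_knowledge}\n"
        f"answer the finance question:\n{question}")

    return first_answer, missing_knowledge, second_answer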

Three core metrics are defined to quantify the model’s abilities:

Knowledge Score (KS): Measures the breadth of financial knowledge covered.

Reasoning Score (RS): Evaluates pure logical reasoning after knowledge gaps are filled.

Cognitive Score (CS): A fine‑grained breakdown of RS based on Bloom’s taxonomy, with penalty coefficients for lower‑level cognition.
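
The paper's exact formulas are not reproduced here, but under the reading above a minimal scoring sketch looks roughly like the following; the Bloom weights and the set‑overlap definition of KS are assumptions made for illustration, not the published definitions.

BLOOM_WEIGHTS = {  # assumed penalty coefficients: lower-level cognition counts for less
    "Remember": 0.2, "Understand": 0.4, "Apply": 0.6, "Analyze": 0.8, "Evaluate": 1.0,
}

def knowledge_score(recalled_points, required_points):
    # KS: share of the required knowledge points the model produced on its own.
    return len(set(recalled_points) & set(required_points)) / len(required_points)

def reasoning_score(num_correct_after_hint, num_questions):
    # RS: accuracy once the missing knowledge has been supplied (Stage 3).
    return num_correct_after_hint / num_questions

def cognitive_score(step_results):
    # CS: Bloom-weighted share of correct steps,
    # e.g. step_results = [("Apply", True), ("Evaluate", False)].
    total = sum(BLOOM_WEIGHTS[level] for level, _ in step_results)
    gained = sum(BLOOM_WEIGHTS[level] for level, ok in step_results if ok)
    return gained / total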

Dataset Construction

The FinEval‑KR dataset contains 9,782 high‑quality finance questions derived from nine classic textbooks (e.g., Ross’s *Corporate Finance* and Hull’s *Options, Futures, and Other Derivatives*), covering 22 sub‑domains. Each item is annotated with a Bloom‑taxonomy level (Remember, Understand, Apply, Analyze, Evaluate) and a detailed solution trace.

{
    "instruction": "At the onset of the COVID-19 outbreak in March 2020, although U.S. inflation was holding at the 2% target, the Federal Reserve, concerned about downside economic risks, decided to cut the federal funds rate sharply from 1.5% to 0.25%. Assume the equilibrium real interest rate (r*) at the time is 0.5% ...",
    "gt": "Step 1: Compute the federal funds rate from the Taylor rule: i = r* + \\pi + 0.5(\\pi - \\pi^*) + 0.5(Y - Y^*) ...",
    "point": "Taylor rule; nominal interest rate calculation; monetary easing; real-rate policy decision; downside economic risk.",
    "per_step": "{'Step 1': 'Remember, Understand', 'Step 2': 'Apply', 'Final answer': 'Evaluate'}",
    "classification": "Finance",
    "subcategory": "Money and Banking"
}
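
Records like the one above are easy to load and normalize. The snippet below is a minimal sketch that assumes the data is stored as JSON lines; the file name is hypothetical, and only the field names come from the sample record.

import ast
import json

def load_records(path="fineval_kr.jsonl"):  # hypothetical file name
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            # In the sample, per_step is a Python-literal string mapping each
            # solution step to its Bloom level(s); turn it into a real dict.
            if isinstance(rec.get("per_step"), str):
                rec["per_step"] = ast.literal_eval(rec["per_step"])
            records.append(rec)
    return records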

Experimental Findings

Running the three‑stage pipeline on 18 mainstream LLMs surfaced two key findings:

Knowledge is the foundation: when the missing formulas are supplied in Stage 3, a strong model such as GPT‑4o climbs from 64.5% to 92.5% accuracy, indicating that a large share of its errors stem from unrecalled formulas rather than faulty logic.

Reasoning is the ceiling: Even when all necessary knowledge is supplied, models still make mistakes (e.g., Qwen2.5‑7B has a 14.1% error rate), indicating that logical reasoning limits performance.

Based on KS, RS, and CS, models were grouped into four performance tiers:

Tier 1 – Financial reasoning “gods” (e.g., DeepSeek‑R1, Gemini‑2.5‑Pro, DeepSeek‑V3) with RS > 0.90 and KS > 0.94.

Tier 2 – Generalist “all‑rounders” (e.g., GPT‑4.1, Claude‑3.7‑Sonnet, GPT‑4o) with 0.80 < RS < 0.90.

Tier 3 – Specialized or mid‑size models (e.g., Xuanyuan‑FinX1‑preview, Qwen‑max) with 0.70 < RS < 0.80.

Tier 4 – Basic or lightweight models (e.g., Fin‑R1‑7B, GPT‑3.5‑Turbo) with RS < 0.70.
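
Read literally, the tiering amounts to thresholds on RS, with Tier 1 additionally requiring a high KS. A small helper using the cut‑offs from the list above (the exact boundary handling is an assumption) would be:

def tier(rs, ks=None):
    # Tier cut-offs taken from the list above; Tier 1 also expects KS > 0.94.
    if rs > 0.90 and (ks is None or ks > 0.94):
        return 1
    if rs > 0.80:
        return 2
    if rs > 0.70:
        return 3
    return 4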

Notably, open‑source models like DeepSeek‑R1 outperform many closed‑source giants on reasoning, while specialized finance models still lag behind due to weaker logical capabilities.

Conclusions

FinEval‑KR demonstrates that:

Reasoning is king: Enhancing logical inference via reinforcement learning is essential for financial AGI.

The “application” gap: CS (Cognitive Score) at the Apply level remains the biggest bottleneck; future work should focus on tool use and precise numeric computation.

General models now surpass domain‑specific ones: Multi‑task, multi‑domain training yields stronger reasoning that even specialized fine‑tuning cannot match.

FinEval‑KR thus serves both as a benchmark leaderboard and a diagnostic mirror, highlighting where current LLMs are strong and where they need improvement to become true financial experts.

[Figures: FinEval‑KR framework overview, three‑stage diagnosis, metric hierarchy, model ranking chart, tier breakdown, model comparison, performance scatter, dataset example]