FinSearchComp: ByteDance’s Expert‑Level Financial Search and Reasoning Benchmark for Real‑World Scenarios
FinSearchComp is the first fully open‑source benchmark to evaluate large‑language‑model agents' search and reasoning abilities in realistic financial workflows. It comprises 635 expert‑annotated questions across three task types, was built with 70 finance experts, and shows that web‑enabled models with financial plugins markedly outperform API‑only models.
Background – Existing financial benchmarks either focus on short‑fact retrieval (e.g., BrowseComp) or rely on pre‑collected contexts (e.g., FinQA), which bypass open‑domain search and fail to reflect analysts’ real‑time workflows. Financial analysis demands timely market data, structured historical disclosures, and unstructured news, making search and multi‑source evidence synthesis critical for LLM agents.
Problem Definition – Current benchmarks lack (1) expert‑level, multi‑task coverage of analyst workflows, (2) coverage of market‑specific differences (global vs. Greater China), and (3) rigorous evaluation of dynamic data, source conflicts, and error tolerance.
Method
3.1 Task Design – FinSearchComp defines three analyst‑grade tasks of increasing difficulty (a schematic item sketch follows the list):
T1 – Time‑Sensitive Data Retrieval : answer real‑time queries such as “What was Nvidia’s closing price yesterday?” requiring freshness, multi‑source conflict resolution, and market‑time alignment.
T2 – Simple Historical Query : locate a specific historical fact (e.g., “Starbucks’ total assets on 2020‑09‑27?”) involving fiscal‑year vs. calendar‑year alignment, data revisions, and unit consistency.
T3 – Complex Historical Investigation : synthesize multi‑period data (e.g., “Which month between 2010‑2025 had the largest single‑month S&P 500 gain?”) demanding long‑range retrieval, corporate actions handling, and evidence aggregation.
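To make the task taxonomy concrete, the snippet below sketches how a benchmark item might be encoded. The schema is hypothetical: field names and placeholder answers are illustrative, not the released dataset format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FinSearchItem:
    """Hypothetical item schema; field names are illustrative, not the official format."""
    task: str                   # "T1", "T2", or "T3"
    market: str                 # "global" or "greater_china"
    question: str
    reference: Optional[str]    # static expert answer for T2/T3; None for T1 (fetched live)
    tolerance_pct: float = 0.1  # numeric answers accepted within +/-0.1 %

# Illustrative items mirroring the example questions above; reference values are placeholders.
items = [
    FinSearchItem("T1", "global", "What was Nvidia's closing price yesterday?", None),
    FinSearchItem("T2", "global", "What were Starbucks' total assets on 2020-09-27?",
                  "<expert-verified figure from the FY2020 10-K>"),
    FinSearchItem("T3", "global",
                  "Which month between 2010 and 2025 had the largest single-month S&P 500 gain?",
                  "<expert-verified month>"),
]
```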
3.2 Data Construction & Quality Control
Data Sources : T1 uses real‑time financial APIs; T2 draws from company filings, regulator sites, and professional databases; T3 combines multiple historical sources that require manual synthesis.
Expert Involvement : 70 finance experts (50 annotators, 20 senior reviewers) from top institutions (e.g., JPMorgan, CITIC Securities) authored and validated the questions.
Quality Measures : disambiguation of terms (fiscal vs. calendar year), explicit calculation standards (GAAP vs. non‑GAAP), answer tolerance (±0.1 %), triple‑source cross‑validation, and a blind‑review process where independent experts re‑answer and senior adjudicators resolve discrepancies.
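As a rough illustration of two of these measures, the sketch below checks that three independently sourced figures agree within the ±0.1 % tolerance before a value is accepted as a reference answer; the function names and workflow are a simplification, not the authors' actual QC pipeline.

```python
def within_tolerance(reference: float, candidate: float, tol_pct: float = 0.1) -> bool:
    """True if candidate deviates from reference by at most tol_pct percent."""
    if reference == 0:
        return candidate == 0
    return abs(candidate - reference) / abs(reference) * 100.0 <= tol_pct

def cross_validate(values: list[float], tol_pct: float = 0.1) -> bool:
    """Triple-source cross-validation: every sourced value must agree with the first within tolerance."""
    return all(within_tolerance(values[0], v, tol_pct) for v in values[1:])

# e.g., the same balance-sheet figure pulled from a filing, a regulator site, and a data vendor
assert cross_validate([104.20, 104.20, 104.27])   # illustrative numbers; all within +/-0.1 %
```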
3.3 Evaluation Protocol
Dynamic Answer Handling : T1 answers are fetched via APIs at evaluation time; T2/T3 use static reference answers.
LLM as Judge : an auxiliary LLM applies the expert‑defined scoring rules (numeric error tolerance, multi‑source conflict detection), achieving 95 % agreement with human scores.
Scoring Metric : binary 0‑1 classification – a response that satisfies all rules receives 1, otherwise 0.
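Putting the protocol together, a minimal scoring loop might look like the sketch below: the reference for a T1 item is fetched live, the expert‑defined rules and both answers go to the judge model, and its verdict is collapsed to 0 or 1. `fetch_live_reference` and `call_judge_llm` are hypothetical stand‑ins rather than functions from the paper or any particular API, and the prompt wording is illustrative.

```python
def fetch_live_reference(question: str) -> str:
    """Hypothetical stand-in for querying a real-time financial API at evaluation time (T1 only)."""
    raise NotImplementedError

def call_judge_llm(prompt: str) -> str:
    """Hypothetical stand-in for the auxiliary judge model; returns its raw verdict text."""
    raise NotImplementedError

def score(item, model_answer: str) -> int:
    """Binary 0/1 scoring: 1 only if the judge finds every expert-defined rule satisfied.

    `item` is assumed to carry the fields sketched in the FinSearchItem example above.
    """
    reference = fetch_live_reference(item.question) if item.task == "T1" else item.reference
    prompt = (
        "You are grading a financial QA response.\n"
        f"Question: {item.question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {model_answer}\n"
        f"Rules: numeric values must lie within +/-{item.tolerance_pct} % of the reference; "
        "units, dates, and fiscal-period alignment must match; multi-source conflicts must be resolved.\n"
        "Reply with PASS or FAIL."
    )
    verdict = call_judge_llm(prompt)
    return 1 if verdict.strip().upper().startswith("PASS") else 0
```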
Experiments
4.1 Setup – The benchmark contains 635 questions split into a global (English) subset and a Greater‑China (Chinese) subset. Twenty‑one models were evaluated, including web‑enabled products (e.g., Grok 4, GPT‑5‑Thinking, Doubao) and API‑only services (e.g., Gemini, DeepSeek‑R1). A human baseline of 50 finance experts, who did not participate in dataset construction, completed the tasks using search tools.
4.2 Main Results
Overall Performance : On the global subset, Grok 4 (web) achieved 68.9 % accuracy (expert ceiling 75.0 %); GPT‑5‑Thinking (web) followed at 63.9 %. On the Greater‑China subset, the two strongest products were Yuanbao‑DeepSearch‑R1 (web) at 52.5 % and Doubao (web) at 51.9 %, well below the 88.3 % expert ceiling.
Model Type Gap : Web‑enabled products averaged 40.4 % accuracy, significantly higher than API‑only models (average 8.1 %), confirming the pivotal role of integrated search.
Task‑Difficulty Trend : Accuracy declines from T1 (average 40.8 % for web products) to T2 (average 29.0 %) to T3 (average 8.1 %). Financial plugins boost T1 by 31.9 % and T2 by 24.6 %.
Key Findings :
Models without search score 0 on T1, and rely on stale memorized data for T2/T3, often producing outdated or incorrect answers.
Integrating financial plugins (e.g., Yuanbao platform) markedly improves both timeliness and multi‑source verification.
Regional model advantages reflect training corpus and tool integration: U.S. models excel on the global subset, while Chinese models dominate the Greater‑China subset.
Common failure modes include shallow search (no plugin), using prior‑day prices, mis‑aligning fiscal vs. calendar years, and unit conversion errors.
4.3 Case Studies
Success : Grok 4 (web) queried a financial plugin to obtain Walmart’s real‑time price ($96.08), cross‑validated with Tencent Finance, and correctly answered a T1 query.
Failure : An unnamed model recalled Apple's fiscal‑2021 cash flow from memory (144,266 M USD) without verification, producing a wrong answer.
Complex Task Breakthrough : GPT‑5‑Thinking (web) retrieved Apple’s split data from nasdaq.com, cross‑checked opening ($127.58) and prior closing ($499.23) prices, and correctly computed the price change (‑371.65 USD) for a T3 investigation.
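The arithmetic behind the reported figures is simple, and, assuming the two prices straddle Apple's 4‑for‑1 split, it also illustrates why handling corporate actions matters for T3 questions; the prices below are taken directly from the case study.

```python
prior_close = 499.23    # pre-split close quoted in the case study
opening     = 127.58    # post-split open quoted in the case study
split_ratio = 4         # assumed 4-for-1 split, consistent with the roughly 4x price gap

nominal_change  = opening - prior_close                  # -371.65 USD, the figure the model reported
adjusted_change = opening - prior_close / split_ratio    # ~ +2.77 USD once the split is accounted for

print(round(nominal_change, 2), round(adjusted_change, 2))
```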
The authors conclude that search capability is essential for financial LLM agents, and that FinSearchComp provides a rigorous, expert‑level evaluation framework to guide future development.