How ASDA Generates Structured Financial Reasoning Skills for LLMs Without Fine‑Tuning

The ASDA framework automatically creates modular, version‑controlled financial‑reasoning skill files by iteratively analyzing student‑model failures, clustering errors, and injecting structured guidance into prompts. On the FAMMA benchmark, this yields gains of up to 17.33 points on arithmetic tasks and 5.95 points on non‑arithmetic tasks, far surpassing prior zero‑training methods such as GEPA and ACE.

Bighead's Algorithm Notes

Background

Financial reasoning poses unique challenges for general‑purpose large language models (LLMs) because it requires multi‑step quantitative computation combined with domain‑specific judgment, a combination not covered by pure math or knowledge benchmarks. Existing evaluations (e.g., FAMMA, FinBen) show that state‑of‑the‑art models achieve only 38‑45% overall accuracy across eight financial sub‑domains, and error analyses reveal systematic gaps in domain knowledge and program selection.

Traditional domain‑specific fine‑tuning is costly, locks knowledge into model weights, and depends on supervision resources that many regulated organizations lack. Recent zero‑training prompt‑optimization approaches (GEPA, ACE) improve performance only marginally because they optimize flat text strings without the modularity required for complex multi‑step reasoning.

Problem Definition

The paper addresses three concrete problems: (1) high cost and knowledge lock‑in of domain fine‑tuning; (2) inadequacy of existing zero‑training methods for multi‑step financial reasoning; (3) persistent domain‑knowledge gaps that cause models to misapply financial concepts or select inappropriate procedures.

Method

The ASDA (Automated Skill Distillation and Adaptation) framework operates in a teacher‑student architecture with two stages.

3.1 Skill Warm‑Up

3.1.1 Failure Analysis & Structured Annotation – The teacher model receives each student‑model error (question, wrong answer, reasoning trace, ground‑truth answer) and outputs a structured annotation containing error_type (chosen from ten predefined categories) and root_cause to capture the underlying knowledge gap.
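A minimal sketch of the annotation record the teacher could emit. The category names and the `FailureAnnotation` field layout are illustrative; the paper's taxonomy defines ten categories that are not reproduced here:

```python
from dataclasses import dataclass

# Illustrative subset of error categories; the actual taxonomy
# has ten predefined types (not listed in this article).
ERROR_TYPES = {
    "concept_misapplication",
    "formula_selection",
    "calculation_slip",
}

@dataclass
class FailureAnnotation:
    question_id: str
    error_type: str   # one of the predefined categories
    root_cause: str   # free-text description of the knowledge gap

    def __post_init__(self) -> None:
        # Reject annotations outside the fixed taxonomy.
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error_type: {self.error_type}")

ann = FailureAnnotation(
    "q42", "formula_selection",
    "used simple interest where compound interest was required",
)
```

Constraining `error_type` to a closed set is what makes the downstream clustering step well defined.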

3.1.2 Skill Library Organization – Annotated failures are clustered by (sub‑domain, error type). Each cluster becomes a skill file containing a concise description of the knowledge gap, a “when‑to‑use” condition, a step‑by‑step reasoning program, and an example or code template. A top‑level SKILL.md file maps sub‑domain keywords and failure patterns to skill file paths.
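The clustering step can be sketched as follows. The `skills/<domain>/<error_type>.md` path scheme and the dictionary field names are assumptions; the paper specifies only the cluster key and the SKILL.md index:

```python
from collections import defaultdict

def build_skill_index(annotations):
    """Cluster annotated failures by (sub_domain, error_type) and
    map each cluster to a hypothetical skill-file path. A real
    pipeline would also have the teacher write the file bodies."""
    clusters = defaultdict(list)
    for a in annotations:
        clusters[(a["sub_domain"], a["error_type"])].append(a)
    index = {}
    for (domain, etype), items in clusters.items():
        index[(domain, etype)] = {
            "path": f"skills/{domain}/{etype}.md",  # assumed layout
            "n_failures": len(items),
        }
    return index

anns = [
    {"sub_domain": "fixed_income", "error_type": "formula_selection"},
    {"sub_domain": "fixed_income", "error_type": "formula_selection"},
    {"sub_domain": "derivatives", "error_type": "concept_gap"},
]
index = build_skill_index(anns)
```

Each index entry corresponds to one skill file; the count of supporting failures is a natural signal for prioritizing which files to refine first.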

3.1.3 Skill Selection & Injection – At inference time, a selector reads the problem text, consults SKILL.md, matches relevant sub‑domains and patterns, and may load multiple skill files. The selected skills are dynamically injected into the student’s prompt to guide its reasoning.
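A keyword-based selector over a hypothetical SKILL.md index might look like the sketch below; the real matcher may well be an LLM call rather than substring matching, and the keyword lists are invented for illustration:

```python
def select_skills(problem_text, skill_index):
    """Match sub-domain keywords against the problem text and
    return the paths of all matching skill files (possibly several)."""
    text = problem_text.lower()
    return [entry["path"] for entry in skill_index
            if any(kw in text for kw in entry["keywords"])]

def inject(prompt, skill_bodies):
    """Prepend the selected skill text to the student's prompt."""
    if not skill_bodies:
        return prompt
    return "\n\n".join(skill_bodies) + "\n\n" + prompt

skill_md = [  # hypothetical SKILL.md entries
    {"keywords": ["bond", "yield", "duration"],
     "path": "skills/fixed_income/formula_selection.md"},
    {"keywords": ["option", "black-scholes"],
     "path": "skills/derivatives/concept_gap.md"},
]
paths = select_skills("Compute the modified duration of the bond.", skill_md)
```

Returning a list rather than a single path reflects the paper's note that multiple skill files may be loaded for one problem.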

3.2 Dual‑Phase Iterative Skill Refinement

3.2.1 Evidence Collection & Attribution – For each training question, two evaluations are run: with the current skill library K_t injected and without any skill. Results partition the set into Q_t^+ (correct with skill), Q_t^- (regression caused by skill), and Q_t^{gap} (still incorrect). The teacher attributes each failure to a single skill file, forming evidence sets per file.
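The three-way partition follows directly from the two evaluation passes; the sketch below assumes boolean correctness per question as input:

```python
def partition_questions(results_with, results_without):
    """Partition training questions given correctness with and
    without the current skill library K_t ({qid: bool} maps)."""
    q_plus, q_minus, q_gap = set(), set(), set()
    for qid, with_skill in results_with.items():
        if with_skill:
            q_plus.add(qid)            # Q_t^+: correct with skill
        elif results_without[qid]:
            q_minus.add(qid)           # Q_t^-: regression caused by skill
        else:
            q_gap.add(qid)             # Q_t^gap: incorrect either way
    return q_plus, q_minus, q_gap

qp, qm, qg = partition_questions(
    {"q1": True, "q2": False, "q3": False},
    {"q1": False, "q2": True, "q3": False},
)
```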

3.2.2 Coverage Phase – For each skill file, the teacher diagnoses why its associated Q_t^{gap} cases fail (e.g., missing edge‑case program, overly narrow trigger). The teacher proposes refinements (new or updated patterns). A candidate update is accepted only if the recovery rate exceeds a coverage threshold τ_{cov}.
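The acceptance test for a coverage-phase candidate reduces to a recovery-rate check; the threshold value below is illustrative, not taken from the paper:

```python
def accept_coverage_update(gap_before, gap_after, tau_cov=0.3):
    """Accept a candidate skill refinement only if the recovery
    rate over the file's Q_t^gap cases exceeds tau_cov
    (tau_cov=0.3 is an assumed value for illustration)."""
    if not gap_before:
        return False
    recovered = len(gap_before - gap_after)
    return recovered / len(gap_before) > tau_cov
```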

3.2.3 Safety Phase – The teacher also checks regression cases Q_t^-. Using Q_t^+ as a preservation constraint, the teacher suggests modifications that eliminate regressions while keeping correct behavior above a safety threshold τ_{safe}. After both phases, the updated library K_{t+1} feeds the next iteration.
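A safety-phase acceptance check, sketched under the assumption that preservation is measured as the fraction of Q_t^+ that remains correct; the exact criterion and the τ_safe value are not specified in this article:

```python
def accept_safety_update(q_plus_before, q_plus_after,
                         q_minus_before, q_minus_after,
                         tau_safe=0.95):
    """Accept a safety-phase modification only if it reduces
    regressions while preserving at least tau_safe of the
    previously correct set Q_t^+ (threshold is illustrative)."""
    fewer_regressions = len(q_minus_after) < len(q_minus_before)
    preserved = (len(q_plus_before & q_plus_after)
                 / max(1, len(q_plus_before)))
    return fewer_regressions and preserved >= tau_safe
```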

Experiments

Benchmark – Evaluation uses FAMMA‑Basic (1945 questions from textbooks and professional exams) covering eight sub‑domains, split into arithmetic and non‑arithmetic subsets. The English subset (1378 questions) is stratified 60/40 by difficulty and question type.

Evaluation Protocol – Each question‑answer pair is processed independently. For multiple‑choice items, exact string matching is used; for open‑ended items, a language‑model judge (Qwen‑Max) assesses correctness. Arithmetic questions employ a program‑of‑thought (PoT) approach: the generated program is executed by Qwen‑Turbo, followed by a selection step mapping numeric outputs to answer choices.
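The grading routing described above can be sketched as follows, with a callback standing in for the Qwen‑Max judge (the field names are assumptions):

```python
def grade(item, model_answer, judge=None):
    """Route grading by question type: exact string match for
    multiple-choice items, an LLM-judge callback for open-ended
    items (stand-in for the paper's Qwen-Max judge)."""
    if item["type"] == "multiple_choice":
        return model_answer.strip().upper() == item["gold"].strip().upper()
    return judge(item["question"], item["gold"], model_answer)

mc = {"type": "multiple_choice", "gold": "B", "question": "..."}
assert grade(mc, " b ") is True
```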

Baselines – ASDA is compared against two leading zero‑training methods, GEPA and ACE, run under identical black‑box API constraints (no weight modification). Baseline prompts contain only the standard task description.

Implementation Details – All runs use temperature 0 for reproducibility. Qwen‑Max serves as the open‑ended judge; Qwen‑Turbo handles numeric PoT execution.

Results and Analysis

Main Results – On arithmetic tasks, Haiku 3.5 improves by 8.67 pp after warm‑up and 17.33 pp after two refinements; Haiku 4.5 (baseline 64.67%) gains 5.99 pp. On non‑arithmetic tasks, Haiku 3.5 gains 2.78 pp (warm‑up) and 5.95 pp (second refinement); Haiku 4.5 gains 1.60 pp after refinement. GEPA and ACE achieve only marginal lifts, confirming the limitation of flat‑text optimization.

Iterative Refinement Effect – Warm‑up targets the most frequent failure patterns, delivering the first large jump. Subsequent refinements address residual errors, with accuracy rising from 49.67% (warm‑up) to 58.33% (second refinement) on arithmetic tasks. A third refinement causes regression, indicating over‑fitting to training‑set patterns.

Per‑Question‑Type Gains – Skill injection yields larger improvements on multiple‑choice questions (e.g., +14.39 pp for Haiku 3.5 arithmetic) than on open‑ended questions (+3.73 pp), because structured programs constrain the answer space more effectively.

Regression Issues – Sample analysis shows that injected skills sometimes induce over‑reasoning, altering correct baseline answers. Loading an entire skill bundle for a problem can destabilize already correct predictions.

Qualitative Example – In an arithmetic case where the baseline fails, the injected skill provides a concise program that computes the correct result. The same skill also resolves seven additional fixed‑income questions, demonstrating cross‑problem reusability within a sub‑domain.

Self‑Teaching Ablation – When the student acts as its own teacher, Haiku 3.5 gains 6.33 pp (73% of the 8.67 pp gain with a stronger Sonnet 4.5 teacher). The remaining 2.34 pp reflects the teacher’s contribution, indicating that most improvement stems from the structured distillation process rather than superior teacher knowledge.

Cross‑Model Transfer – Applying Haiku 3.5‑derived skills to Haiku 4.5 causes a net regression of 2.33 pp, mainly due to a 6.21 pp drop on open‑ended questions, while multiple‑choice performance improves modestly. Generating skills per model yields the best results, suggesting skills are model‑specific remedies.

Discussion

The self‑teaching results show that ASDA’s gains arise primarily from externalizing latent domain knowledge through systematic failure enumeration, not from privileged teacher expertise. Skills therefore capture model‑specific failure patterns rather than universal domain facts; reusing a weaker model’s skills on a stronger model can hurt performance.

For regulated industries that rely on black‑box LLM APIs, ASDA offers a practical, auditable adaptation path: run the distillation pipeline once on a labeled domain dataset, version‑control the generated skill files, and regenerate them when the underlying model is upgraded. The approach excels when failure modes cluster cleanly (e.g., arithmetic reasoning with clear programs) and is less effective when errors are diffuse (non‑arithmetic tasks), where regression risk rises.

Limitations – Experiments are limited to the FAMMA benchmark and Claude‑series models; it remains unclear how error‑type taxonomy, skill format, and refinement dynamics transfer to other domains. OCR artifacts in the FAMMA texts may introduce spurious patterns that do not generalize to cleaner corpora.

Tags: LLM adaptation · skill generation · ASDA · FAMMA benchmark · financial reasoning · zero‑training
Written by

Bighead's Algorithm Notes

Focused on AI applications in the fintech sector
