FinQA Competition Winning Model by Ant Risk AI: Architecture, Dataset, and Experimental Results
Ant Risk AI’s team won the FinQA competition with a model that combines a retriever and a program generator, supported by detailed dataset analysis, a domain-specific language for calculation programs, and extensive experiments demonstrating strong execution and program accuracy on financial numerical reasoning tasks.
Background Financial statements contain massive amounts of data, making it difficult to automatically analyze a company's financial health. The FinQA dataset was introduced to address this challenge by providing a question‑answering benchmark that includes both tabular data and textual explanations, along with annotated calculation steps.
Dataset FinQA consists of 6,251 training, 883 validation, and 1,147 test examples, plus a private test set without published answers. Each example contains heterogeneous tables and text, and programs are built from six arithmetic operations (add, subtract, multiply, divide, greater, exp) plus four table‑aggregation operations (table‑max, table‑min, table‑sum, table‑average). Statistics show that 23.42% of questions can be answered using only text, 62.43% using only tables, and the remaining 14.15% need both. Most calculations involve a single step, and the most frequent operations are divide (45.29%) and subtract (28.20%).
Domain Specific Language (DSL) The FinQA paper defines a DSL composed of the aforementioned arithmetic and table‑aggregation operations. A program is a sequence of steps; each step may reference earlier results using the token #i (where i is the step index). The DSL enables explicit, interpretable reasoning traces.
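As an illustration, a program such as subtract(5829, 5735), divide(#0, 5735) can be executed step by step. The operation names follow the DSL described above, but this tiny interpreter is our own sketch, not the competition code:

```python
# Minimal interpreter for FinQA-style DSL programs (illustrative sketch).
# Each step is (op, arg1, arg2); the string "#i" refers to the result of step i.

OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
    "greater": lambda a, b: a > b,
    "exp": lambda a, b: a ** b,
}

def run_program(steps):
    """Execute a list of (op, arg1, arg2) steps; args may be numbers or '#i' references."""
    results = []
    for op, a, b in steps:
        # Resolve references to earlier step results.
        a = results[int(a[1:])] if isinstance(a, str) and a.startswith("#") else a
        b = results[int(b[1:])] if isinstance(b, str) and b.startswith("#") else b
        results.append(OPS[op](a, b))
    return results[-1]

# Example: (5829 - 5735) / 5735 — a typical "percentage change" question.
print(run_program([("subtract", 5829, 5735), ("divide", "#0", 5735)]))
```

The explicit step list is what makes the reasoning trace interpretable: each intermediate value can be inspected and attributed to a retrieved number.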
Evaluation Two metrics are used: execution accuracy (whether the final numeric answer matches) and program accuracy (whether the generated program matches the reference). Execution accuracy can be inflated by “lucky” correct numbers, while program accuracy can be deflated when multiple correct programs exist. The true model performance lies between these two scores.
Equality of symbolic expressions is checked with sympy.simplify, so programs that are mathematically equivalent are considered equal.
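A minimal sketch of that equivalence check, assuming SymPy is available (the competition's exact normalization code is not shown here): two expressions are treated as equal if their symbolic difference simplifies to zero.

```python
from sympy import simplify, sympify

def programs_equivalent(expr_a: str, expr_b: str) -> bool:
    """Treat two programs as equal if their symbolic difference simplifies to zero."""
    return simplify(sympify(expr_a) - sympify(expr_b)) == 0

# (a - b) / b and a/b - 1 are different programs but the same computation.
print(programs_equivalent("(a - b) / b", "a/b - 1"))  # True
```

This is exactly the situation that deflates raw program accuracy: both forms answer a "percentage change" question correctly, yet only one matches the reference token-for-token.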
Distribution Shift Adversarial validation revealed a distribution gap between the test set and the other splits (AUC 0.85 on test vs 0.91 on train). To mitigate this, the authors re‑split the data, reducing the gap between validation and test.
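Adversarial validation works by labeling each example with its split and checking how well a discriminator can tell the splits apart: an AUC near 0.5 means they are indistinguishable, while a high AUC signals shift. The toy version below (hypothetical feature and data, not the authors' classifier) computes a rank-based AUC for a single feature:

```python
# Toy adversarial-validation sketch (illustrative; the real setup trained a
# classifier on text features). Label examples 1 if from the test split,
# 0 otherwise, and measure how well a feature separates the two groups.

def rank_auc(scores, labels):
    """AUC = probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical feature: document length. If test documents are systematically
# longer, a high AUC flags a distribution shift worth re-splitting for.
train_lengths = [120, 135, 110, 140, 128]   # label 0
test_lengths = [180, 175, 190, 165, 170]    # label 1
scores = train_lengths + test_lengths
labels = [0] * 5 + [1] * 5
print(rank_auc(scores, labels))  # 1.0 here: the toy splits are perfectly separable
```

Re-splitting so that this AUC drops toward 0.5 is what makes validation scores predictive of test scores.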
Model Overview The system consists of two stages: a retriever that selects sentences containing the required numbers, and a program generator that produces the calculation steps.
Retriever
Baseline – Concatenates the question with each candidate sentence and feeds it to BERT for binary classification. Positive samples are sentences containing the needed numbers; negatives are sampled at a 1:3 ratio. Scores are used to rank sentences.
Prompt‑learning model – Uses OpenPrompt with a MixedTemplate and ManualVerbalizer on a T5‑large backbone, again sampling negatives 1:3. The same ranking strategy as the baseline is applied.
Context model – Concatenates the question with up to n consecutive sentences separated by [SEP] tokens (e.g., [CLS] Question [SEP0] Sentence1 [SEP1] Sentence2 … ). Two binary classifiers predict whether the sentence before or after each [SEP] contains the required number. A sliding window of size 8 with stride 4 is used during training.
Ensemble – The three retriever models are combined via a logistic‑regression classifier trained on the validation set, using their sentence scores as features.
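The context model's input construction can be sketched as follows. This helper is hypothetical; the indexed [SEP] tokens, window size 8, and stride 4 follow the description above:

```python
def build_windows(sentences, question, size=8, stride=4):
    """Slide a window over the document and build
    '[CLS] question [SEP0] s1 [SEP1] s2 ...' inputs for the context retriever."""
    windows = []
    for start in range(0, max(1, len(sentences) - size + stride), stride):
        chunk = sentences[start:start + size]
        parts = ["[CLS]", question]
        for i, sent in enumerate(chunk):
            # Each [SEPi] sits between two sentences; the two binary classifiers
            # predict whether the sentence before/after it holds the needed number.
            parts += [f"[SEP{i}]", sent]
        windows.append(" ".join(parts))
    return windows

sents = [f"sentence{i}" for i in range(10)]
for w in build_windows(sents, "What was the revenue growth?"):
    print(w)
```

Overlapping windows (stride smaller than size) ensure every sentence is scored with context on both sides at least once.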
Program Generator
Baseline – Mirrors the FinQA paper: an encoder (BERT/RoBERTa) feeds token embeddings to an LSTM decoder with attention over the encoder output and decoding history. Training freezes the encoder for a few epochs, then fine‑tunes jointly. Gradient clipping, warm‑up, linear scheduler, and AdamW with weight decay are used.
Remove‑Redundancy model – To reduce over‑parameterization on the small dataset, this model only feeds numeric values and table row names (the elements needed for aggregation) to the decoder, ignoring other text tokens.
Transformer decoder model – Replaces the LSTM with a 4‑layer transformer decoder, keeping the same encoder. Attention over encoder outputs guides the next token prediction, and parallel training speeds up convergence.
Ensemble – Predictions from the three generators are combined by voting, weighted by each model’s execution accuracy on the private test set. Two voting schemes were explored (whole‑program vote vs step‑wise vote); whole‑program voting yielded the best results.
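Whole-program voting can be sketched like this. The programs and per-model weights below are hypothetical stand-ins for each generator's measured execution accuracy:

```python
from collections import defaultdict

def vote_program(predictions, weights):
    """Pick the program whose supporting models carry the most accuracy-weighted votes."""
    tally = defaultdict(float)
    for program, weight in zip(predictions, weights):
        tally[program] += weight  # whole-program vote: the full sequence is the key
    return max(tally, key=tally.get)

# Three generators; the two agreeing models outweigh the single dissenter.
preds = ["subtract(5829,5735),divide(#0,5735)",
         "divide(5829,5735)",
         "subtract(5829,5735),divide(#0,5735)"]
accs = [0.68, 0.70, 0.67]  # hypothetical per-model execution accuracies
print(vote_program(preds, accs))
```

Voting on the whole program, rather than step by step, avoids stitching together steps from different models that are individually plausible but jointly inconsistent.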
The template used by the prompt‑learning retriever is: {"placeholder":"fact"} Question: {"placeholder":"question"} {"soft": "Is the relevant data of the problem in the previous article?"} The answer was {"mask"}.
Experimental Results
Retriever – Trained on the training split, the best model was selected on validation and then evaluated on test. Recall@k tables are reported (figures omitted for brevity). The ensemble (stack) improves recall over individual models.
Program Generator – Trained on data retrieved by both stack and stack‑positive strategies. On validation and private test, the ensemble achieves 71.93% execution accuracy and 67.03% program accuracy, improving over single models by ~4%.
Ablation – Removing the adversarial‑validation re‑split degrades performance, confirming that mitigating distribution mismatch between validation and test is beneficial.
Conclusion The Ant Risk AI team achieved first place in the FinQA competition with a combined retriever‑generator architecture and model ensemble, yet there remains a substantial gap to human expert performance (91.16% execution accuracy, 87.49% program accuracy). This highlights the need for further research on numerical reasoning in NLP.